Proposal: Add automated drift detection CI workflow #257

@OSkull32

Describe the Feature

The feature is a new automated CI/CD pipeline designed to proactively detect Terraform configuration drift in the EKS infrastructure managed by this module.

It consists of a scheduled GitHub Actions workflow that will:

  1. Automatically deploy the reference example of this module to a test environment.
  2. Execute a Terraform plan to compare the actual state of the deployed infrastructure against the state defined by the module's code.
  3. Serve as an early warning system by failing the workflow and generating a notification if any divergence (drift) is detected.

Expected Behavior

  1. On Schedule: The new workflow (e.g., drift-detection) triggers automatically based on a defined schedule (e.g., weekly).
  2. Successful Deployment: The workflow checks out the code, sets up Terraform, and successfully deploys the module's example configuration (examples/complete) to a test AWS account, using the existing CI secrets and practices.
  3. Drift Check: The workflow executes terraform plan -detailed-exitcode. The command exits with code 0 if no drift is detected.
  4. Green Check: If no drift is found, the workflow completes successfully, providing a green check mark and confidence that the infrastructure state is correct.
  5. Drift Detected - Fail & Alert: If drift is detected (exit code 2), the workflow fails conspicuously. This failure should be configured to send a notification to a Slack channel (via existing integrations).
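
The exit-code handling in steps 3 to 5 could be sketched roughly as below. With `-detailed-exitcode`, `terraform plan` exits with 0 when the plan succeeds with no changes, 1 when the plan itself fails, and 2 when the plan succeeds and changes are present. The helper name `drift_status` and its labels are hypothetical, chosen only for illustration; they are not part of Terraform or of this module's CI.

```shell
#!/bin/sh
# Hypothetical helper: map a `terraform plan -detailed-exitcode`
# exit code to a short status label.
drift_status() {
  case "$1" in
    0) echo "in-sync" ;;  # plan succeeded, no changes
    1) echo "error" ;;    # the plan itself failed
    2) echo "drift" ;;    # plan succeeded, changes are present
    *) echo "unknown" ;;  # not an exit code documented for this flag
  esac
}

# Possible use inside a workflow step (left commented out so the
# function can be sourced without Terraform installed):
#   set +e
#   terraform plan -detailed-exitcode -input=false
#   status=$(drift_status "$?")
```

Treating exit code 1 separately from exit code 2 keeps a broken plan (for example, expired credentials) from being reported as drift.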

Use Case

Terraform configuration drift is a significant operational risk, especially for critical infrastructure like EKS clusters. Drift can occur due to:

  1. Manual changes made directly in the AWS console.
  2. Other scripts or tools modifying resources outside of Terraform.
  3. Changes in the AWS provider behavior or APIs.

For a widely-used module like this, drift can lead to:

  1. Unexpected costs (e.g., an accidentally changed instance type).
  2. Security risks (e.g., a security group rule was manually opened).
  3. Deployment failures (e.g., a future terraform apply fails because it tries to revert a manual change that the team relies on).

Describe Ideal Solution

I propose adding a new GitHub Actions workflow (e.g., .github/workflows/drift-detection.yml) that performs the following:

  1. Schedule: Runs on a regular schedule (e.g., once a week via schedule:).
  2. Deployment: Uses the module's own CI/CD setup (like the existing test workflow) to deploy the example configuration (examples/complete) to a test AWS account.
  3. Detection: Runs terraform plan -detailed-exitcode. Exit code 2 indicates drift; exit code 1 indicates that the plan itself failed and should be reported as an error rather than as drift.
  4. Notification: If drift is detected, the workflow fails. This failure can be configured to send a notification to Slack or create a GitHub Issue, alerting maintainers that the deployed infrastructure no longer matches the Terraform configuration.
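
As a rough sketch of how the detection and notification steps might be wired together, the script below wraps the plan command, records the result for later workflow steps, and fails on drift or error. The function name `run_drift_check` and the output key `drift` are hypothetical. `GITHUB_OUTPUT` is the file path GitHub Actions provides for step outputs; outside Actions the sketch falls back to `/dev/null`.

```shell
#!/bin/sh
# Hypothetical wrapper: run the command given as arguments (intended:
# terraform plan -detailed-exitcode), record whether drift was found,
# and return non-zero on drift or on a failed plan.
run_drift_check() {
  code=0
  "$@" || code=$?   # capture the exit code even when `set -e` is active

  case "$code" in
    0) drift=false ;;
    2) drift=true ;;
    *) echo "plan failed with exit code $code" >&2
       return 1 ;;
  esac

  # Expose the result to later steps as a step output named "drift".
  echo "drift=$drift" >> "${GITHUB_OUTPUT:-/dev/null}"

  # Succeed only when no drift was found, so the step fails on drift.
  [ "$drift" = false ]
}

# Intended call inside the workflow step (commented out here):
#   run_drift_check terraform plan -detailed-exitcode -input=false
```

A later step could then read the `drift` output of this step, or simply run under `if: failure()`, to send the Slack message or open a GitHub Issue. Which of those fits best depends on the integrations the repository already has.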

Alternatives Considered

Relying on users to implement this themselves:

Why not chosen: Most users won't do this. By building it directly into the module's CI, we provide immense value "out of the box" and set a best-practice standard for the entire community using this module.

Automated remediation (terraform apply):

Why not chosen: Automatically applying changes in a shared module's CI is far too dangerous and could itself cause outages. Manual review upon detection is the only safe approach.

Additional Context

I am prepared to contribute the code for this feature via a Pull Request if the maintainers are open to the idea.
