Description
Describe the Feature
The feature is a new automated CI/CD pipeline designed to proactively detect Terraform configuration drift in the EKS infrastructure managed by this module.
It consists of a scheduled GitHub Actions workflow that will:
- Automatically deploy the reference example of this module to a test environment.
- Execute a Terraform plan to compare the actual state of the deployed infrastructure against the state defined by the module's code.
- Serve as an early warning system by failing the workflow and generating a notification if any divergence (drift) is detected.
Expected Behavior
- On Schedule: The new workflow (e.g., drift-detection) triggers automatically based on a defined schedule (e.g., weekly).
- Successful Deployment: The workflow checks out the code, sets up Terraform, and successfully deploys the module's example configuration (examples/complete) to a test AWS account, using the existing CI secrets and practices.
- Drift Check: The workflow executes terraform plan -detailed-exitcode. The command exits with code 0 if no drift is detected, code 2 if the plan contains changes (drift), and code 1 if the plan itself fails.
- Green Check: If no drift is found, the workflow completes successfully, providing a green check mark and confidence that the infrastructure state is correct.
- Drift Detected - Fail & Alert: If drift is detected (exit code 2), the workflow fails visibly, and the failure sends a notification to a Slack channel (via existing integrations).
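The exit-code handling described above could look roughly like the step below. This is only a sketch under assumptions: the step name is illustrative, and it assumes the raw terraform binary is on the PATH (for example hashicorp/setup-terraform with terraform_wrapper: false), because a wrapper script may not pass exit code 2 through to the shell.

```yaml
# Fragment of a job's `steps:` list (illustrative, not the module's actual CI).
- name: Check for drift
  working-directory: examples/complete
  run: |
    # The default bash shell on GitHub runners runs with `-e`, so capture the
    # exit code instead of letting a non-zero status abort the step early.
    code=0
    terraform plan -detailed-exitcode -input=false -no-color || code=$?
    case "$code" in
      0) echo "No drift detected." ;;
      2) echo "::error::Drift detected: the plan contains changes." ;;
      *) echo "::error::terraform plan itself failed (exit code $code)." ;;
    esac
    # Re-raise the original code so drift (2) and errors (1) both fail the job.
    exit "$code"
```

Keeping exit codes 1 and 2 apart in the log makes it easier to tell real drift from a broken run (expired credentials, provider download failure, and so on).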
Use Case
Terraform configuration drift is a significant operational risk, especially for critical infrastructure like EKS clusters. Drift can occur due to:
- Manual changes made directly in the AWS console.
- Other scripts or tools modifying resources outside of Terraform.
- Changes in the AWS provider behavior or APIs.
For a widely used module like this, drift can lead to:
- Unexpected costs (e.g., an accidentally changed instance type).
- Security risks (e.g., a security group rule that was manually opened).
- Deployment failures (e.g., a future terraform apply fails because it tries to revert a manual change that the team relies on).
Describe Ideal Solution
I propose adding a new GitHub Actions workflow (e.g., .github/workflows/drift-detection.yml) that performs the following:
- Schedule: Runs on a regular schedule (e.g., once a week via schedule:).
- Deployment: Uses the module's own CI/CD setup (like the existing test workflow) to deploy the example configuration (examples/complete) to a test AWS account.
- Detection: Runs terraform plan -detailed-exitcode. Exit code 2 indicates drift; exit code 1 means the plan itself failed and should be treated as an error, not as drift.
- Notification: If drift is detected, the workflow fails. This failure can be configured to send a notification to Slack or create a GitHub Issue, alerting maintainers that the deployed infrastructure no longer matches the module's Terraform configuration.
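A minimal sketch of what such a workflow file might look like is shown below. The cron expression, AWS region, and all secret names (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, SLACK_WEBHOOK_URL) are placeholders I made up for illustration; the real workflow would reuse whatever credentials and conventions the existing test workflow already has.

```yaml
# .github/workflows/drift-detection.yml (sketch, not a tested implementation)
name: drift-detection

on:
  schedule:
    - cron: "0 6 * * 1"   # Mondays 06:00 UTC; the schedule is an arbitrary choice
  workflow_dispatch:       # also allow manual runs

permissions:
  contents: read

jobs:
  drift:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: examples/complete
    steps:
      - uses: actions/checkout@v4

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}          # hypothetical secret name
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}  # hypothetical secret name
          aws-region: us-east-2                                        # hypothetical region

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false   # let the shell see terraform's raw exit code

      - name: Deploy example
        run: |
          terraform init -input=false
          terraform apply -auto-approve -input=false

      - name: Detect drift
        # Exit code 2 (changes present) or 1 (error) fails this step and the job.
        run: terraform plan -detailed-exitcode -input=false -no-color

      - name: Notify Slack
        if: failure()
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}   # hypothetical secret name
        run: |
          curl -sS -X POST -H 'Content-Type: application/json' \
            --data "{\"text\":\"Drift check failed: ${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}/actions/runs/${GITHUB_RUN_ID}\"}" \
            "$SLACK_WEBHOOK_URL"
```

The state backend and any teardown of the test environment are left out on purpose, since those depend on how the maintainers want the test account managed.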
Alternatives Considered
Relying on users to implement this themselves:
Why not chosen: Most users won't do this. Building it directly into the module's CI provides drift detection "out of the box" and sets a best-practice standard for the entire community using this module.
Automated remediation (terraform apply):
Why not chosen: Automatically applying changes in a shared module's CI is far too dangerous and could itself cause outages. Manual review upon detection is the only safe approach.
Additional Context
I am prepared to contribute the code for this feature via a Pull Request if the maintainers are open to the idea.