AWS monitor heartbeat
Objective: Continuously use gs-netcat to test all GSRN servers (inside and outside of AWS) and notify the admin on failure. This is a fully functional test. For non-functional tests and for general system metrics (like CPU and memory usage) use NetData instead.
The functional test heartbeat.sh is embedded within a docker image.
- Use AWS Elastic Container Registry (ECR) to store the docker image
- Use AWS Fargate to run the docker image
- Use AWS CloudWatch to send a notification if GSRN goes bad
We use AWS region us-east-2.
Select IAM -> Policies.
- Select Create Policy.
- Under Service select Elastic Container Registry.
- Select All Elastic Container Registry actions (ecr:*).
- Under Resources select Specific and Add ARN, and for Repository name select Any.
- Click Next: Tags and Next: Review.
- Under Name specify ECR_FullAccess (or any other name you like).
- Click Create policy.
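The console steps above can also be sketched with the AWS CLI. This is a minimal sketch, not the author's method: the function names and the file name ecr_fullaccess.json are hypothetical, and the resource ARN assumes the policy is scoped to region us-east-2 and to your own account ID.

```shell
# Write a policy document that mirrors the console choices above.
# Assumption: the ARN is limited to us-east-2 and the given account ID.
write_ecr_policy() {
    account_id=$1
    out_file=$2
    cat > "$out_file" <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ecr:*",
      "Resource": "arn:aws:ecr:us-east-2:${account_id}:repository/*"
    }
  ]
}
EOF
}

# Create the policy from the generated document (hypothetical file name).
create_ecr_policy() {
    write_ecr_policy "$1" ecr_fullaccess.json &&
    aws iam create-policy \
        --policy-name ECR_FullAccess \
        --policy-document file://ecr_fullaccess.json
}
```

Calling create_ecr_policy with your account ID as the only argument writes the document and creates the policy.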
Create a new user under IAM -> Users.
- Select Add users and name the new user fargate_user.
- Select Programmatic access and AWS Management Console Access.
- De-select User must create a new password at next sign-in.
- Click Next: Permissions.
- Select Attach existing policies directly and select AmazonECS_FullAccess, AmazonEC2ContainerRegistryPowerUser and ECR_FullAccess.
- Click Next: Tags, then Next: Review and then Create user.
- Note down the Account ID, Access key ID, Secret access key and Password.
Select Elastic Container Registry.
- Select Create repository, fill in gsrn_heartbeat as the name of the repository and leave everything else at its default.
- Note down the URI (e.g. [account ID].dkr.ecr.us-east-2.amazonaws.com/gsrn_heartbeat).
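As a rough command-line equivalent of the repository step, the sketch below creates the repository and rebuilds the URI from its parts. The function names ecr_uri and create_heartbeat_repo are hypothetical helpers, not part of the original setup.

```shell
# Build the repository URI from account ID, region and repository name.
ecr_uri() {
    printf '%s.dkr.ecr.%s.amazonaws.com/%s\n' "$1" "$2" "$3"
}

# Create the repository with default settings and print its URI.
create_heartbeat_repo() {
    aws ecr create-repository \
        --repository-name gsrn_heartbeat \
        --region us-east-2 \
        --query 'repository.repositoryUri' \
        --output text
}
```

The printed URI should match what ecr_uri returns for your account ID, us-east-2 and gsrn_heartbeat.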
Sign in to AWS using the credentials of fargate_user:
aws configure
Retrieve the AWS login password and ready docker to sign in to the ECR Registry:
aws ecr get-login-password --region us-east-2 | docker login --username AWS --password-stdin [account ID].dkr.ecr.us-east-2.amazonaws.com
Create the docker image:
docker build -t gsrn_hb .
Tag the docker image with the URI:
docker tag gsrn_hb [account ID].dkr.ecr.us-east-2.amazonaws.com/gsrn_heartbeat
Push the tagged image:
docker push [account ID].dkr.ecr.us-east-2.amazonaws.com/gsrn_heartbeat
With your AWS user (not fargate_user):
- Go to Elastic Container Service (ECS).
- Select Create Cluster and Networking only.
- Name the cluster fargate-gsrn-heartbeat and leave the rest as is.
- Select Create new Task Definition and Fargate.
- Enter the name for the task (GSRN heartbeat).
- Select 0.5GB for Task Memory and 0.25 vCPU for Task CPU.
Under Container definition select Add Container.
- We use GSRN-Heartbeat as Container name.
- Enter the URI of the docker image: [account ID].dkr.ecr.us-east-2.amazonaws.com/gsrn_heartbeat
- Select Linux for Operating system family.
- Under Command enter gs1.thc.org gs2.thc.org gs3.thc.org gs4.thc.org gs5.thc.org.
- Select Add and then Create.
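The task and container settings above map onto a task definition document that can be registered from the command line. The following is a sketch under several assumptions that the steps above do not confirm: the family name GSRN-heartbeat (chosen to match the log group /ecs/GSRN-heartbeat used later), the execution role name ecsTaskExecutionRole, the awslogs settings, and the hypothetical function and file names. 0.25 vCPU and 0.5GB correspond to 256 CPU units and 512 MiB.

```shell
# Write a Fargate task definition mirroring the console settings above.
# Assumptions: family name, execution role name and log settings.
write_task_definition() {
    account_id=$1
    out_file=$2
    cat > "$out_file" <<EOF
{
  "family": "GSRN-heartbeat",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "256",
  "memory": "512",
  "runtimePlatform": { "operatingSystemFamily": "LINUX" },
  "executionRoleArn": "arn:aws:iam::${account_id}:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "name": "GSRN-Heartbeat",
      "image": "${account_id}.dkr.ecr.us-east-2.amazonaws.com/gsrn_heartbeat",
      "essential": true,
      "command": ["gs1.thc.org", "gs2.thc.org", "gs3.thc.org", "gs4.thc.org", "gs5.thc.org"],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/GSRN-heartbeat",
          "awslogs-region": "us-east-2",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}
EOF
}

# Register the task definition (hypothetical file name).
register_heartbeat_task() {
    write_task_definition "$1" gsrn_heartbeat_task.json &&
    aws ecs register-task-definition \
        --cli-input-json file://gsrn_heartbeat_task.json \
        --region us-east-2
}
```

Call register_heartbeat_task with your account ID. The log group and the execution role must already exist for the task to start.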
Select the GSRNHeartbeat:1 task and under Action select Run Task.
- Set Launch Type to FARGATE.
- Select any Cluster VPC and any Subnets.
- Select Run Task.
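Running the task can be sketched from the command line as well. The function name run_heartbeat_task is hypothetical, and the task definition and subnet ID are passed in as arguments because they depend on your account. The sketch assumes a public IP is assigned so the task can reach the GSRN servers.

```shell
# Run the heartbeat task once on Fargate.
# $1: task definition (family:revision), $2: subnet ID of the chosen VPC.
run_heartbeat_task() {
    task_def=$1
    subnet_id=$2
    aws ecs run-task \
        --cluster fargate-gsrn-heartbeat \
        --launch-type FARGATE \
        --task-definition "$task_def" \
        --network-configuration "awsvpcConfiguration={subnets=[$subnet_id],assignPublicIp=ENABLED}" \
        --region us-east-2
}
```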
The steps are:
- Create a Topic under AWS Simple Notification Service (SNS) and add an email address to it.
- Create a filter that matches a string from the log file and records every match to a metric.
- Create an alarm if a change in the metric is detected.
Go to Simple Notification Service -> Create Topic.
- Use Standard.
- Set the Name to GSRN-Heartbeat.
- Click Create topic at the bottom right.
Create a subscription:
- Set Protocol to Email.
- Under Endpoint enter the email address. I use root@thc.org.
- Select Create Subscription at the bottom right.
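The topic and subscription steps can be approximated with two AWS CLI calls. This is a sketch with a hypothetical function name. A topic created this way is a standard topic, and the email address has to confirm the subscription before any notification is delivered.

```shell
# Create the SNS topic and subscribe an email address to it.
# $1: email address that receives the alarm.
create_heartbeat_topic() {
    email=$1
    topic_arn=$(aws sns create-topic \
        --name GSRN-Heartbeat \
        --region us-east-2 \
        --query TopicArn \
        --output text) || return 1
    aws sns subscribe \
        --topic-arn "$topic_arn" \
        --protocol email \
        --notification-endpoint "$email" \
        --region us-east-2
}
```

For example, create_heartbeat_topic root@thc.org creates the topic and sends the confirmation email.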
It is not directly possible to match a pattern in a log file and raise an alarm (e.g. send an email). Instead a pattern is matched and a metric is created. Then an alarm can be triggered when the metric changes.
The heartbeat docker instance creates a log entry with "OK_COUNT=" every 60 seconds (unless GSRN is failing). Create a metric entry every time this pattern is encountered (1 every 60 seconds).
Go to Logs -> Log Group.
- Select the log group /ecs/GSRN-heartbeat.
- Click Action -> Create Metric Filter.
- Under Filter Pattern enter OK_COUNT= and click Next.
- Under Filter Name write OK-COUNT-FILTER.
Under Metric Details set:
- Metric namespace to GSRN Heartbeat
- Metric name to OK-COUNT
- Metric value to 1
- Default value to 0
- Click on Next and then Create metric filter at the bottom right.
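The metric filter can also be created from the command line. The sketch below uses a hypothetical function name and passes the metric settings as JSON. One assumption differs from the console steps: the pattern is wrapped in double quotes, because the CloudWatch Logs filter syntax expects terms containing characters other than letters, digits and underscores (here the equals sign) to be quoted.

```shell
# Create a metric filter that records 1 for every "OK_COUNT=" log entry.
create_ok_count_filter() {
    aws logs put-metric-filter \
        --log-group-name /ecs/GSRN-heartbeat \
        --filter-name OK-COUNT-FILTER \
        --filter-pattern '"OK_COUNT="' \
        --metric-transformations '[{"metricName":"OK-COUNT","metricNamespace":"GSRN Heartbeat","metricValue":"1","defaultValue":0}]' \
        --region us-east-2
}
```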
Go to Alarm -> All Alarms -> Create alarm -> Select metric.
- Under Custom namespace click on GSRN Heartbeat.
- Click on Metrics with no dimensions.
- Select OK-COUNT.
- Click on Select metric at the bottom right.
- Under Conditions select Lower and under than... write 1.
- Under Additional configuration -> Missing data treatment select Treat missing data as bad (breaching threshold).
- Select Next at the bottom right.
- Under Send a notification to... select GSRN-Heartbeat.
- Click Next at the bottom right.
Under Add name and description:
- Set Alarm Name to FAILED GSRN HEARTBEAT.
- Set Alarm description to GSRN Server may be down. Check Cloudwatch log.
- Click Next at the bottom right and then Create alarm.
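A command-line sketch of the alarm follows. The function name is hypothetical. The statistic (Sum), the period (300 seconds) and the single evaluation period are assumptions, since the steps above do not state them; adjust them to match what the console shows for your alarm.

```shell
# Raise an alarm when fewer than one OK_COUNT entry is seen per period.
# $1: ARN of the GSRN-Heartbeat SNS topic.
# Assumptions: Sum statistic, 300 second period, 1 evaluation period.
create_heartbeat_alarm() {
    topic_arn=$1
    aws cloudwatch put-metric-alarm \
        --alarm-name "FAILED GSRN HEARTBEAT" \
        --alarm-description "GSRN Server may be down. Check Cloudwatch log." \
        --namespace "GSRN Heartbeat" \
        --metric-name OK-COUNT \
        --statistic Sum \
        --period 300 \
        --evaluation-periods 1 \
        --threshold 1 \
        --comparison-operator LessThanThreshold \
        --treat-missing-data breaching \
        --alarm-actions "$topic_arn" \
        --region us-east-2
}
```

With missing data treated as breaching, the alarm also fires when the task stops logging altogether.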
TODO:
- How to allow task to have NAT access to Internet without assigning a public IP and without creating my own NAT GW?
- How to include the log entry that triggered the metrics alarm?
Helpful links: