ASCENDING Technical Blog

Spot Instance 2 minutes termination

Written by Celeste Shao | Aug 9, 2017 9:10:34 PM

We have previously written about using spot instances to save significantly on EC2 spending, in running your stack on Spot Instance and running Spot Fleet + AutoScaling and High Availability. However, AWS will inevitably reclaim spot instances. AWS announced the Spot Instance two-minute termination notice back in 2015. Although it is not intuitive, it is one of the few ways to prevent data loss from unexpected spot termination. In this post, we walk through how to use the two-minute termination notice to move workloads without interruption.

Run workload on spot instances

There are a few ways to launch spot instances. They can be launched as a single spot request, as a fleet that maintains a target capacity, or reserved for a defined duration. Combining a single spot request with an autoscaling group, and falling back to regular EC2 instances, can be powerful. You can see an example here. If the workload is flexible about instance type, an even better choice is to request a spot fleet with a mix of instance types to maintain capacity. See example here.

In that example, we ran a web application on spot instances. If we can do that, we can run most workloads on spot instances. Perhaps we were lucky, but we have not had an application outage in the last 24 months. We think the key is instance flexibility in the spot market: if demand for a certain instance type spikes, we can move the workload to other instance types with less demand.

 

Spot Instance 2 minutes termination

There are two ways to get the spot instance two-minute termination notice: from EC2 metadata or from a CloudTrail event.

EC2 Metadata

When a spot instance is marked for termination, AWS posts the scheduled termination timestamp at http://169.254.169.254/latest/meta-data/spot/termination-time. Sample data looks like 2017-08-09T15:13:08Z. Following the AWS reference doc, we can check for the timestamp with the following command to determine whether the instance is marked for termination.

if curl -s http://169.254.169.254/latest/meta-data/spot/termination-time | grep -q '.*T.*Z'; then echo terminated; fi
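The same check can be sketched in Python, which is handy when the follow-up work is more than a one-liner. This is a minimal sketch, not the AWS reference implementation: the function names are hypothetical, and it assumes the endpoint answers plain unauthenticated HTTP requests (as the curl command does) and returns an error page rather than a timestamp until the instance is marked.

```python
import re
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Optional

# Endpoint quoted in the text above.
TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"

# Matches timestamps shaped like the sample 2017-08-09T15:13:08Z.
_TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$")


def parse_termination_time(body: str) -> Optional[datetime]:
    """Return the termination time as a UTC datetime, or None if body is not a timestamp."""
    text = body.strip()
    if not _TIMESTAMP.match(text):
        return None
    try:
        parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        # Right shape but not a real date, e.g. month 13.
        return None
    return parsed.replace(tzinfo=timezone.utc)


def fetch_termination_time(url: str = TERMINATION_URL, timeout: float = 2.0) -> Optional[datetime]:
    """Query the metadata endpoint; HTTP errors and network failures mean 'no notice yet'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return parse_termination_time(response.read().decode("utf-8", "replace"))
    except (urllib.error.URLError, OSError):
        return None
```

Calling fetch_termination_time() on a spot instance returns None until a notice is posted, and a datetime afterwards. Parsing is kept separate from the HTTP call so it can be exercised without a metadata service.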

CloudTrail Event

CloudTrail is particularly useful for monitoring API usage in an AWS account; it logs every AWS API call. First-time users need to follow the quick five-minute reference to configure CloudTrail. Once CloudTrail has been set up, we should also receive an event named “TerminateInstances” whenever a spot instance is about to terminate. Here is a sample event JSON:

{
    "eventVersion": "1.05",
    "userIdentity": {
        "type": "Root",
        "principalId": "5953xxxxxxxx",
        "arn": "arn:aws:iam::5953xxxxxxxx:root",
        "accountId": "5953xxxxxxxx",
        "accessKeyId": "xxxxxxxxxxxxxx",
        "sessionContext": {
            "attributes": {
                "mfaAuthenticated": "false",
                "creationDate": "2017-08-07T13:26:23Z"
            }
        }
    },
    "eventTime": "2017-08-07T13:30:40Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "TerminateInstances",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "74.xx.xxx.xx",
    "userAgent": "console.ec2.amazonaws.com",
    "requestParameters": {
        "instancesSet": {
            "items": [
                {
                    "instanceId": "i-xxxxxxxxxxx"
                }
            ]
        }
    },
    "responseElements": {
        "instancesSet": {
            "items": [
                {
                    "instanceId": "i-xxxxxxxxxxx",
                    "currentState": {
                        "code": 32,
                        "name": "shutting-down"
                    },
                    "previousState": {
                        "code": 16,
                        "name": "running"
                    }
                }
            ]
        }
    },
    "requestID": "12dd2efa-b1b8-4b3d-b5a0-3cc0535a3a1a",
    "eventID": "ab205086-b971-43b7-b27c-1463dfb07d56",
    "eventType": "AwsApiCall",
    "recipientAccountId": "5953xxxxxxxx"
}
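For illustration, the instance IDs can be pulled out of an event shaped like the sample above with a few lines of Python. The function name is hypothetical, and the sketch assumes that a failed API call carries an errorCode field, in which case nothing was actually terminated.

```python
from typing import Any, Dict, List


def terminated_instance_ids(event: Dict[str, Any]) -> List[str]:
    """Return instance IDs from a successful TerminateInstances event, else an empty list."""
    if event.get("eventSource") != "ec2.amazonaws.com":
        return []
    if event.get("eventName") != "TerminateInstances":
        return []
    if event.get("errorCode"):
        # The call was rejected, so no instance changed state.
        return []
    # responseElements may be missing or null; treat both as "no items".
    response = event.get("responseElements") or {}
    items = (response.get("instancesSet") or {}).get("items") or []
    return [item["instanceId"] for item in items if "instanceId" in item]
```

Applied to the sample event, this returns a one-element list holding the masked instance ID.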

Transition Workload (Before Termination)

Now that we have the spot instance two-minute termination notification, it is time to react to it. We also want a way to monitor for it constantly. There are three general ways to listen for the two-minute notice and then move the workload elsewhere (we use web application traffic as the example).

Bash:

The command-line approach is to check EC2 metadata periodically, which is the simplest and most straightforward. In a web application, we usually set up an ELB or ALB to dispatch traffic to EC2 spot instances. When we receive the two-minute termination notice, we need to deregister that instance from the load balancer. Remember to set the “Connection Draining” value (called the deregistration delay on an ALB) to less than 2 minutes.

Since the entire stack runs on Spot Fleet, as in the previous post, new spot instances of other types will register themselves as part of the autoscaling effort and pick up the web traffic. If you are running a data processing job on a spot instance, you just need to wrap up the job and commit the existing results to S3 or other storage. We put together a sample script on gist.
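The gist is not reproduced here. As a rough illustration of the deregistration step only, the two load balancer cases might look like the following in Python with boto3. The function names are hypothetical and the clients are passed in, so nothing here is tied to how the original script is written.

```python
from typing import Any


def deregister_from_classic_elb(elb_client: Any, load_balancer_name: str, instance_id: str) -> None:
    """Remove one instance from a classic ELB so it stops receiving new requests."""
    elb_client.deregister_instances_from_load_balancer(
        LoadBalancerName=load_balancer_name,
        Instances=[{"InstanceId": instance_id}],
    )


def deregister_from_target_group(elbv2_client: Any, target_group_arn: str, instance_id: str) -> None:
    """Remove one instance from an ALB target group; in-flight requests drain first."""
    elbv2_client.deregister_targets(
        TargetGroupArn=target_group_arn,
        Targets=[{"Id": instance_id}],
    )


# Typical wiring on the instance (assumes boto3 is installed and the instance
# role allows the deregister calls):
#   import boto3
#   deregister_from_classic_elb(boto3.client("elb"), "my-elb-name", "i-xxxxxxxxxxx")
#   deregister_from_target_group(boto3.client("elbv2"), "<target group ARN>", "i-xxxxxxxxxxx")
```

The instance can read its own ID from http://169.254.169.254/latest/meta-data/instance-id before making either call.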

The last step is to set up a cron job. The official reference recommends checking at five-second intervals. Crontab only supports one minute as its smallest unit, but we can work around that with a loop inside a single entry.

* * * * * for i in 0 1 2 3 4 5 6 7 8 9 10; do /opt/aws/bin/termination-wrap.sh & sleep 5; done; /opt/aws/bin/termination-wrap.sh
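An alternative to the cron loop, shown only as a sketch and not the approach used in this post, is a small long-running poller that sleeps five seconds between checks. All names below are hypothetical; check would wrap the metadata lookup and on_notice would wrap the drain logic. It assumes something else (an init system, for example) keeps the process alive.

```python
import time
from typing import Callable, Optional


def poll_for_notice(
    check: Callable[[], bool],
    on_notice: Callable[[], None],
    interval_seconds: float = 5.0,
    max_checks: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() every interval_seconds and run on_notice() once when it returns True.

    Returns True if a notice was handled, False if max_checks ran out first.
    With max_checks=None the loop runs until a notice arrives.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        if check():
            on_notice()
            return True
        checks += 1
        sleep(interval_seconds)
    return False
```

The sleep function is injectable so the loop can be exercised without waiting. The cron entry above remains the simpler choice when no process supervisor is available.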

Lambda:

We spent some effort investigating a Lambda function that listens for the CloudTrail event. Because CloudTrail logs are stored in S3, the idea is to listen on the S3 bucket for the s3:ObjectCreated:* event, which triggers the Lambda function with the Amazon S3 event. The function then checks whether the CloudTrail event name is “TerminateInstances” and, if so, processes the log records. There is a python example here.
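The flow just described can be sketched as follows. This is not the linked example; it is a minimal illustration that assumes CloudTrail log files are gzip-compressed JSON with a top-level Records array, and that object keys in the S3 notification arrive URL-encoded. The helper names and the optional s3_client parameter are hypothetical.

```python
import gzip
import json
from typing import Any, Dict, List, Tuple
from urllib.parse import unquote_plus


def s3_objects(event: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 notification event."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        key = s3.get("object", {}).get("key")
        if bucket and key:
            pairs.append((bucket, unquote_plus(key)))  # decode the URL-encoded key
    return pairs


def terminate_events(log_bytes: bytes) -> List[Dict[str, Any]]:
    """Decompress one CloudTrail log file and keep the TerminateInstances records."""
    document = json.loads(gzip.decompress(log_bytes).decode("utf-8"))
    return [r for r in document.get("Records", []) if r.get("eventName") == "TerminateInstances"]


def handler(event, context, s3_client=None):
    """Lambda entry point: return IDs of instances named in TerminateInstances records."""
    if s3_client is None:
        import boto3  # imported lazily so the helpers work without it

        s3_client = boto3.client("s3")
    instance_ids = []
    for bucket, key in s3_objects(event):
        body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        for record in terminate_events(body):
            response = record.get("responseElements") or {}
            items = (response.get("instancesSet") or {}).get("items") or []
            instance_ids.extend(item["instanceId"] for item in items if "instanceId" in item)
    return instance_ids
```

A real function would act on the returned IDs instead of just returning them; that part depends on the workload.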

The tricky part of the Lambda approach is that the event carries no information linking the instance to other AWS resources. That is, the Lambda function receives the event JSON, which contains all the information about the terminated instances but nothing about the ALB (in our example). If we need more than that, we have to call the API to retrieve it.
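One way to fill that gap, sketched here under the assumption that the instance sits behind ALB target groups, is to ask the elbv2 API which target groups list the instance. The function name is hypothetical, and the sketch reads only the first page of results.

```python
from typing import Any, List


def target_groups_for_instance(elbv2_client: Any, instance_id: str) -> List[str]:
    """Return ARNs of target groups that currently list instance_id as a target."""
    matches = []
    # First page only; an account with many target groups would need pagination.
    for group in elbv2_client.describe_target_groups()["TargetGroups"]:
        arn = group["TargetGroupArn"]
        health = elbv2_client.describe_target_health(TargetGroupArn=arn)
        if any(d["Target"]["Id"] == instance_id for d in health["TargetHealthDescriptions"]):
            matches.append(arn)
    return matches
```

This costs one API call per target group, which adds up when time is short and is part of why the lookup is awkward to do after the fact.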

Combo (bash+lambda):

This is a more robust approach if we want to accomplish more complicated work. We can still use the same bash script, modified to publish a message with parameters to an SQS queue instead of running the complex ALB deregistration itself. The Lambda function listens to the SQS queue. This way, the Lambda function receives more information than the CloudTrail event provides, and we can centralize the spot instance offloading logic in one Lambda function for the entire account.
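Both halves of the combo can be sketched in Python. The message fields (instance_id, target_group_arn, termination_time) are an assumed format chosen for illustration, not one defined in this post, and the handler assumes the Lambda function is wired to the queue so that each record carries the message in its body field.

```python
import json
from typing import Any, Dict, List


def build_notice(instance_id: str, target_group_arn: str, termination_time: str) -> str:
    """Serialize what the instance knows about itself into one SQS message body."""
    return json.dumps(
        {
            "instance_id": instance_id,
            "target_group_arn": target_group_arn,
            "termination_time": termination_time,
        },
        sort_keys=True,
    )


def publish_notice(sqs_client: Any, queue_url: str, body: str) -> None:
    """Instance side: send the notice to the shared queue."""
    sqs_client.send_message(QueueUrl=queue_url, MessageBody=body)


def handler(event: Dict[str, Any], context: Any, elbv2_client: Any = None) -> List[str]:
    """Lambda side: deregister every instance named in the batch of SQS records."""
    if elbv2_client is None:
        import boto3  # imported lazily so the helpers work without it

        elbv2_client = boto3.client("elbv2")
    handled = []
    for record in event.get("Records", []):
        notice = json.loads(record["body"])
        elbv2_client.deregister_targets(
            TargetGroupArn=notice["target_group_arn"],
            Targets=[{"Id": notice["instance_id"]}],
        )
        handled.append(notice["instance_id"])
    return handled
```

Because the instance supplies the target group ARN itself, the function needs no extra lookups, which is the main advantage over the CloudTrail-only route.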

CloudFormation Template

We are big supporters of infrastructure as code. As always, we would rather automate adding the cron job to spot instances than add it manually. This is easily accomplished with a CloudFormation template. We found the Metadata attribute on the EC2 resource particularly handy for this requirement.

 SpotFleet:
    Type: AWS::EC2::SpotFleet
    Metadata:
      AWS::CloudFormation::Init:
        config:
           commands:
             01_add_instance_to_cluster:
                command: !Join ["",["#!/bin/bash\necho ECS_CLUSTER=", !Ref "awsMeterCluster"," > /etc/ecs/ecs.config\n" ]]
             11_install_aws_cli:
                command: "yum install -y aws-cli"
             12_install_crontab:
                command: "(crontab -l 2>/dev/null; echo \"* * * * * for i in 0 1 2 3 4 5 6 7 8 9 10; do /opt/aws/bin/termination-wrap.sh >> /var/log/termination.log & sleep 5; done; /opt/aws/bin/termination-wrap.sh >> /var/log/termination.log\") | crontab -"
           files:
             /etc/cfn/cfn-hup.conf:
                content: !Sub |
                  [main]
                  stack=${AWS::StackId}
                  region=${AWS::Region}
                mode: '000400'
                owner: 'root'
                group: 'root'
             /opt/aws/bin/termination-wrap.sh:
                content: !Sub 
                  - |
                    #!/bin/bash
                      {bash-script in previous chapter}
                mode: '000751'
                owner: 'root'
                group: 'root'
             /etc/cfn/hooks.d/cfn-auto-reloader.conf:
                content: !Sub |
                  [cfn-auto-reloader-hook]
                  triggers=post.update
                  path=Resources.SpotFleet.Metadata.AWS::CloudFormation::Init
                  action=/opt/aws/bin/cfn-init -v --stack ${AWS::StackId} --resource SpotFleet --region ${AWS::Region}
                mode: '000400'
                owner: 'root'
                group: 'root'
           services:
             sysvinit:
                cfn-hup:
                  enabled: 'true'
                  ensureRunning: 'true'
                  files:
                     - "/etc/cfn/cfn-hup.conf"
                     - "/etc/cfn/hooks.d/cfn-auto-reloader.conf"
    Properties:
      SpotFleetRequestConfigData:
        IamFleetRole: !GetAtt iamFleetRole.Arn
        SpotPrice: !Ref 'EC2SpotPrice'
        TargetCapacity: !Ref 'DesiredCapacity'
        TerminateInstancesWithExpiration: false
        AllocationStrategy: lowestPrice
        LaunchSpecifications:
          - InstanceType: m3.xlarge
          ...
          - InstanceType: m3.2xlarge
          ...

 

In this post, we covered a few approaches to the unexpected spot termination problem. With the Spot Instance two-minute termination notification, we can gracefully drain workloads from spot instances and offload them elsewhere. We hope this helps you move some of your workload to spot instances, which can save you at least 50%-70% of EC2 compute cost.