Open Sourcing Coinbase’s Secure Deployment Pipeline

In 2017, Puppet and DORA (DevOps Research and Assessment) published their annual State of DevOps Report, which collates more than six years of survey data about the cultural and technical impacts of DevOps. Analyzing over 27,000 responses, they found that high performing engineering organizations have:

  1. 46x more frequent code deployments (on demand deployments)
  2. 440x faster lead time from commit to deploy (less than one hour)
  3. 96x faster mean time to recover from downtime (less than one hour)
  4. 5x lower change failure rate (0%-15%)

That is, high performing teams:

“don’t have to trade speed for stability or vice versa, because by building quality in, they get both.”

This goes against the common wisdom that we break things if we move fast. Instead, if a team focuses on building quality automation into their workflows then stability follows.

By this standard Coinbase has a high performing engineering organization, in that:

  1. we deploy hundreds of times per day across hundreds of projects.
  2. a feature can go from an idea, to code, to deployed in production in under an hour.
  3. failure rates are low, and failures are typically easily recoverable.

This is possible because most of our change management and deployment processes are automated and our awesome engineers have adopted a DevOps culture.

Today we are open sourcing a key part of that automation — our AWS deployer Odin. Odin takes a description of a project release and then safely and securely launches it into AWS using auto-scaling groups. The open-source Odin is a newer version of a closed Ruby version, and is still in alpha at Coinbase.

In this post we will describe the design of Odin, its features, and how such a deployer can help build a high performing engineering organization.

At its core, Odin is meant to be simple and straightforward to use, while enforcing good engineering and security standards. As such, Odin was built towards:

  1. Ephemeral Blue/Green: create new services, wait for them to become healthy, delete old services; treating them as disposable and ephemeral.
  2. Declarative: describe what a successful release looks like, not how to deploy it.
  3. Scalable: can scale both vertically (larger instances) and horizontally (more instances).
  4. Secure: resources are verified to ensure that they cannot be used accidentally or maliciously.
  5. Gracefully Fail: handle failures to recover and roll back with no/minimal impact to users.
  6. Configuration Parity: minimize divergence between production, staging and development environments by keeping releases as similar as possible.
  7. No Configuration: once Odin is deployed it requires no further configuration.
  8. Multi Account: one deployer for all AWS accounts.

To satisfy the No Configuration and Multi Account requirements, Odin was implemented using native AWS technologies: an AWS Lambda function and an AWS Step Function (using the step framework) that deploy by assuming a role in the target AWS account.

This means that the only requirement for running Odin is an AWS account, and the only prerequisite for deploying into an account is an IAM role with permission to do so.

Once the Odin lambda, step function and role are in AWS, a release can be deployed using the odin executable. For example:

odin deploy deploy-test-release.json

(Figure: Odin deploy, sped up)

Where the deploy-test-release.json file looks like:

{
  "project_name": "coinbase/deploy-test",
  "config_name": "development",
  "subnets": ["test_private_subnet_a", "test_private_subnet_b"],
  "ami": "ubuntu",
  "user_data": "{{USER_DATA_FILE}}",
  "services": {
    "web": {
      "instance_type": "t2.nano",
      "security_groups": ["ec2::coinbase/deploy-test::development"],
      "elbs": ["coinbase-deploy-test-web-elb"],
      "profile": "coinbase-deploy-test",
      "target_groups": ["coinbase-deploy-test-web-tg"]
    }
  }
}

This Declaratively describes a project with one service, web, that is:

  1. Deployed onto an Ubuntu AMI
  2. Into 2 subnets
  3. With a security group and instance profile
  4. Attached to an ELB and target group

To increase Configuration Parity, all references to resources are tags rather than IDs, because IDs can differ per environment.
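As a rough illustration of why tags help, the sketch below resolves the same tag names to different IDs in each environment, so one release file works everywhere. The lookup table, the subnet IDs, and the resolve_subnets helper are all hypothetical; a real deployer would look these resources up through the AWS API.

```python
# Hypothetical tag-to-ID tables. In practice these lookups would be AWS API
# calls made inside each account; the IDs below are made up for illustration.
SUBNET_IDS_BY_ENV = {
    "development": {
        "test_private_subnet_a": "subnet-dev-a",
        "test_private_subnet_b": "subnet-dev-b",
    },
    "production": {
        "test_private_subnet_a": "subnet-prod-a",
        "test_private_subnet_b": "subnet-prod-b",
    },
}


def resolve_subnets(release, environment):
    """Turn the subnet tags in a release into environment-specific IDs."""
    table = SUBNET_IDS_BY_ENV[environment]
    missing = [tag for tag in release["subnets"] if tag not in table]
    if missing:
        raise LookupError(f"no subnet tagged {missing} in {environment}")
    return [table[tag] for tag in release["subnets"]]


# The release itself never mentions an ID, so it is identical per environment.
release = {"subnets": ["test_private_subnet_a", "test_private_subnet_b"]}
dev_ids = resolve_subnets(release, "development")
prod_ids = resolve_subnets(release, "production")
```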

If the user_data key equals {{USER_DATA_FILE}}, the Odin executable replaces it with the contents of the matching .userdata file, e.g. deploy-test-release.json.userdata:

#cloud-config
repo_update: true
repo_upgrade: all
packages:
- docker.io
runcmd:
- docker run -d nginx

This will start the web service with an nginx http server, which will pass the ELB and target group health checks.
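The user data substitution described above could look roughly like the following. This is a minimal sketch and not Odin's actual code: the load_release function is hypothetical, and the placeholder spelling is taken from the release file shown earlier.

```python
import json
from pathlib import Path

# Placeholder spelling copied from the example release file.
USER_DATA_PLACEHOLDER = "{{USER_DATA_FILE}}"


def load_release(release_path):
    """Read a release file and inline its sibling .userdata file if requested."""
    path = Path(release_path)
    release = json.loads(path.read_text())
    if release.get("user_data") == USER_DATA_PLACEHOLDER:
        # deploy-test-release.json -> deploy-test-release.json.userdata
        user_data_path = path.with_name(path.name + ".userdata")
        release["user_data"] = user_data_path.read_text()
    return release
```

Calling load_release("deploy-test-release.json") would then return the release with the cloud-config text in place of the placeholder.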

The Odin executable takes the deploy-test release file, attaches a few pieces of metadata such as a release-id and created-at date, and sends it to the Odin step function, which:

  1. validates the sent release and all referenced resources.
  2. creates a new auto-scaling group for the web service that starts nginx.
  3. waits for all EC2 instances in the web ASG to pass their ASG, ELB, and target group health checks. This may take a few minutes.
  4. once healthy, deletes the ASGs from the previous release and terminates their instances.

This is Ephemeral Blue/Green: new servers are created and old instances are deleted. With this, Coinbase can enforce our 30-day fleet age policy, where we aim to have 98% of our instances under 30 days old.
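To make the ordering concrete, here is a small sketch of the create, wait, delete sequence against an in-memory stand-in for AWS. FakeCloud, blue_green_deploy, and the ASG naming scheme are all hypothetical and only illustrate the idea that old ASGs are removed only after the new one is healthy.

```python
class FakeCloud:
    """In-memory stand-in for AWS; every method here is hypothetical."""

    def __init__(self):
        self.asgs = {}  # ASG name -> service name

    def create_asg(self, name, service):
        self.asgs[name] = service

    def delete_asg(self, name):
        del self.asgs[name]


def blue_green_deploy(cloud, service, release_id, is_healthy):
    """Create a new ASG, and delete the old ones only once it is healthy."""
    old_names = [name for name, svc in cloud.asgs.items() if svc == service]
    new_name = f"{service}-{release_id}"  # assumed naming scheme
    cloud.create_asg(new_name, service)
    if not is_healthy(new_name):
        # Failed release: remove what we created and leave the old ASGs serving.
        cloud.delete_asg(new_name)
        return False
    for name in old_names:
        cloud.delete_asg(name)
    return True
```

Because the old ASGs are untouched until the health check passes, a failed release leaves the previous release running.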

Odin is a state machine, so we can visually see the progress of the deploy using the AWS console.

Odin’s state machine takes the original release object and passes it through each state adding and editing data until it reaches a success or failure state. The main Odin states are:

  1. Validate: validates that the release is correct.
  2. Lock: grabs a lock so the same project-configuration cannot be deployed concurrently.
  3. ValidateResources: validates resources with respect to the project, configuration and service using them.
  4. Deploy: creates an ASG and other resources for each service.
  5. CheckHealthy: checks whether the newly created instances are healthy with respect to their ASGs, ELBs and target groups. If instances are seen terminating, the release is halted immediately.
  6. CleanUpSuccess: if the release was a success, deletes the old ASGs.
  7. CleanUpFailure: if the release failed, deletes the new ASGs.
  8. ReleaseLockFailure: tries to release the lock, then ends the release as a failure.
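The Lock state in the list above can be pictured with the following sketch, which keeps one lock per project and configuration pair. The DeployLocks class is hypothetical and lives in memory; a real deployer would need an atomic, shared store for this, and the post does not say which one Odin uses.

```python
class LockExistsError(Exception):
    """Raised when another release already holds the lock."""


class DeployLocks:
    """In-memory sketch of one lock per (project, configuration) pair."""

    def __init__(self):
        self._held = {}  # (project, config) -> release_id holding the lock

    def grab(self, project, config, release_id):
        key = (project, config)
        holder = self._held.get(key)
        if holder is not None and holder != release_id:
            raise LockExistsError(f"{project}/{config} is locked by {holder}")
        self._held[key] = release_id

    def release(self, project, config, release_id):
        # Only the release that holds the lock may release it.
        if self._held.get((project, config)) == release_id:
            del self._held[(project, config)]

    def holder(self, project, config):
        return self._held.get((project, config))
```

Keying the lock on both project and configuration means a development deploy never blocks a production deploy of the same project.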

Understanding how each state can go wrong and how to respond allows Odin to Gracefully Fail. Once a failure occurs Odin will try to leave AWS clean by deleting created resources. Some common failures are:

  • BadReleaseError: The sent release was invalid or a resource it referenced was invalid.
  • LockExistsError: Another deploy is currently going out, or a previous deploy failed in an unknown way and requires manual cleanup.
  • DeployError: Unable to create a resource.
  • HaltError: Halt was detected or instances were found terminating.
  • TimeoutError: The deploy took too long to become healthy. The default time Odin waits is 10 minutes, but the max time is 1 year (how long a step function can run).
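The interplay between HaltError and TimeoutError during the health check might be sketched as a polling loop like the one below. This is a simplification under stated assumptions: Odin runs as step function states rather than a single loop, and wait_until_healthy, the instance state strings, the poll interval, and the DeployTimeoutError name (standing in for TimeoutError) are all hypothetical. Only the 10 minute default comes from the text.

```python
import time


class HaltError(Exception):
    """Instances were found terminating."""


class DeployTimeoutError(Exception):
    """Stands in for the TimeoutError described above."""


def wait_until_healthy(get_states, timeout_s=600, poll_s=5,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll instance states until all are healthy, halting or timing out."""
    deadline = clock() + timeout_s
    while True:
        states = get_states()  # e.g. ["healthy", "pending", "terminating"]
        if "terminating" in states:
            raise HaltError("instances are terminating, halting release")
        if states and all(state == "healthy" for state in states):
            return
        if clock() >= deadline:
            raise DeployTimeoutError("release took too long to become healthy")
        sleep(poll_s)
```

Passing clock and sleep as parameters keeps the loop testable without real waiting.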

Once Odin has finished deploying it will end in one of these states:

  1. Success: the release was deployed.
  2. FailureClean: the release was unsuccessful, but cleanup was successful so AWS was left in a good state.
  3. FailureDirty: the release was unsuccessful, and cleanup failed so AWS was left in a bad state. This should never happen; if it does, you should alert on it and file a bug on GitHub.

It is technically possible to end in any state if Odin hits an error it cannot recover from. If this happens, alert on it and file a bug on GitHub, as it is definitely a bug.
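One way to picture how the states and end states fit together is the sketch below. The state names come from the lists above, but the transitions between them are assumptions made for illustration (the post lists the states without spelling out every edge), and run_release and its handlers argument are hypothetical.

```python
HAPPY_PATH = ["Validate", "Lock", "ValidateResources",
              "Deploy", "CheckHealthy", "CleanUpSuccess"]


def run_release(handlers, release):
    """Walk the states; handlers maps state name -> callable(release).

    A handler signals failure by raising. Returns the name of the end state.
    """
    lock_held = False
    for state in HAPPY_PATH:
        try:
            handlers[state](release)
        except Exception:
            return _fail(handlers, release, state, lock_held)
        if state == "Lock":
            lock_held = True
    return "Success"


def _fail(handlers, release, failed_state, lock_held):
    # Assumption: a failure while deleting old ASGs leaves AWS dirty.
    if failed_state == "CleanUpSuccess":
        return "FailureDirty"
    cleanup = []
    # Assumption: new ASGs only exist once Deploy has started.
    if failed_state in ("Deploy", "CheckHealthy"):
        cleanup.append("CleanUpFailure")
    if lock_held:
        cleanup.append("ReleaseLockFailure")
    try:
        for state in cleanup:
            handlers[state](release)
    except Exception:
        return "FailureDirty"
    return "FailureClean"
```

Releasing the lock after a successful release is not shown, since the list above only covers the main states.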

Scale has been important in every aspect of Coinbase recently. Around December 2017 we became both the 40th largest website in the USA and the top iOS app, with 20x more traffic than we had received just a month before. The entire company had to quickly respond to this, especially our application engineers.

Fortunately, Odin was built with scale in mind, and with only minor configuration changes applications could increase both the size and the number of their servers, as well as add auto-scaling rules to handle traffic spikes. For example, to scale the deploy-test web service we could use:

{ ...
  "services": {
    "web": { ...
      "instance_type": "c4.xlarge",
      "autoscaling": {
        "min_size": 3,
        "max_size": 5,
        "policies": [
          {
            "type": "cpu_scale_up"
          },
          {
            "type": "cpu_scale_down"
          }
        ]
      }
    }
  }
}

With these changes, deploy-test can handle increased traffic and scale its instances relative to CPU usage, making it resilient to sudden traffic spikes.
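As a rough mental model of what the autoscaling block asks for, the sketch below evaluates the two policies once and clamps the result to min_size and max_size. The CPU thresholds, the one-instance step, and the next_capacity function are hypothetical; the post does not state the values Odin uses, and in AWS the scaling itself is carried out by the auto-scaling group.

```python
def next_capacity(current, cpu_percent, autoscaling, up_at=70.0, down_at=30.0):
    """One evaluation of the cpu_scale_up / cpu_scale_down policies.

    up_at and down_at are made-up thresholds for illustration only.
    """
    policy_types = {policy["type"] for policy in autoscaling.get("policies", [])}
    desired = current
    if "cpu_scale_up" in policy_types and cpu_percent >= up_at:
        desired = current + 1
    elif "cpu_scale_down" in policy_types and cpu_percent <= down_at:
        desired = current - 1
    # Never leave the bounds declared in the release.
    return max(autoscaling["min_size"], min(autoscaling["max_size"], desired))


autoscaling = {
    "min_size": 3,
    "max_size": 5,
    "policies": [{"type": "cpu_scale_up"}, {"type": "cpu_scale_down"}],
}
```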

Deployers are critical pieces of infrastructure and must be Secure. That means ensuring only authorized users can deploy, limiting what resources they can use, and being able to see who did what and when.

Authentication is handled by good IAM policies, such as ensuring that only Odin can deploy and only selected users can call the Odin step function.

Authorization is enforced through tags on resources, so that only the correct project, configuration and service can use them. Also, by restricting use of S3 you can limit who can deploy which project.
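A tag check of this kind might look like the sketch below. The tag keys (ProjectName, ConfigName, ServiceName) and the validate_resource function are assumptions for illustration; the post only says that tags tie a resource to a project, configuration and service.

```python
def validate_resource(resource_tags, project, config, service):
    """Reject a resource unless its tags match the release using it."""
    expected = {
        "ProjectName": project,   # assumed tag key
        "ConfigName": config,     # assumed tag key
        "ServiceName": service,   # assumed tag key
    }
    mismatched = {
        key: resource_tags.get(key)
        for key, value in expected.items()
        if resource_tags.get(key) != value
    }
    if mismatched:
        raise PermissionError(f"resource tags do not match release: {mismatched}")


elb_tags = {
    "ProjectName": "coinbase/deploy-test",
    "ConfigName": "development",
    "ServiceName": "web",
}
validate_resource(elb_tags, "coinbase/deploy-test", "development", "web")
```

With a check like this, a release for another project that references the same ELB would be rejected during validation rather than at deploy time.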

Replay and man-in-the-middle attacks are guarded against by validating that the release's creation date is recent and comparing the release to the one uploaded to S3.

Auditing what happened and when is the easiest aspect of Odin. All executions of step functions and lambdas are written to logs by AWS which can be inspected. However, to make them searchable you should automate their export to another service like Kibana or Datadog.
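The two checks can be sketched as follows. This is an illustration under assumptions, not Odin's implementation: the created_at field name, the five minute window, the SHA-256 comparison, and the check_release function are all hypothetical, and the uploaded copy is passed in as a dict instead of being fetched from S3.

```python
import hashlib
import json
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(minutes=5)  # hypothetical freshness window


def release_digest(release):
    """Hash a canonical JSON encoding so key order does not matter."""
    canonical = json.dumps(release, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_release(sent, uploaded, now=None):
    """Reject stale (replayed) or altered releases by raising ValueError."""
    now = now or datetime.now(timezone.utc)
    # created_at is assumed to be an ISO 8601 string with a UTC offset.
    created_at = datetime.fromisoformat(sent["created_at"])
    age = now - created_at
    if not timedelta(0) <= age <= MAX_AGE:
        raise ValueError("release creation date is not recent")
    if release_digest(sent) != release_digest(uploaded):
        raise ValueError("sent release differs from the uploaded copy")
```

The freshness check limits how long a captured release can be replayed, and the comparison catches a release that was modified in transit.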

Odin @ Coinbase

Having our engineers manually manage multiple release bundles and deploy with an Odin executable is not a great user experience. Also, the release information and the code may be sensitive or mission critical, so to be safe and secure we would have to limit who can deploy to only “trusted” engineers. Bad UX and limitations slow down all engineers, introduce significant bottlenecks, increase deploy failures, and make us less secure.

To fix these issues we built Codeflow, an internal web application (not open source, yet…) that manages configurations and interacts with Odin. Codeflow tries to remove bottlenecks by letting all engineers deploy as long as the code and configurations have been reviewed.

Codeflow and Odin have separate concerns of what is deployed and how to deploy respectively. Together they automate and secure our deploy pipeline by enabling our engineers to deploy. By focusing on this kind of automation Coinbase avoids trading speed for stability, and instead we get both and move deliberately to fix things.

If you want to work with a high performing engineering organization, you should join Coinbase!

The links in this blog post are being provided as a convenience and for informational purposes only; they do not constitute an endorsement or an approval by Coinbase of any of the content or views expressed by or on any external site. Coinbase bears no responsibility for the accuracy, legality or content of the external site or for that of subsequent links. Contact the external site for answers to questions regarding its content.


Open Sourcing Coinbase’s Secure Deployment Pipeline was originally published in The Coinbase Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.
