Run Jenkins Infrastructure on AWS Container Service

Our Journey to Continuous Delivery: Chapter 4 of 4

Remy DeWolf

--

One of the hardest challenges of delivering software frequently is building and testing a fast-moving stream of code changes. In this chapter, we present a scalable infrastructure to host Jenkins and its build farm.

Since we already host our production environment on AWS, it was the natural place for the build infrastructure as well. We wanted a simple, standardized solution, so we first looked at the managed services AWS already provides, and we decided to use ECS for container management.

Amazon EC2 Container Service (ECS) is a highly scalable, high performance container management service that supports Docker containers and allows you to easily run applications on a managed cluster of Amazon EC2 instances.

First, we will review the full AWS stack necessary to run a Jenkins build farm using ECS. Second, we will provide a fully working example to deploy the stack. When running this example, we will cover two use cases:

  • Auto-Scaling (+/-): When receiving a traffic spike, EC2 instances are added to handle the load. When the system is idle, the extra instances are terminated until the minimum size is reached.
  • Self-Healing: When Jenkins crashes, it is automatically restarted in a new container and attached to the load balancer.

Jenkins ECS Architecture Overview

Being on the cloud, our architecture consists of virtual servers (EC2 instances, pictured as grey boxes), storage (EFS), and services (Load Balancer, Auto-Scaling Group, Alarms and Container Service).

To achieve auto-scaling, the alarms monitor the CPU usage and trigger the Auto-Scaling Group to add or terminate instances based on the number of Jenkins builds to run.
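The scaling behavior described above can be sketched as a small decision function. This is a simplified model, not the actual CloudWatch/Auto-Scaling Group implementation; the 75% threshold and the 2–5 instance bounds match the example stack deployed later in this article.

```python
# Simplified model of the CloudWatch alarm + Auto-Scaling Group behavior.
# The real services are event-driven; this only illustrates the decision logic.

MIN_INSTANCES = 2          # 1 for the Jenkins master, 1 spare for builds
MAX_INSTANCES = 5          # capped to keep costs under control
SCALE_UP_THRESHOLD = 75    # % CPU reservation that trips the scale-up alarm
SCALE_DOWN_IDLE_MIN = 10   # minutes of inactivity before scaling down

def desired_instances(current: int, cpu_reservation: float, idle_minutes: int) -> int:
    """Return the instance count the Auto-Scaling Group should converge to."""
    if cpu_reservation >= SCALE_UP_THRESHOLD:
        return min(current + 1, MAX_INSTANCES)   # scale-up alarm fired
    if idle_minutes >= SCALE_DOWN_IDLE_MIN:
        return max(current - 1, MIN_INSTANCES)   # scale-down alarm fired
    return current                               # no alarm, keep steady

print(desired_instances(2, cpu_reservation=90.0, idle_minutes=0))   # spike: add one
print(desired_instances(5, cpu_reservation=90.0, idle_minutes=0))   # already at max
print(desired_instances(3, cpu_reservation=10.0, idle_minutes=12))  # idle: remove one
```

The min/max clamp is what keeps the group between its configured bounds no matter how many alarm events fire.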

For self-healing, the ECS Cluster runs Jenkins as a service, ensuring that it is always up and healthy.

Scalable AWS Architecture based on ECS Cluster.

Let’s highlight the role each component plays in the Jenkins infrastructure.

  • ECS Cluster: manages Docker containers execution on the available pool of EC2 instances. Jenkins has an ECS plugin to execute tasks on the cluster.
  • Auto-Scaling Group: contains a collection of EC2 instances and can adjust their number based on events. Our build system alternates between spikes and low-usage periods, so we need to adjust our infrastructure accordingly. When properly configured, it also prepares you for future growth, as the infrastructure adjusts with your needs.
  • CloudWatch Alarms: send events to adjust the size of the Auto-Scaling Group based on the current load. Alarms allow fine-tuning how fast the infrastructure scales up and down; it is about finding the balance between preventing pipeline clogging and keeping AWS costs under control.
  • EFS Storage: provides scalable file storage that can be shared across multiple instances, preventing Jenkins from running out of disk space. Because EFS can be mounted from many instances at once, it works better for self-healing than EBS, which can be slow to detect that a volume has switched from “in-use” to “available”.
  • Load Balancer: distributes incoming application traffic across multiple Amazon EC2 instances. It gives a consistent URL for the user while letting ECS decide where the application is hosted.
  • VPC: The Virtual Private Cloud is a virtual network dedicated to your AWS account. It is logically isolated from other virtual networks in the AWS cloud. This level of isolation increases the overall security of Jenkins in multiple ways: it prevents someone with Jenkins access from reaching production and allows defining VPC permissions for certain groups of users. Billing can also be broken down by VPC.

AWS Hands-on: Scalable Jenkins Infrastructure using AWS

Cost Warning: This example is designed to run on micro instances (t2.micro), the maximum size of the Auto-Scaling Group has been set to 5 instances. Including storage and data transfer, it costs approximately $0.10/hour to run this stack.
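To see how that estimate adds up, here is a rough back-of-envelope calculation. The prices below are assumed 2017-era us-east-1 on-demand rates and will vary by region and over time.

```python
# Back-of-envelope estimate of the stack's hourly cost at full scale.
# Prices are assumed us-east-1 on-demand rates (they change over time);
# this only illustrates how the ~$0.10/hour figure adds up.

T2_MICRO_PER_HOUR = 0.0116   # assumed EC2 t2.micro on-demand price
ELB_PER_HOUR = 0.025         # assumed Classic Load Balancer hourly charge
MAX_INSTANCES = 5

worst_case = MAX_INSTANCES * T2_MICRO_PER_HOUR + ELB_PER_HOUR
print(f"~${worst_case:.3f}/hour at full scale (plus small EFS and data-transfer costs)")
```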

In the next section, we will deploy the Jenkins infrastructure to AWS. Once deployed on the cloud, we will be able to simulate Auto-Scaling and Self-Healing.

1. Launch the CloudFormation stack

Here is the quick link that will take you directly to the Create Stack wizard with the expected parameters:

Even if you have never used AWS, this stack will work as it contains everything it needs. If you have some existing infrastructure, it won’t interfere with it, as the stack is isolated in its own VPC.

The link takes you to the wizard page with the template already selected; just click Next.
You can optionally restrict the allowed IP range. Once done, click Next and then Create.

It takes approximately 5 minutes to complete. Once done, copy the Jenkins URL from the stack’s outputs.

CloudFormation completed successfully, Jenkins is up!

2. Simulate Auto-scaling: Adding and terminating instances based on load

Go to the JenkinsELB URL. Jenkins is already set up for ECS and comes with pre-configured projects. To observe auto-scaling, we will simulate a traffic spike by building “Run_All_Jobs”.

Click on the dropdown for “Run_All_Jobs” and select “Build Now”.

Now, in EC2 Container Service, select the cluster called “jenkins-cluster”; a few tasks are listed:

  • Jenkins Master: a service task.
  • Java or JavaScript: these tasks are the first jobs in the queue making their way to ECS.

ECS shows the currently running tasks; each task runs in a Docker container.

Next, Jenkins attempts to add more tasks, but the cluster is at full capacity. After a minute of high CPU reservation, the scale-up alarm goes off in CloudWatch.

The scale-up policy has changed to the “Alarm” state.
When the cluster reaches 75% CPU reservation, the alarm goes off, triggering new instances.

The alarm triggers the policy to create new EC2 instances in the Auto-Scaling Group. This process is repeated until it reaches the maximum allowed, set to five for this example.

Pro Tip: Setting a max value is good practice to prevent unexpected costs.

When notified by the alarm, the Auto-Scaling Group launches new instances.

Back in the ECS cluster, the number of registered instances has increased from 2 to 4, and more tasks are now running.

The cluster places the tasks on the container instances. One instance can run many tasks concurrently.
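How many tasks fit on one instance depends on the CPU and memory each task reserves. Here is a quick sketch: the instance capacity figures are typical values a t2.micro registers with the ECS agent, and the per-task reservations are illustrative assumptions.

```python
# Rough model of ECS task placement on a single container instance.
# Capacity figures are typical for a t2.micro as registered by the ECS
# agent; the per-task reservations below are illustrative assumptions.

INSTANCE_CPU_UNITS = 1024   # 1 vCPU = 1024 ECS CPU units
INSTANCE_MEMORY_MIB = 996   # memory a t2.micro typically registers

def tasks_per_instance(task_cpu: int, task_memory_mib: int) -> int:
    """Tasks that fit on one instance: limited by the scarcer resource."""
    return min(INSTANCE_CPU_UNITS // task_cpu,
               INSTANCE_MEMORY_MIB // task_memory_mib)

# With 256 CPU units and 256 MiB per task, memory is the limiting factor.
print(tasks_per_instance(task_cpu=256, task_memory_mib=256))
```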

Back in Jenkins, the builds are running concurrently. Each job is executed in its own Docker container, in an ECS task.

As a result of the added capacity, Jenkins can now run more jobs in parallel.

Once the spike is over, the ECS cluster still has 5 instances. If there is another surge, the instances are available to handle it.

If the period of inactivity lasts longer, we want to scale down so we don’t pay for unused instances. For this example, we have set up another CloudWatch alarm that triggers after 10 minutes of inactivity.

After 10 minutes of inactivity, the scale-down policy is in the “Alarm” state.

When receiving the event, the Auto-Scaling Group terminates some instances. This process is repeated until it reaches the minimum number of instances, set to 2:

  • 1 instance for the Jenkins master.
  • 1 instance available for upcoming builds, to prevent waiting in the queue.

When notified by the alarm, the Auto-Scaling Group terminates some instances.

3. Simulate a Jenkins outage and observe self-healing

Finally, what happens if Jenkins crashes? To simulate an outage, let’s shut down Jenkins:

Go to the homepage and add /exit to the end of the URL to access this feature. Press the button “Try POSTing”.

Jenkins is terminated through the UI.

Check the Load Balancer: the instance is now listed as “OutOfService”.

Since the Health Check fails, the ELB marks the instance as OutOfService.

The Load Balancer deregisters the unhealthy instance, and the ECS service responds by starting a new task.
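The recovery sequence can be modeled as a tiny simulation. This mirrors the behavior observed above; it makes no real AWS API calls.

```python
# Toy simulation of the self-healing flow: the ELB health check fails,
# the instance goes OutOfService, and the ECS service replaces the task.
# The event names are descriptive labels, not actual AWS event types.

def self_heal(jenkins_healthy: bool) -> list:
    """Return the sequence of events until Jenkins is serving again."""
    if jenkins_healthy:
        return ["InService"]
    return [
        "health check failed",          # ELB probe gets no response
        "instance OutOfService",        # ELB deregisters the target
        "ECS starts replacement task",  # service enforces desired count = 1
        "health check passed",          # new Jenkins container is up
        "InService",                    # ELB re-registers the target
    ]

for event in self_heal(jenkins_healthy=False):
    print(event)
```

The key point is that no human is in the loop: the ECS service definition alone guarantees the task is restarted.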

The ECS service lists the succession of events. The downtime lasted 90 seconds.

Once the service restarts the task, Jenkins is restored.

Jenkins is back up behind the ELB.

To recap, here is a video of the whole experiment, in case you missed something along the way.

Deploy Jenkins infrastructure to Amazon EC2 Container Service.

Lessons Learned about AWS Hosting

We have accomplished our goal of a scalable architecture for our build system.

Every build is executed in a container.

Let’s step back and reflect on some lessons learned:

  • AWS Experience: Even if you are only interested in one AWS service, you will need broader knowledge when troubleshooting. Most likely, you will stumble into network, security, and permission issues.
  • Permissions: Once we got our own AWS account with admin permissions, we started moving faster. It might be a big change for some companies. One suggestion is to work in different AWS Accounts or VPCs.
  • Self-Healing: anticipate Jenkins outages and design your system to recover from them. It could be an application failure but also an AWS failure; be ready for it. We used to get paged when Jenkins went down, but not anymore.
  • EC2 Instance Types: Using containers gives you the flexibility of running multiple tasks on the same instance. It opens the door to multiple optimizations. Try out different scenarios and measure performance. We have some projects where we maintain a fixed group of cheap burst instances (t2) which are mostly idle but available when needed, and an Auto-Scaling Group of more expensive Compute Optimized Instances (c4) to handle spikes.

Pro Tip: In December 2017, AWS launched Fargate, which removes the need to manage EC2 instances. Full review here.

  • Cost Optimization: be aware of the EC2 bill for your build farm and make adjustments when needed. It’s very easy to over-provision the infrastructure in the cloud. Use of auto-scaling should include some scaling-down policies as well. When this article was originally written, Amazon would charge per hour, but it has since switched to Per-Second Billing (Sept. 2017).
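To see why the billing model matters for a bursty build farm, compare what a short-lived burst instance costs under hourly versus per-second billing. The instance price is an assumed on-demand rate.

```python
# Why per-second billing matters for a bursty build farm: compare the
# cost of a 10-minute burst instance under hourly vs. per-second billing.
# The price below is an assumed t2.micro on-demand rate.

import math

T2_MICRO_PER_HOUR = 0.0116  # assumed on-demand price

def cost(run_seconds: int, per_second: bool) -> float:
    """Billed cost of one instance running for run_seconds."""
    if per_second:
        billed = max(run_seconds, 60)                   # 60-second minimum
    else:
        billed = math.ceil(run_seconds / 3600) * 3600   # round up to full hours
    return billed * T2_MICRO_PER_HOUR / 3600

print(f"10-min spike, hourly billing:     ${cost(600, per_second=False):.4f}")
print(f"10-min spike, per-second billing: ${cost(600, per_second=True):.4f}")
```

Under hourly billing, a 10-minute build spike pays for a full hour; under per-second billing it pays for one sixth of that, which changes the economics of aggressive scale-down policies.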

Now, at the end of this series, let us review our goals:

  • (Eliminating) Manual Work: Achieved by running Jenkins in Docker, configuring Jenkins through DSL and deploying infrastructure with CloudFormation.
  • (Reducing) Build Time: Achieved by using the Jenkins DSL to configure jobs to run in parallel, a scalable infrastructure to prevent waiting in queue, and optimized EC2 instance types.
  • Engineering Growth: Scalable architecture using Auto-Scaling Group(s).
  • Engineering Satisfaction: Engineers can maintain their build configuration within their GitHub repository. Jenkins downtime is minimal; self-healing allows us to recover when it crashes.

Thank you for following along. If you would like more information, please ask your questions in the comments section below. You can also review the source code for the examples and check our video tutorials.

Previously in this series: Chapter 1, Chapter 2, Chapter 3.

Related: AWS Fargate review: ECS without EC2 instances.
