Why Eventbrite runs a 700-node Kube cluster just for development

Q&A

How is the Eventbrite application architected?

What prompted you to overhaul your dev envs and what problems did you set out to solve?

How did you convince your company?

What’s the developer workflow like with yak?

  1. Reconnect to their previous session: this takes a few seconds and they can resume their work from where they left off the previous day.
  2. Update their local branch and update their remote Docker images: this takes 5–7 minutes to get the environment updated.
  • Change code locally: Changed files are automatically synced over to their remote containers. It usually takes a few seconds for the changes to be available. We use rsync for this, which is very efficient. To keep it simple, we do a one-way sync (from laptop to remote container).
  • Debug code: Developers can add breakpoints in their code and attach to a running container to get a live debugging session. We provided a command that wrapped kubectl attach under the hood.
  • Run tests: Developers can run unit tests locally, but any tests that require dependencies (such as a DB or Redis) can be run remotely in a pod. For integration tests, they can run tests in a specific pod and connect to the other services directly.
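
The sync and debug steps above can be sketched as two small command builders. This is a minimal sketch under stated assumptions, not yak's actual implementation: the function names, the `./kube-rsh` wrapper script (imagined here as something that starts the remote rsync process inside the pod), the pod and namespace names, and the paths are all hypothetical. Only the rsync and kubectl flags themselves are real.

```python
import shlex


def build_sync_command(local_dir, pod, remote_dir, rsh_wrapper="./kube-rsh"):
    """Build a one-way rsync invocation (laptop -> remote container).

    `rsh_wrapper` is a hypothetical script that would run the remote
    rsync process inside the pod; it is an assumption, not part of yak.
    """
    return [
        "rsync",
        "--archive",       # preserve permissions, timestamps, symlinks
        "--compress",      # compress data on the wire
        "--delete",        # remove remote files that were deleted locally
        "--exclude=.git",  # skip version-control metadata
        "--blocking-io",   # use blocking I/O when launching the remote shell
        f"--rsh={rsh_wrapper}",
        local_dir.rstrip("/") + "/",  # trailing slash: copy contents, not the directory
        f"{pod}:{remote_dir}",
    ]


def build_attach_command(pod, container, namespace):
    """Build a `kubectl attach` invocation for an interactive debug session."""
    return [
        "kubectl", "attach",
        "--stdin", "--tty",
        "--namespace", namespace,
        "--container", container,
        pod,
    ]


if __name__ == "__main__":
    # Hypothetical pod, container, and namespace names.
    print(shlex.join(build_sync_command("./src", "dev-pod-0", "/app/src")))
    print(shlex.join(build_attach_command("dev-pod-0", "web", "dev-alice")))
```

Because the source is always the laptop and `--delete` mirrors local deletions, the remote copy never needs to be merged back, which is what keeps a one-way sync simple.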

How does it work?

How does the DevTool team interact with the development environments?

What’s the ongoing maintenance burden?

How has the environment changed from when you first designed it?

  • Our infrastructure was running on one EKS cluster originally. At one point, we had 700 worker nodes, and 14,000 pods running. We ran into performance and rate-limiting issues that made us reconsider this single-cluster approach. Over time, we switched to a multi-cluster architecture where each cluster had no more than 200 nodes.
  • Syncing the code directly into running containers could sometimes cause the container to crash if the changes made the application fail the probe check. After iterating a few times on how to solve this problem, we decided to set up a sidecar container that is responsible for syncing the code.
  • To persist data over time (for example to save the MySQL database files of a developer) we use StatefulSets backed by EBS volumes. However, AWS has a limitation around EBS volumes: an application running in a pod on EKS must be on a node in the same availability zone (AZ) as the EBS volume. To solve this problem, we partitioned our EKS nodes per availability zone and we used taints to make sure that the pods of our StatefulSet would be scheduled in the same AZ as their volume.
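
The sidecar and AZ-pinning ideas above can be illustrated by generating the relevant manifest fragments. This is a rough sketch, not Eventbrite's actual configuration: the image names, the `dev/zone` taint key, the probe path and port, the mount paths, and the storage size are all assumptions. The sketch pairs a toleration with a node selector on the well-known `topology.kubernetes.io/zone` label, since a taint keeps other pods off a node while a selector (or affinity) is what pins a pod to a zone.

```python
import json

ZONE_LABEL = "topology.kubernetes.io/zone"  # well-known Kubernetes node label


def dev_pod_spec(app_image, sync_image):
    """Pod spec where a sidecar, not the app container, receives synced code."""
    mount = {"name": "code", "mountPath": "/app/src"}  # hypothetical path
    return {
        # An emptyDir volume is shared by both containers and survives
        # restarts of an individual container in the pod.
        "volumes": [{"name": "code", "emptyDir": {}}],
        "containers": [
            {
                "name": "app",
                "image": app_image,
                "volumeMounts": [mount],
                # If bad code fails this probe, only the app container restarts.
                "livenessProbe": {"httpGet": {"path": "/healthz", "port": 8000}},
            },
            {"name": "code-sync", "image": sync_image, "volumeMounts": [mount]},
        ],
    }


def mysql_statefulset(name, zone, taint_key="dev/zone"):
    """StatefulSet pinned to one AZ so its EBS-backed volume stays reachable."""
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            "serviceName": name,
            "replicas": 1,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "nodeSelector": {ZONE_LABEL: zone},  # pin to the zone
                    "tolerations": [{                     # allow the tainted nodes
                        "key": taint_key, "operator": "Equal",
                        "value": zone, "effect": "NoSchedule",
                    }],
                    "containers": [{
                        "name": "mysql",
                        "image": "mysql:8.0",
                        "volumeMounts": [{"name": "data", "mountPath": "/var/lib/mysql"}],
                    }],
                },
            },
            "volumeClaimTemplates": [{
                "metadata": {"name": "data"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "resources": {"requests": {"storage": "10Gi"}},
                },
            }],
        },
    }


if __name__ == "__main__":
    # Hypothetical names and zone, printed as JSON (which kubectl also accepts).
    print(json.dumps(mysql_statefulset("alice-mysql", "us-east-1a"), indent=2))
```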

Have there been any unexpected benefits?

  1. Sharing environments: Have you ever heard a developer say “but it worked for me locally when I ran the tests?” Consistency improves by running in the cloud. The ability to share developer environments proved to be very helpful when trying to understand test failures or work on issues that were hard to reproduce.
  2. Working globally: We have a globally distributed team but most of the test/QA infrastructure is in the US. Simple operations like resolving the application dependencies or downloading a Docker image require a lot of network round trips. If the network latency is poor, these operations are slow. By running on the cloud, the developer opens a connection to a container (with port forwarding or by getting a shell) and then runs their commands from the same AWS region where the rest of the infrastructure is located. For our engineers based outside of the US, being able to develop on the cloud has been a huge improvement.
  3. Transitioning during COVID: When COVID happened, all developers switched to working remotely. For some, this meant sharing a home internet connection with other household members or moving back in with their family. It would have been extremely difficult or impossible for some of them to run the developer environment locally. Operations such as pulling Docker images or resolving application dependencies would require gigabytes of data on a daily basis. By developing on the cloud, the transition to remote work was fairly seamless and the developers were able to continue their work from home.

Opta — open source Infra-as-Code https://github.com/run-x/opta/

Remy DeWolf
