The only constant in software development is change.
As companies grow and mature, the way in which they develop, test, build and deploy their applications continuously changes. Changing product requirements, the introduction of new technologies, and even internal organizational and cultural changes result in a technological environment highly conducive to the advent of new methodologies and new tools.
Some of the most popular open source tools used today began as a journey traveled by one organization, privately (Kafka, Prometheus, etc.). Later, this journey resulted in tools and solutions that were open sourced, shared and adopted by other organizations.
Today, I’m happy to share with you the story of one such journey that we made here at Logz.io — a one and a half year long Continuous Deployment (CD) journey that resulted in the full automation of our deployment to production, and an open source CD tool for deploying to Kubernetes — Apollo.
In the beginning…
Similar to the way I believe most startups begin, Logz.io deployments were handled using scripts. Lots of them.
We then naturally progressed to using Puppet, which we love and still use. We found, though, that using Puppet to manage deployments was not a sustainable approach. Long before it became mainstream, we were using Docker in production, and our architecture was based on tying a container to an EC2 instance.
Just as an example, let’s take our alerting feature.
One container was responsible for this service, so we had an EC2 instance managed with Puppet, which enforced four running containers: the service code from the master branch, a Filebeat Docker container for logging, a collectd container for collecting host metrics, and our own jmx2graphite container to collect JMX metrics.
Wanted – Continuous Deployment!
This model worked for a time, especially when the company and our application were smaller. As we grew and our services exploded, we began to realize we needed a simple, controlled, bulletproof way to get new code into production, and fast.
The plan we eventually came up with took a while to implement (1.5 years, to be precise), but we got there. It evolved over time, but in the end it comprised the following requirements and “wish-list” items:
- When we started a container, we did not want it to require other miscellaneous containers to run
- We wanted to dispose of the “container-to-instance” linkage
- We wanted to change our branching model from gitflow
- We needed to tag our containers with a unique identifier
- We needed to educate, and make a cultural shift
Allow me to elaborate a bit on each of these requirements and some basic steps we took to implement them.
“Self Contained” Containers
This might be somewhat controversial, but the way we saw it, containers should not require any additional peripheral or supporting containers to run. So, for example, logs and metrics should flow without needing another container in the pod to ship them. This does not apply, of course, to logical containers that are required to run a particular service.
This part of the plan led us to develop a series of appenders and handlers — Logback appender, Logzio Java Sender, Python handler, and a Java agent version of our jmx2graphite — that allowed us to start our containers and get logs and metrics directly.
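To make the idea concrete, here is a minimal sketch of what "shipping logs from inside the container" can look like in Python. It is not the actual Logz.io Python handler: the class name `DirectShippingHandler` and the `send` callable are hypothetical stand-ins for whatever transport the real handler uses.

```python
import json
import logging
from typing import Callable


class DirectShippingHandler(logging.Handler):
    """Hypothetical handler: forwards each log record to a sender callable.

    The point is only that the process ships its own logs, so no sidecar
    log-shipping container is needed next to it.
    """

    def __init__(self, send: Callable[[str], None]) -> None:
        super().__init__()
        self._send = send  # assumption: any callable that accepts one JSON string

    def emit(self, record: logging.LogRecord) -> None:
        try:
            payload = json.dumps(
                {
                    "logger": record.name,
                    "level": record.levelname,
                    "message": record.getMessage(),
                }
            )
            self._send(payload)
        except Exception:
            # Standard logging convention: never let a logging failure crash the app.
            self.handleError(record)
```

Attaching the handler with `logger.addHandler(DirectShippingHandler(some_sender))` is all the application has to do. A real sender would also batch and retry, which this sketch leaves out.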
Realizing the inevitable impact on scalability and elasticity, one of our initial goals was to remove the existing architectural linkage between a container and an EC2 instance.
As we began to grow and expand, we found ourselves manually managing a growing number of containers, each with its own scale and scaling rules. To support the expected growth and development velocity, orchestration was a necessity. We eventually made the decision to use Kubernetes for this purpose, and I explained why in a previous article.
Git Branching Model
When we deployed once per week, the GitFlow branching model made a lot of sense for us.
We created a release branch once per week, iterated on it in staging, and then reached the final release: merging to master, and deploying all components using scripts. This was the process for deploying new code. Nothing else was deployed into production until the following week, except hotfixes of course.
That needed to change. It made no sense to create a release branch multiple times a day, and to deploy just a subset of the components with the new code. We ended up deleting the develop branch and branching in and out of master directly, although this was the final stage that marked the completion of the process. Master is considered ready to deploy at all times.
Before most of our services were continuously deployed, we worked in a hybrid fashion — release branch for the weekly deployments, and branching in and out of master (and back to develop) for the continuously deployed services.
Container Tagging, CI!
When we had a stable master that did not undergo any changes during the week leading up to deployment, we added the branch name to the image name and used the latest tag. That way we could just set the branch name, pull, and restart the container to deploy.
However, in the “continuously changing master” era, more control over what went into production was required.
We decided that the best way to achieve that was to remove the branch from the image name, and then tag the image with the commit sha1 (example: `image:abcdef1234567890`). This way, we know what’s in production at any time, we can deploy specific commits and easily rollback to a specific point in time if necessary.
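The two tagging schemes can be compared side by side in a short sketch. The function names, the `repository-branch` naming format for the old scheme, and the accepted SHA length (7 to 40 hex characters) are assumptions made for illustration. The article only states that the branch was part of the image name with the `latest` tag, and that the new tag is the commit sha1.

```python
import re

# Assumption: accept abbreviated or full hex commit IDs (7 to 40 characters).
_COMMIT_RE = re.compile(r"[0-9a-f]{7,40}")


def image_ref_by_branch(repository: str, branch: str) -> str:
    """Old scheme: branch is part of the image name, tag is always 'latest'."""
    return f"{repository}-{branch}:latest"


def image_ref_by_commit(repository: str, commit_sha: str) -> str:
    """New scheme: no branch in the name, the tag is the commit SHA."""
    sha = commit_sha.strip().lower()
    if not _COMMIT_RE.fullmatch(sha):
        raise ValueError(f"not a commit SHA: {commit_sha!r}")
    return f"{repository}:{sha}"
```

With the first scheme, `latest` silently points at a different build after every push. With the second, each reference identifies exactly one commit, which is what makes deploying a specific commit or rolling back to it straightforward.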
Education and cultural shift
This is probably the most important part of the plan. The technological aspect is relatively easy, and at the end of the day, there are plenty of ways to get your code into production.
As an organization, there were some major cultural questions we needed to find the answers to.
Who has the power to deploy code into production? Can anyone deploy anything? What is the production team involvement in the process? Will developers remember to deploy code themselves? Will they be disciplined enough to test every change and rollback if needed?
Do we need to enforce passing through staging before production? Can production be deployed just from master? On specific hours? How do we make sure there is someone to support the system if all hell breaks loose? How can we make sure we can revert? And go forward with compatibility? And make DB changes?
You get the point.
That list goes on and on, and it required us to hold a lot of discussions with all R&D members and management. I won’t describe the whole process here, but I strongly recommend that any company embarking on a similar journey ask all of those questions, and more.
This is crucial for successfully implementing the process. Culture is EVERYTHING when discussing CD. Technology is just the means to an end.
We were still missing one piece of the Continuous Deployment puzzle — a simple, one-click deployment tool.
Sure, there are plenty of deployment tools out there. But none answered all of our requirements. The tool we were looking for had to be simple and had to work well with Jenkins, as we had no plans to replace it. It needed to feature a comprehensive permissions mechanism and be able to record a full deployment history. It also had to be pluggable so we could adjust it to our needs.
We spent a considerable amount of time and resources doing research but could not find a solution that suited us.
So we created our own – Apollo. And yes, it’s completely open source so you can use it if any of the above sounds familiar.
It supports one-click deployment to Kubernetes, can be easily integrated into an existing environment and, best of all, does not require developers to know or understand a single Kubernetes-related concept. All developers have to do is select the Kubernetes cluster (and namespace), the relevant component and the git commit. That’s it. All in one UI, without the need to worry about anything else.
Once we had the initial tool working, our appetite grew and we ended up adding some additional features to Apollo.
Here is a list of Apollo’s main capabilities:
- Simple one-click deployment to Kubernetes
- Extensive deployment permission model
- “Virtual” environments based on Kubernetes namespaces, and node port coefficients
- Revert running deployment
- LiveTail on pod logs
- Live environment status from Kubernetes, with pod actions
- “Exec” into a running container using web-UI shell
- Jolokia tunneling via Kube proxy to Java pods (and the integrated Hawt.io is just one click away)
- Full deployment history with a snapshot of the entire environments after each deployment
- “Groups” deployment from mustache templates and variables
- Blockers to block deployments based on numerous factors
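The last item, blockers, can be pictured as a set of independent checks that each either allow a deployment or return a reason to stop it. The sketch below is not Apollo's implementation. The `DeployRequest` type and the two example blockers (a time window and a frozen component) are hypothetical and chosen only to illustrate the pattern.

```python
from dataclasses import dataclass
from datetime import time
from typing import Callable, List, Optional


@dataclass(frozen=True)
class DeployRequest:
    component: str
    namespace: str
    at: time  # time of day at which the deployment is requested


# A blocker returns None to allow the deployment, or a reason string to block it.
Blocker = Callable[[DeployRequest], Optional[str]]


def outside_hours(start: time, end: time) -> Blocker:
    """Hypothetical blocker: block anything requested outside [start, end)."""
    def check(req: DeployRequest) -> Optional[str]:
        if start <= req.at < end:
            return None
        return f"deployments are allowed only between {start} and {end}"
    return check


def frozen_component(name: str) -> Blocker:
    """Hypothetical blocker: block every deployment of one component."""
    def check(req: DeployRequest) -> Optional[str]:
        return f"component {name} is frozen" if req.component == name else None
    return check


def blocking_reasons(req: DeployRequest, blockers: List[Blocker]) -> List[str]:
    """Run every blocker and collect the reasons of those that object."""
    return [reason for reason in (b(req) for b in blockers) if reason is not None]
```

A deployment proceeds only when `blocking_reasons` returns an empty list. Collecting all reasons, rather than stopping at the first, lets the UI show the developer everything that stands in the way.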
How it works
The basic way Apollo operates is as follows:
Developer pushes code to GitHub -> Jenkins builds it and publishes the containers to an internal Docker registry -> Jenkins notifies Apollo about a new “Deployable version” for a component -> Developer deploys in Apollo.
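The flow above can be modeled in a few lines. This is a toy in-memory sketch, not Apollo's API or data model: `DeploymentTool`, `notify_deployable_version` and `deploy` are hypothetical names, and a real tool would persist this state and call Kubernetes instead of only recording the request.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class DeployableVersion:
    component: str
    commit_sha: str
    image: str  # image reference that the CI server pushed to the registry


@dataclass(frozen=True)
class Deployment:
    user: str
    cluster: str
    namespace: str
    version: DeployableVersion


@dataclass
class DeploymentTool:
    versions: Dict[Tuple[str, str], DeployableVersion] = field(default_factory=dict)
    history: List[Deployment] = field(default_factory=list)

    def notify_deployable_version(
        self, component: str, commit_sha: str, image: str
    ) -> DeployableVersion:
        """Step 3: the CI server reports a freshly built and published version."""
        version = DeployableVersion(component, commit_sha, image)
        self.versions[(component, commit_sha)] = version
        return version

    def deploy(
        self, user: str, cluster: str, namespace: str, component: str, commit_sha: str
    ) -> Deployment:
        """Step 4: a developer picks cluster, namespace, component and commit."""
        key = (component, commit_sha)
        if key not in self.versions:
            raise LookupError(f"no deployable version for {component}@{commit_sha}")
        deployment = Deployment(user, cluster, namespace, self.versions[key])
        self.history.append(deployment)  # every deployment is kept for the history view
        return deployment
```

Only versions that the CI server has announced can be deployed, so the developer never has to know how the image was built or where it lives.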
Most of Apollo is designed in a very pluggable way, and adding new capabilities should be really easy and straightforward. The Apollo backend is written in Java, and the Web-UI using AngularJS.
The State of CD at Logz.io
We have been using Apollo in production at Logz.io for the past 10 months. We have executed thousands of deployments using Apollo, almost half of them to production (the record was almost 100 deployments in a single day).
And the best thing of all? The production team is completely out of the loop. Developers own their code from A to Z: from the second they write it, through build, staging, testing and production deployment, to its ongoing support.
Summing it up
It was not an easy journey, to say the least. But at the end of the day, moving to CD with Apollo resulted in an extremely dynamic and versatile R&D organization.
Many meetings were held and many decisions had to be made, some tougher than others. Not everyone was on board at first, and not everyone believed in the process. I don’t think anyone realized how big a change this was in the way we develop and interact with production.
Today, however, we have little doubt. We believe in the way we chose, and in the Apollo “state of mind”. This is a product we are fully invested in and are still actively developing. We are super-excited to open source it, and will be thrilled to accept contributions from the community!