Just a few days before he died at the beginning of the 1990s, a wise man taught us that “the show must go on.” Freddie Mercury’s parting words have long provided the guiding light for many, if not all, ops teams. In their eyes, the production environment should be exposed to minimum risk, even at the expense of new features and problem resolution.
About 10 years ago, Google decided to change its approach to production management. It took Google only a few years to realize that, while R&D focused on creating new features and pushing them to production, the operations group was trying to keep production as stable as possible: the two teams were pulling in opposite directions. This tension arose from the groups’ different backgrounds, skill sets, and incentives, and from the metrics by which they were measured.
Trying to bridge this gap between the two groups, one of Google’s ops leaders—Ben Treynor—thought of an innovative solution. Instead of having an ops team built solely from system administrators, software engineers—with an R&D background and mentality—could enrich the way the team worked with the development group, change its goals, and help with automating solutions.
And so, the Site Reliability Engineering position was created. According to Google, SREs are responsible for the stability of the production environment, but at the same time are committed to new features and operational improvement. Google decided its SRE teams should be composed of 50% software engineers and 50% system administrators. The engineers were driven to use software to solve problems and to automate work that had historically been done by hand. They integrated easily with the development team, and encouraged code quality improvements and automated testing.
Understanding the value of SRE, several organizations of various sizes decided to embrace its principles. Some, such as Dropbox, Netflix, and GitHub, are well known for being at the forefront of technology leadership.
Wait, Isn’t That DevOps?
DevOps is a more recent movement, designed to help organizations’ IT departments move in agile and performant ways. It builds a healthy working relationship between the operations staff and dev team, allowing each to see how their work influences and affects the other. By combining knowledge and effort, DevOps should produce a more robust, reliable, agile product.
Both SRE and DevOps are methodologies addressing organizations’ needs for production operation management. But the differences between the two doctrines are quite significant: while DevOps teams raise problems and dispatch them to Dev to solve, the SRE approach is to find problems and solve some of them directly. While DevOps teams would usually choose the more conservative approach, leaving the production environment untouched unless absolutely necessary, SREs are more confident in their ability to maintain a stable production environment and push for rapid changes and software updates. Like DevOps teams, SREs also thrive on a stable production environment, but one of the SRE team’s goals is to improve performance and operational efficiency.
Google tried a few approaches to implementing the SRE process before finding the one that suited them best. One of these approaches attempted to tie the number of permitted releases to the product’s stability. The principle underlying this process is that new releases are green-lighted based on current product performance.
For each service, the SRE team sets a service-level agreement (SLA) that defines how reliable the system needs to be to end users. If the team agrees on a 99.9% SLA, that gives them an error budget of 0.1%. An error budget is exactly what the name suggests: the maximum allowable threshold for errors and outages. Here’s the interesting thing: the development team can “spend” this error budget in any way they like. If the product is currently running flawlessly, with few or no errors, they can launch whatever they want, whenever they want. Conversely, if they have met or exceeded the error budget, and are operating at or below the defined SLA, all new releases are frozen until they reduce the number of errors to a level that allows the launch to proceed. This process ensures that both the SREs and developers have a strong incentive to minimize the number of errors in production.
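The error budget arithmetic above can be sketched in a few lines of Python. This is only a minimal illustration, not Google’s actual tooling: it assumes errors are counted as failed requests within a measurement window, and the function names (error_budget, allowed_failures, releases_allowed) are hypothetical. Exact fractions are used so that “met the budget” is not blurred by floating-point rounding.

```python
from fractions import Fraction


def error_budget(sla_percent: str) -> Fraction:
    """Return the error budget as a fraction, e.g. "99.9" -> 1/1000."""
    return 1 - Fraction(sla_percent) / 100


def allowed_failures(sla_percent: str, total_requests: int) -> Fraction:
    """Number of failed requests the budget tolerates in the window."""
    return error_budget(sla_percent) * total_requests


def releases_allowed(sla_percent: str, total_requests: int,
                     failed_requests: int) -> bool:
    """Releases are frozen once the budget is met or exceeded."""
    return failed_requests < allowed_failures(sla_percent, total_requests)


if __name__ == "__main__":
    # Assumed example window: one million requests against a 99.9% target.
    print(releases_allowed("99.9", 1_000_000, 400))   # budget not yet spent
    print(releases_allowed("99.9", 1_000_000, 1000))  # budget exactly met
```

The same budget can also be read as time instead of requests: with a 99.9% target, a 30-day window (43,200 minutes) allows 43.2 minutes of downtime.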
Another interesting approach recommended by Treynor, which is more related to SRE professionalism and efficiency, is allowing SREs to move between projects. Moreover, he suggests allowing SREs to move to development, and developers to move to SRE. As the work done by both teams is similar, these transitions help the ops team gain better and deeper knowledge of the product and code, and bring the Dev teams into the production space to help them understand its challenges. This strongly promotes a team atmosphere, rather than one in which an individual feels that “I’m on the SRE team for this product.” As part of this approach, Treynor had Dev teams handle 5% of the operations workload. This, according to many organizations, adds to the SRE team’s motivation and effectiveness.
Riddles in the Dark
Well, SRE seems perfect. As already stated, several large-scale organizations have chosen to move some of their production operations from old ops to SRE. However, there are still a few questions that need to be asked:
Is the production/ops position no longer relevant for SysAdmins?
Historically, almost all system administrators have come into their roles through tech support and similar work, or even just by running Linux on their desktops and then transitioning into server work. It should be pretty clear that the same path is not available into SRE. In order to retain their positions, SysAdmins should now be more code-oriented, have better technological knowledge, and be receptive to new methods of conducting the work they already do.
Can an SRE team prevent production incidents?
One of the strengths of a professional site reliability engineer is the ability to handle the growth of production load and traffic in-house. Monitoring and analyzing processes and logs with platforms such as the ELK Stack is part of the day-to-day workload, and the team should be able to identify problems as they occur, and even foresee risks to software stability. The power of this position—based on the skillset it requires—lies in developing solutions for these problems and risks. As Ciara, a software engineer in Google’s cloud storage SRE team, described it in a Life in Google post, “we solve cooler problems in cooler ways.”
Are we doing DevOps or are we doing SRE?
This question is usually asked by people trying to position themselves in the Operations world. According to many companies that implemented SRE in a slightly different way than Google, you don’t have to decide. At Reddit, ops engineers work on reducing toil, improving deployment, and scaling processes, but they are referred to as “DevOps.”
So, is it Just a Question of Context?
In her blog, Charity Majors talks about SRE and DevOps being two different operational approaches that any organization can choose to work by, but insists on emphasizing that there is no “correct” approach. Although Google and Dropbox have decided on SRE, this does not mean the rest of the world should do so as well. What fits Google’s needs and organizational philosophy does not necessarily work for other orgs—at any scale. Moreover, Charity believes that DevOps, having grown and evolved within a broader variety of software organizations, is the more flexible, collaborative, and adaptable approach, and will work better for most software organizations at all stages of development.
On the other hand, SoundCloud senior engineer Matthias Rampke issued a series of tweets on how SRE and DevOps were basically the same, with only one difference: a high level of management support. Although this contradicts her blog, Charity Majors lent weight to this opinion during an AskMeAnything event that revolved around DevOps and SRE. She shared a story about a friend who was hiring for a startup, which, as an experiment, posted the exact same job description twice, the only difference being that one listing was titled “DevOps engineer”, and the other “SRE.” At the time of relating the story, DevOps was winning by ten or twenty percent. However, all this means is that the job title is irrelevant, and calling a team “SRE” does not magically bestow qualities upon it. Rather than focusing on the title, organizations must focus on the work being done.
To summarize, we can quote Matt Simmons, a technologist with many insights into this topic, who says that “Not every infrastructure needs an SRE, but every infrastructure could use an administrator who acted more like one.”