Site Reliability Engineering (SRE) and DevOps are often perceived as two different approaches to product development and IT operations.
The bottom line is that SRE and DevOps share the same foundational principles. The goal of both is to enable your business to be agile and move faster.
The underlying objectives are the same as well: scaling, automating, and bridging the gap between operations and development.
I do think that SRE provides some really good guidelines on how to achieve the overall goal. For monitoring, there are service level indicators and service level objectives. For automation, you specify a minimum amount of time to be spent on automating. SREs share responsibilities with the development teams, and so on.
This makes SRE easy to approach and implement in your organization.
But let’s drill down and see what the two “approaches” have to offer.
What is DevOps?
Back in the day, DevOps was created to close the gap between operations and development. The development department created products and handed them to the operations team, which deployed them in production.
In a microservice architecture world running at scale, it is not uncommon to have 500 or even 1000+ services running. This is without a doubt a challenge for any operations team.
But often the operations team takes the approach of saying “We deploy containers and servers, but we don’t care about what’s running on them”.
I absolutely understand the approach; it simplifies the world for the operations team. But it’s almost guaranteed to create tension and frustration between the two departments.
Whose responsibility is it when an incident or an alert is triggered? Is the incident caused by something in the infrastructure, or by the application?
5 DevOps pillars
- Reduce organizational silos
- Accept failure as normal
- Implement gradual changes
- Leverage tooling and automation
- Measure everything
Reduce organizational silos
Tearing down silos is the most important part of DevOps. It is all about getting out there and sharing what your DevOps team is working on. It means advertising your team’s newest achievements and involving the rest of your organization.
Conversations need to be out in the open, and information should be available and accessible to everyone.
Accept failure as normal
The important part is not to panic when something fails. Failure is normal, and it is not dangerous. It’s another opportunity to harden your infrastructure and make it even better.
Implement gradual changes
This is not specific to DevOps; it is a general programming best practice.
Leverage tooling and automation
Tooling and automation play a central role in DevOps.
This means automating pipelines and processes such as continuous delivery, system monitoring, alerting, orchestration, and release management.
Automation is in many ways really rewarding, both from a personal and financial perspective.
When you automate a task you have executed
Common DevOps automation tools include Chef, Ansible, Jenkins, Terraform, and Python; the list goes on.
Measure everything
It’s impossible to improve anything without measuring.
You need to know if the changes you are making are moving the business in the right direction.
DevOps is data-driven: decisions have to be based on facts and supported by your data, not by opinions or gut feelings. So measure every single aspect of your infrastructure, pipelines, and processes.
Measuring everything means you should monitor not just your customer-facing services but also your internal ones.
Measure what your team is spending time on so you can identify your ops work. Then you know exactly where to focus next time you get an opportunity to automate a task.
Your measurement effort needs to be reflected throughout your organization, and your data should be transparent.
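As a rough illustration of measuring where your team's time goes, the sketch below sums the hours of manual ops work per task and ranks them, so the biggest automation candidates come out on top. The time log, its task names, and the `rank_ops_work` helper are all made up for this example; in practice the data would come from whatever ticketing or time-tracking system you already use.

```python
from collections import defaultdict

# Hypothetical time log: (task, hours spent, True if it was manual ops work)
time_log = [
    ("rotate TLS certificates", 3.0, True),
    ("restart stuck workers", 5.5, True),
    ("build deployment pipeline", 8.0, False),
    ("restart stuck workers", 2.5, True),
    ("rotate TLS certificates", 1.0, True),
]


def rank_ops_work(entries):
    """Sum the hours of manual ops work per task, largest total first."""
    totals = defaultdict(float)
    for task, hours, is_ops in entries:
        if is_ops:
            totals[task] += hours
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = rank_ops_work(time_log)
for task, hours in ranking:
    print(f"{task}: {hours:.1f} h")
```

The task at the top of the ranking is the one that costs the most manual time, which makes it a reasonable first candidate the next time you get an opportunity to automate.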
DevOps CAMS model
CAMS stands for Culture, Automation, Measurement, Sharing.
CAMS was coined by Damon Edwards and John Willis at DevOpsDays Mountain View 2010.
The model is seeing good adoption in many DevOps teams, as it fits very well with the DevOps pillars.
What is SRE?
Site Reliability Engineering was created at Google by Ben Treynor around 2003.
Since then, Site Reliability Engineering, or SRE, has evolved and been adopted by a wide range of companies.
SREs collaborate closely with product developers; they have shared ownership and shared responsibilities. Tools used by SREs are shared with developers so they can perform the same tasks as SREs, and this goes the other way too.
This ensures that the designed solutions respond to non-functional requirements such as availability, performance, security, and maintainability. They also work with release engineers to ensure that the software delivery pipeline is as efficient as possible.
SREs can code too. SRE is what happens when you have developers solving an operational task; this is one of the bigger differences between DevOps and SRE.
SRE is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of product services.
That’s a lot of responsibility. To help grasp all these areas, SRE, just like DevOps, has some foundational pillars.
5 SRE pillars
- Embracing Risk
- Service Level Objectives
- Eliminating Toil
Embracing Risk
Embracing risk means accepting that failures will happen, whether they are human failures or computers failing.
Accept that this is something that happens occasionally. Once this is fully accepted, teams are not afraid to move faster. They learn how to recover when changes break production, and recovery becomes a natural process.
It is important to learn from failures, so all outages need to go through a blameless post mortem. In the post
Service Level Objectives
This is a very concrete way of addressing risk, innovation, and feature development.
A service level objective (SLO) is a quantitative target for a given service that reflects its reliability.
It is an excellent approach to keeping technical debt at a reasonable level. A service has one or more SLOs; in most cases, it has only one.
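To make the quantitative part concrete, here is a small worked example. It assumes an availability SLO measured over a 30-day window; the availability framing, the window length, and the function name are assumptions chosen for illustration. The sketch converts an SLO percentage into the number of minutes a service may be unavailable while still meeting the objective.

```python
def allowed_downtime_minutes(slo_percent, window_days=30):
    """Minutes of unavailability the window tolerates while still meeting the SLO."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% -> {allowed_downtime_minutes(slo):.2f} min per 30 days")
```

Under these assumptions, a 99.9% objective leaves roughly 43 minutes of downtime per 30 days, which gives developers and SREs a shared number to reason about when weighing risk against feature work.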
Eliminating Toil
Toil, also known as ops work, is repetitive work that can and should be automated.
SREs have a fixed maximum amount of time they are allowed to spend on toil; the rest has to go to automation.
This means that you, as an SRE, are allowed to push back if the amount of toil becomes too big.
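A minimal sketch of that push-back rule, assuming the cap is 50% of working time (borrowed from the rule of thumb mentioned in this article; your organization may pick a different number). The function names and the hour figures are hypothetical.

```python
TOIL_CAP = 0.5  # assumed cap: at most half of working time goes to toil


def toil_fraction(toil_hours, total_hours):
    """Share of working time that went to toil."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def should_push_back(toil_hours, total_hours, cap=TOIL_CAP):
    """True when toil exceeds the agreed cap, so the SRE may push back."""
    return toil_fraction(toil_hours, total_hours) > cap


# Hypothetical week: 24 of 40 hours went to repetitive manual work.
print(should_push_back(24, 40))
```

With 24 of 40 hours spent on toil the fraction is 0.6, which is above the assumed cap, so this week would justify pushing back.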
As in DevOps, you should monitor all your systems so you can improve continuously.
The SRE approach to monitoring is SLIs and SLOs.
SLI stands for service level indicator. Common indicators are latency, throughput, and success rate.
A service can have one or more SLIs, but typically a service has 3-6 defined SLIs. The SLIs then feed the service’s SLO with data.
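The relationship between SLIs and SLOs can be sketched as follows. This is only an illustration: the `Request` record, the sample data, the 300 ms latency threshold, and the target values are all assumptions, and a real setup would read these numbers from your monitoring system instead of an in-memory list.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def success_rate(requests):
    """SLI: fraction of requests that succeeded (expects a non-empty list)."""
    return sum(r.ok for r in requests) / len(requests)


def fraction_within(requests, threshold_ms):
    """SLI: fraction of requests served within the latency threshold."""
    return sum(r.latency_ms <= threshold_ms for r in requests) / len(requests)


# Hypothetical sample of observed requests
requests = [
    Request(120, True),
    Request(80, True),
    Request(450, False),
    Request(95, True),
]

# Measured indicators (SLIs)
slis = {
    "success_rate": success_rate(requests),
    "latency_within_300ms": fraction_within(requests, 300),
}

# Assumed targets (SLOs) that the indicators are compared against
slo_targets = {"success_rate": 0.99, "latency_within_300ms": 0.95}

slo_met = {name: slis[name] >= target for name, target in slo_targets.items()}
print(slis)
print(slo_met)
```

The indicators are plain measurements; the objective is the threshold you hold them to. In this sample both indicators come out at 0.75, so neither assumed target is met.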
No one wants to do the same repetitive task over and over again, so in SRE there is a high focus on automation. A rule of thumb is that SREs should spend 50% or more of their time automating manual tasks.
If you want more insight into the day-to-day work of an SRE, I can recommend watching this hangout: https://www.youtube.com/watch?v=bwt6TZjefGM&feature=youtu.be
What are the similarities between DevOps and SRE?
The overall goal of both DevOps and SRE is the same. Comparing SRE with the DevOps pillars looks like this:
- Reduce organizational silos
  - SRE shares ownership with developers to create shared responsibility.
  - SREs use the same tools that developers use and vice versa.
- Accept failure as normal
  - SREs embrace risk.
  - SRE quantifies failure and availability in a prescriptive manner using SLIs and SLOs.
  - SRE mandates blameless post mortems.
- Implement gradual changes
  - SRE encourages developers and product owners to move quickly by reducing the cost of failure.
- Leverage tooling and automation
  - SREs have a high focus on automating toil away.
- Measure everything
  - SRE defines prescriptive ways to measure values.
  - SRE fundamentally believes that systems operation is a software problem.
Are there any differences then?
Every organization has its own slightly adjusted way of doing DevOps or SRE, and that’s how I think it should be.
Depending on where you are on your journey, you should adopt whatever makes sense for your company.
The primary goals are the same. Personally, I really like the SRE approach, as there is a good balance of development and automation.
My favorite part is the SLI and SLO approach; SLIs and SLOs are a really good way of measuring application performance.