SRE for beginners

Andrews Azevedo dos Reis

Jul 19, 2024

This article provides an introductory look at Site Reliability Engineering (SRE): what it is, why it matters, and the core practices that keep systems reliable at scale.

In today's fast-paced digital landscape, system reliability is not just an expectation but a necessity. Site Reliability Engineering (SRE) blurs the lines between software engineering and systems operations to ensure reliable, highly scalable software systems.

This article will dive into concepts, core practices, and SRE's critical contribution to modern technology environments.

What is SRE?

SRE stands for Site Reliability Engineering. Originating at Google in the early 2000s, it has evolved into a fundamental discipline for organisations that aim to build and maintain robust services at scale.

SRE is not a set of operational tasks; it's a culture.

It's a philosophy that integrates aspects of software engineering into IT operations to create highly reliable and scalable systems.

This culture is for organisations and teams committed to improving service reliability, efficiency, and customer satisfaction through proactive problem-solving and automation.

At its heart, SRE is about applying a software engineering mindset to system administration topics.

The goal? Keeping the customer happy with your system.

How? Creating automated solutions for operational aspects, such as deployments, monitoring, and infrastructure management, ensuring system reliability and efficiency.

Who is a Site Reliability Engineer? An SRE is a software engineer who embodies and applies this SRE culture. They are not defined merely by their tasks but by their approach to solving operational problems with software solutions.

Site Reliability Engineers leverage their coding and system design expertise to automate operational processes. They design systems that are inherently reliable and work closely with development teams to ensure high service standards.

Why is SRE Important?

In an era where downtime can significantly impact customer satisfaction and revenue, SRE offers a proactive approach to preventing and mitigating issues before they affect users.

Here are several reasons why SRE is crucial:

Enhanced Reliability

By focusing on continuous improvement and automation, SRE increases the reliability of services, ensuring they remain available and performant.

Better Risk Management

Error budgets and SLOs provide clear metrics for balancing the pace of innovation with the need for stability, helping manage risk effectively.

Improved Efficiency

Automating operational tasks reduces toil and errors, allowing teams to focus on more value-added activities.

Stronger Collaboration

SRE fosters a culture of shared responsibility between development and operations teams, bridging gaps and aligning goals across functions.

The Core Practices of SRE

Service Level Indicators (SLIs)

SLIs are precise metrics used to measure the performance and health of a service. Common examples include:

latency: the time it takes to respond to a request
error rates: the percentage of requests that fail
uptime: the proportion of time a service is available.

For instance, an SLI could be "a web service's average response time measured in milliseconds".
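To make the indicators listed above concrete, here is a minimal Python sketch that computes two of them from a batch of request records. The `Request` type, its field names, and the sample values are assumptions made for illustration; a real service would read these numbers from its monitoring system.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # time taken to respond, in milliseconds
    ok: bool           # False if the request failed


def average_latency_ms(requests):
    """Latency SLI: mean response time in milliseconds."""
    if not requests:
        raise ValueError("no requests to measure")
    return sum(r.latency_ms for r in requests) / len(requests)


def error_rate(requests):
    """Error-rate SLI: fraction of requests that failed (0.0 to 1.0)."""
    if not requests:
        raise ValueError("no requests to measure")
    return sum(1 for r in requests if not r.ok) / len(requests)


# Hypothetical sample data.
sample = [
    Request(120.0, True),
    Request(80.0, True),
    Request(400.0, False),
    Request(200.0, True),
]
print(average_latency_ms(sample))  # 200.0
print(error_rate(sample))          # 0.25
```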

When defining SLIs, here's a simple test of how good an indicator is: if failing to meet it means losing users, you know it's a crucial one.

Service Level Objectives (SLOs)

SLOs are the targets set for SLIs, defining the level of service performance deemed acceptable and sustainable. They bridge technical performance with business objectives, ensuring customer satisfaction.

As such, SLOs need to be agreed on by both business stakeholders and engineering, and reviewed regularly to confirm they are still valid as business priorities, traffic peaks and valleys, and seasons change.

An example of an SLO could be "99.9% of all web service requests should have a latency of less than 300ms over any given month".
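The example SLO above can be checked mechanically. The sketch below is one possible way to do it in Python, assuming the latencies for the month are already available as a list; the function name and the choice to treat an empty month as compliant are assumptions.

```python
def latency_slo_met(latencies_ms, threshold_ms=300.0, target=0.999):
    """Return True if at least `target` of requests finished under `threshold_ms`."""
    if not latencies_ms:
        return True  # assumption: no traffic means no violation
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms) >= target


# Hypothetical month: 10,000 requests, 5 of them slow (99.95% fast).
good_month = [120.0] * 9995 + [850.0] * 5
print(latency_slo_met(good_month))  # True

# Hypothetical month: 20 slow requests out of 10,000 (99.8% fast).
bad_month = [120.0] * 9980 + [850.0] * 20
print(latency_slo_met(bad_month))  # False
```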

Service Level Agreements (SLAs)

While SLIs and SLOs are internal metrics and objectives that guide reliability efforts, SLAs are the promises made to customers.

SLAs take the concept of SLOs further by formalising them into agreements that include consequences for not meeting the agreed-upon standards.

For instance, an SLA might stipulate that "if a service's uptime falls below 99.5% in any given month, the company will credit its customers a certain percentage of their monthly fee".

SLAs are crucial because they align business objectives with SRE practices, ensuring that customer satisfaction and service reliability are kept at the forefront of operations.
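The credit clause in the example SLA can be sketched as a small function. The 99.5% threshold comes from the example above; the 10% credit and the fee amounts are hypothetical values chosen only for illustration, since real SLAs define their own terms.

```python
def sla_credit(uptime_pct, monthly_fee, threshold_pct=99.5, credit_pct=10.0):
    """Credit owed to a customer for one month under a simple SLA.

    credit_pct is a hypothetical value; the contract decides the real one.
    """
    if uptime_pct >= threshold_pct:
        return 0.0  # SLA met, nothing owed
    return monthly_fee * credit_pct / 100.0


print(sla_credit(99.7, 200.0))  # 0.0
print(sla_credit(99.2, 200.0))  # 20.0
```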

Error Budgets

Error Budgets bridge the gap between the aspirations of SLOs and SLA commitments.

They quantify the allowable margin of error, meaning how much service failure is acceptable before it impacts customer satisfaction or violates SLAs.

Essentially, they translate the high-level objectives of SLOs into operational thresholds, ensuring that while aiming for excellence, there's a realistic allowance for imperfection. When an Error Budget is consumed, it signals a need to improve reliability to uphold the service's SLA promises.

An example of an Error Budget could be: "With an uptime SLO of 99.9%, the error budget allows for about 43 minutes of downtime per month".

In short, an Error Budget is an objective measure of risk related to a given SLO.
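The 43-minute figure quoted above follows from simple arithmetic. Here is a minimal Python sketch, assuming a 30-day month; the function names are illustrative and do not come from any particular tool.

```python
def error_budget_minutes(slo_pct, days=30):
    """Minutes of downtime allowed by an uptime SLO over a window of `days`."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - slo_pct) / 100.0


def budget_remaining_minutes(slo_pct, downtime_minutes, days=30):
    """Budget left after the downtime already observed; negative means overspent."""
    return error_budget_minutes(slo_pct, days) - downtime_minutes


print(round(error_budget_minutes(99.9), 1))            # 43.2
print(round(budget_remaining_minutes(99.9, 30.0), 1))  # 13.2
```

A negative result from `budget_remaining_minutes` would be the signal, described above, that reliability work should take priority.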

Automation and the Quest to Eliminate Toil

At the heart of SRE practices lies a critical mission: to eradicate toil.

If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow.

– Carla Geisser, Google SRE

Toil represents the manual, repetitive, and automatable tasks that scale linearly with service growth, offering no lasting benefit to the service itself.

This is where automation becomes a game-changer. By automating these toil-heavy tasks, SREs ensure that operational work — though necessary — doesn't consume more than 50% of their time.

This strategy not only streamlines operational work but also ensures that SREs can dedicate more time to engineering projects that enhance service features or reduce future toil, aligning with the core goal of SRE: to innovate rather than operate.

In doing so, the essence of Site Reliability Engineering is preserved, emphasising engineering solutions over operational slog and paving the way for a toil-minimised future.
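The 50% ceiling on operational work mentioned above can be tracked with very little code. This Python sketch assumes a team logs hours per activity; the category names and hours are made up for the example.

```python
def toil_share(hours_by_category, toil_categories):
    """Fraction of logged hours spent on categories considered toil."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    toil = sum(h for c, h in hours_by_category.items() if c in toil_categories)
    return toil / total


# Hypothetical week of logged hours for one engineer.
week = {
    "manual deploys": 10,
    "ticket triage": 8,
    "automation project": 16,
    "design review": 6,
}
share = toil_share(week, {"manual deploys", "ticket triage"})
print(f"{share:.0%}")  # 45%
print(share <= 0.5)    # True
```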

Observability

Observability is a pillar of SRE practice. It provides the visibility needed to understand a system's internal state through its external outputs.

Through observability, SREs can preemptively identify and address issues before they escalate into user-impacting problems. This practice encompasses collecting, analysing, and acting on data from logs, metrics, and traces to ensure the reliability and performance of services.
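As a small illustration of the data this practice relies on, the Python sketch below emits a structured log line that carries a metric (latency) and a trace identifier, so that logs, metrics, and traces can later be correlated. The field names and the `log_event` helper are assumptions for this example, not the format of any specific observability tool.

```python
import json
import time
import uuid


def log_event(service, message, trace_id=None, **fields):
    """Print and return one JSON log line with a timestamp and a trace ID."""
    record = {
        "ts": time.time(),
        "service": service,
        "trace_id": trace_id or uuid.uuid4().hex,  # new ID if none was passed in
        "message": message,
        **fields,
    }
    line = json.dumps(record)
    print(line)
    return line


# Hypothetical usage: the same trace_id ties two events to one request.
log_event("checkout", "request started", trace_id="abc123")
log_event("checkout", "request finished", trace_id="abc123", latency_ms=182, status=200)
```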

Tools like Datadog exemplify the power of observability in action, offering a comprehensive platform for monitoring, troubleshooting, and optimising applications. With Datadog, teams can aggregate and visualise data across various sources, making it easier to detect anomalies, understand system behaviours, and make data-driven decisions.

By integrating observability tools into their workflow, SREs equip themselves with the necessary insights to maintain system health and uphold the highest service reliability standards.

Incident Management and Post-Mortems

SREs approach incidents with a dual focus: immediate resolution and future prevention.

Following an incident, the team conducts a post-mortem analysis to uncover the root cause, document lessons learned, and implement measures to prevent recurrence, all within a blameless culture that encourages transparency and improvement.

Like most team sports, incident management involves both tactics and strategy. Once an emergency is detected, damage control is tactical work, and it needs a playbook in place to keep the time to restore (TTR) short. For the strategic work, read on.
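Time to restore is straightforward to measure once detection and restoration timestamps are recorded. The following Python sketch averages it across incidents; the incident list is hypothetical sample data.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents):
    """Average of (restored - detected) across (detected, restored) pairs."""
    if not incidents:
        raise ValueError("no incidents recorded")
    durations = [restored - detected for detected, restored in incidents]
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incidents: (detected, restored).
incidents = [
    (datetime(2024, 7, 1, 10, 0), datetime(2024, 7, 1, 10, 30)),
    (datetime(2024, 7, 9, 22, 15), datetime(2024, 7, 9, 23, 5)),
    (datetime(2024, 7, 15, 3, 0), datetime(2024, 7, 15, 3, 10)),
]
print(mean_time_to_restore(incidents))  # 0:30:00
```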

Capacity Planning

This practice involves forecasting system demand and ensuring adequate resources are available to meet this demand, even as it grows or spikes unpredictably.

Effective capacity planning maintains system reliability and performance under diverse conditions, preventing saturation and performance degradation.
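A very simple form of demand forecasting is to fit a straight line to past peaks and see when the trend reaches current capacity. The Python sketch below does this with a least-squares fit; the linear-growth assumption, the function name, and the traffic numbers are all illustrative, and real demand is often seasonal or spiky.

```python
def months_until_saturation(monthly_peak_rps, capacity_rps):
    """Months after the last sample until a linear trend reaches capacity.

    Returns None when the trend is flat or falling.
    """
    n = len(monthly_peak_rps)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(monthly_peak_rps) / n
    # Ordinary least-squares slope and intercept.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, monthly_peak_rps)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing = (capacity_rps - intercept) / slope
    return max(0.0, crossing - (n - 1))


# Hypothetical peaks in requests per second for the last four months.
print(months_until_saturation([1000, 1100, 1200, 1300], 2000))  # 7.0
```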

Effective capacity planning in SRE extends beyond mere computational resource management; it equally prioritises ensuring human responders and stakeholders are prepared and available to take appropriate actions when necessary.

This dual focus ensures systems have the computational horsepower to handle demand and that teams are structured and ready to respond to issues quickly and efficiently.

By forecasting future demands and potential bottlenecks, SREs can proactively scale up resources or adjust strategies to mitigate risks. At the same time, by planning for human response capacity (on-call scheduling, escalation paths, and stakeholder alignment), SREs help ensure that the system's reliability is maintained through both technological and human resilience.

This comprehensive approach to capacity planning is crucial for sustaining service performance and reliability, especially during unexpected demand spikes or critical incidents.

Conclusion

Embracing the principles of Site Reliability Engineering is more than just a commitment to maintaining systems; it's a pledge to ensure that those systems evolve and improve, keeping up with your business's needs and customers' expectations.

SRE is not just about keeping systems up and running; it's about making them resilient to failure, adaptable to growth, and responsive to the demands of innovation. Instead of pursuing perfection in software systems, it acknowledges that variability and unpredictability are the real-world constants.

From setting clear expectations with SLIs, SLOs, and SLAs to embracing the crucial role of automation and observability, SRE practices offer a roadmap for creating systems that are not just robust but genuinely resilient.

Moreover, the focus on eliminating toil and enhancing capacity planning underscores the commitment to maintaining and continuously improving the quality of services.

As we conclude our journey through the world of SRE, you're no longer an outsider to these acronyms. With this newfound knowledge, you can now look beyond the jargon and understand how SRE functions as the backbone of a robust, scalable, and efficient digital infrastructure.

So, the next time you encounter SLOs, SLIs, SLAs, or any other SRE acronym, you'll recognise them and appreciate their crucial roles in Site Reliability Engineering.

This article was written by Andrews Azevedo dos Reis for the WAES Medium blog.
