
Site Reliability Engineering: How Google Runs Production Systems

Building and operating distributed systems is fundamental to large-scale production infrastructure, but doing so in a scalable, reliable, and efficient way requires a lot of good design, and trial and error. In this collection of essays and articles, key members of the Site Reliability Team at Google explain how the company has successfully navigated these deep waters over the past decade.

You'll learn how Google continuously monitors and deploys some of the largest software systems in the world, how its Site Reliability Engineering team learns and improves after outages, and how it balances risk-taking against reliability with error budgets.


First published April 16, 2016

1920 people are currently reading
7821 people want to read

About the author

Betsy Beyer

9 books · 34 followers
Betsy Beyer is a Technical Writer for Google in New York City specializing in Site Reliability Engineering. She has previously written documentation for Google’s Data Center and Hardware Operations Teams in Mountain View and across its globally distributed datacenters. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University. En route to her current career, Betsy studied International Relations and English Literature, and holds degrees from Stanford and Tulane.



Community Reviews

5 stars: 1,193 (41%)
4 stars: 1,177 (41%)
3 stars: 425 (14%)
2 stars: 56 (1%)
1 star: 14 (<1%)
Simon Eskildsen · 215 reviews · 1,137 followers
May 7, 2016
Much of the information from Google on running production systems effectively has been extremely important to how I have changed my thinking about the SRE role over the years. Finally, there's one piece that has all of what you previously had to look long and hard for across various talks, papers and abstracts: error budgets, the SRE role definition, scaling, etc. That said, this book suffers from a classic problem of having too many authors write independent chapters. Much is repeated, and each chapter stands too much on its own, building from first principles each time instead of leveraging the rest of the book. This makes the book much longer than it needs to be. Furthermore, it tries to be both technical and non-technical, which confuses the narrative, and it ends up not excelling at either. I would love to see two books: SRE the technical parts, and SRE the non-technical parts. Overall, this book is still a goldmine of information worthy of a 5/5, but it is exactly that: a goldmine that you'll have to put a fair amount of effort into dissecting to retrieve the most value from, because the book's structure doesn't hand it to you. That's why we land at a 3/5. When recommending this book to coworkers, which I will, it will be chapters from the book, not the book at large.
Mircea · 67 reviews · 13 followers
May 31, 2016
Boring as F. The main message is: oh look at us, we have super hard problems and like saying 99.999% a lot. And oh yeah... SREs are developers. We don't spend more than 50% on "toil" work. Pleeeease. The book has some interesting stories, and if you are good at reading between the lines you might learn something. Everything else is BS. Does every chapter need to start by telling us who edited it? I don't give a f. The book also seems to be the product of multiple individuals (a lot of them, actually) whose sole connection is that they wrote a chapter for this book. F the reader, F structure, F focusing on the core of the issue. Let's just dump a stream-of-consciousness kind of junk, and after that tell everyone how hard it is and how much we care about work-life balance. Again, boring, and in general you're going to waste your time reading this (unless you want to know what Borg, Chubby and Bigtable are).
Michael Scott · 772 reviews · 159 followers
April 23, 2016
Site Reliability Engineering, or Google's claim to fame re: technology and concepts developed more than a decade ago by the grid computing community, is a collection of essays on the design and operation of large-scale datacenters, with the goal of making them simultaneously scalable, robust, and efficient. Overall, despite (willing?) ignorance of the history of distributed systems and in particular (grid) datacenter technology, this is an excellent book that teaches us how Google thinks (or used to think, a few years back) about its datacenters. If you're interested in this topic, you have to read this book. Period.

Structure
The book is divided into four main parts, each comprised of several essays. Each essay is authored by what I assume is a Google engineer, and edited by one of Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. (I just hope that what I didn't like about the book can be attributed to the editors, because I really didn't like some stuff in here.)

In Part I, Introduction, the authors introduce Google's Site Reliability Engineering (SRE) approach to managing global-scale IT services running in datacenters spread across the entire world. (Truly impressive achievement, no doubt about it!) After a discussion about how SRE is different from DevOps (another hot term of the day), this part introduces the core elements and requirements of SRE, which include the traditional Service Level Objectives (SLOs) and Service Level Agreements (SLAs), management of changing services and requirements, demand forecasting and capacity, provisioning and allocation, etc. Through a simple service, Shakespeare, the authors introduce the core concepts of running a workflow, which is essentially a collection of IT tasks that have inter-dependencies, in the datacenter.

In Part II, Principles, the book focuses on operational and reliability risks, SLO and SLA management, the notion of toil (mundane work that scales linearly (why not super-linearly as well?!?!) with services, yet can be automated) and the need to eliminate it (through automation), how to monitor the complex system that is a datacenter, a process for automation as seen at Google, the notion of engineering releases, and, last, an essay on the need for simplicity. This rather disparate collection of notions is very useful, explained for the layman but still with enough technical content to be interesting even for the expert (practitioner or academic).

In Parts III and IV, Practices and Management, respectively, the book discusses a variety of topics, from time-series analysis for anomaly detection, to the practice and management of people on-call, to various ways to prevent and address incidents occurring in the datacenter, to postmortems and root-cause analysis that could help prevent future disasters, to testing for reliability (a notoriously difficult issue), to software engineering in the SRE team, to load balancing and overload management (resource management and scheduling 101), communication between SRE engineers, etc. etc. etc., until the predictable call for everyone to use SRE as early as possible and as often as possible. Overall, palatable material, but spread too thin and with too much overlap with prior related work of a decade ago, especially academic, and not much new insight.

What I liked

I especially liked Part II, which in my view is one of the best introductions to datacenter management available today to the students of this and related topics (e.g., applied distributed systems, cloud computing, grid computing, etc.)
Some of the topics addressed, such as risk and team practices, are rather new for many in the business. I liked the approach proposed in this book, which seemed to me above and beyond the current state-of-the-art.
Topics in reliability (correlated failures, root-cause analysis) and scheduling (overload management, load balancing, architectural issues, etc.) are currently open in both practice and academia, and this book emphasizes in my view the dearth of good solutions but for the simplest of problems.
Many of the issues related to automated monitoring and incident detection could lead in the future to better technology and much innovation, so I liked the prominence given to these topics in this book.


What I didn't like
I thoroughly disliked the statements claiming by omission that Google has invented most of the concepts presented in the book, which of course in the academic world would have been promptly sent to the reject pile. As an anecdote, consider the sentence "Ben Treynor Sloss, Google's VP for 24/7 Operations, originator of the term SRE, claims that reliability is the most fundamental feature of any product: a system isn't very useful if nobody can use it!" I'll skip the discussion about who is the originator of the term SRE, and focus on the meat of this statement. By omission, it makes the reader think that Google, through its Ben Treynor Sloss, is the first to understand the importance of reliability for datacenter-related systems. In fact, this has been long known in the grid computing community. I found in just a few minutes explicit references from Geoffrey Fox (in 2005, on page 317 of yet another grid computing anthology, "service considers reliable delivery to be more important than timely delivery") and Alexandru Iosup (in 2007, on page 5 of this presentation, and again in 2009, in this course, "In today's grids, reliability is more important than performance!"). Of course, this notion has been explored for the general case of services much earlier... anyone familiar with air and especially space flight? The list of concepts actually not invented at Google, but about which the book implies the contrary, goes on and on...

I also did not like some of the exaggerated claims of having found solutions for the general problems. Much remains to be done, as hiring at Google in these areas continues unabated. (There's also something called computer science, whose state-of-the-art indicates the same.)


Sebastian Gebski · 1,189 reviews · 1,339 followers
April 26, 2016
Very uneven. Exactly what you should expect of a book in which each chapter is a separate essay written by a separate group of people :) Chapters can be grouped into the following categories:

a) solid knowledge, not really fascinating, but useful, some Google inside stories
b) fairly solid knowledge, boring due to massive repetitions or being too general
c) exciting stuff that is useless for you, because you're not Google (but still, it's exciting ;>)
d) exciting stuff that you actually may use outside of Google, sometimes with neat warstories

Sadly, it's more b than a & more c than d. But that doesn't change my opinion that this book is actually worth reading: it's one of the few books on the topic, it's based on the actual engineering perspective of a very interesting company that operates at massive scale, and it's massively influenced by this organization's culture. Even typical software engineers (especially junior ones) should read it to learn that software delivery & maintenance is so much more than just simple development.

One last remark to conclude: sorry if I made a false impression, but this is NOT a technical book. It's far more about processes, communication, attitude & mindset than actual technology running under the hood.
Dimitrios Zorbas · 28 reviews · 10 followers
July 26, 2017
I have so many bookmarks in this book and consider it an invaluable read. While not every project / company needs to operate at Google scale, it helps streamline the process of defining SLOs / SLAs for the occasion and establishing communication channels and practices to achieve them.

It helped me wrap my head around concepts for which I used to rely on intuition.
I've shaped processes and created template documents (postmortem / launch coordination checklist)
for work based on this book.
Michael Koltsov · 110 reviews · 69 followers
March 3, 2017
I don't normally buy paper books; in the course of the last few years I've bought only one paper book even though I've read hundreds of books during that period. This book is the second one I've bought so far, which means a lot to me. Not to mention that Google provides it on the Internet free of charge.

For me, personally, this book is a basis on which a lot of my past assumptions can be argued to be viable solutions at the scale of Google. This book does not reveal any of Google's secrets (do they really have any secrets?), but it's a great start even if you don't need the scale of Google and just want to write robust and failure-resilient apps.

Technical solutions, dealing with user-facing issues, finding peers, on-call support, post-mortems, incident-tracking systems: this book has it all. Though, as the chapters were written by different people, some aspects are more emphasized than others. I wish some of the chapters had more gory production-based details than they do now.

My score is 5/5
James Stewart · 38 reviews · 6 followers
July 21, 2016
Loads of interesting ideas and thoughts, but a bit of a slog to get through.

The approach of having different members of the team write different sections probably worked really well for engaging everyone, but it made for quite a bit of repetition. It also ends up feeling like a few books rolled into one, with one on distributed systems design, another on SRE culture and practices, and maybe another on management.
alper · 207 reviews · 61 followers
October 14, 2024
"Left alone in production with a hopelessly tangled codebase. The drama of an SRE." Yes, if this book had been written in Turkey, that would have been its title...

I approached the book with an "Accelerate" mindset: the last step of the pipeline that flows toward production. But of course we don't treat SRE as merely the last step. SRE is an organizational unit that joins the process from much further back, and the earlier it gets involved, the healthier the whole structure runs. It doesn't have to take on the responsibility, but helping to define it is something too. :) The book is actually a nice account of how real DevOps lives in Google's SRE setup.

I recommend it to everyone working on the platform side, whether that's architecture, SRE, or DevOps...

I have to get this off my chest: these people have "postmortem reading clubs". What more could you want? :) They do postmortems, fine, everyone does, but they also write them up, and then they sit down and read them together so that everyone can learn from them. And so it isn't seen as drudgery but as an important step of the process, they have everyone write them, not just interns and juniors. (I would make the directors write them, by the way; they have nothing better to do anyway.) In general, if you handle all your processes with this kind of care, you will get very far. No surprise there, of course.

One criticism can be made, again looking at the book through an Accelerate lens: microservices are touched on toward the end, but only barely. So let me touch on them: in the microservices world, the more responsibility you can hand over to stream-aligned teams, the healthier your progress will be. Be enablers, friends; raise your teams' competence. Otherwise the cognitive load of this new world is too much for all of us, and at Google scale I can't even imagine it. The book is a 2016 edition, and frankly it would be wonderful if they added a few new chapters. But given the style I like, an approach that works through methodologies, I can't claim it is out of date.

Let's look at the chapters worth reading again and again:

Chapter 3 - Embracing Risk: the chapter that introduces SRE in a general sense.
Chapter 5 - Eliminating Toil: the system should be cleansed of toil on a regular basis, so that SRE teams are free to focus on more strategic work.
Chapter 15 - Postmortem Culture: Learning from Failure: this was handled nicely at the yellow site too. Greetings to them.

Chapters 20-27 are the ones that made me say "this is real engineering":

Chapter 20 - Load Balancing in the Datacenter: a nice explanation of weighted round robin.
Chapter 21 - Handling Overload: I liked you, "lame duck".
Chapter 23 - Managing Critical State: Distributed Consensus for Reliability: "Designing Data-Intensive Applications" covers these topics well; it is good here too.
Chapter 24 - Distributed Periodic Scheduling with Cron: Paxos is covered. Honestly, ZooKeeper does the job for me. 🙈🙈 Scale matters, of course; a pinpoint solution for them can be over-engineering for us.
Chapter 26 - Data Integrity: What You Read Is What You Wrote: I really liked the backup strategies. It broadened my horizons, and I have folded them into my own Accelerate processes.
Chapter 27 - Reliable Product Launches at Scale: canaries, feature flags...

Chapter 32 - The Evolving SRE Engagement Model: here we see once more how flexibly they structure the system. A good chapter. Every team's and every business's SRE needs can take a different shape, and structuring accordingly is the right thing to do. Otherwise you just add another link to the bureaucracies that clog up releases, as with the lessons learned from Launch Coordination Engineering. What was our goal again? To remove the friction from the pipeline flowing toward production and keep that flow running smoothly and reliably. Not to block it, and certainly not to add new bottlenecks.

Let me finish with this: I opened by asking how the book would have started if we had written it. Let's close with how theirs actually starts:

"Hope is not a strategy."
Alexander Yakushev · 49 reviews · 37 followers
May 5, 2018
This book is great on multiple levels. First of all, it packs great content: a detailed explanation of how and why Google internally established what we now call "the DevOps culture." The rationale, coupled with a hands-on implementation guide, provides incredible insight into creating and running an SRE team in your own company.
The text quality is top-notch; the book is written with clarity in mind and thoroughly edited.
I'd rate the content itself at four stars. But the book deserves the fifth star because it is a superb example of material that gives you a precise understanding of how a company (or one of its divisions) operates on the inside. Apparently, Google can afford to expose such secrets while not many other companies can, but we need more low-BS, to-the-point books like this to share and exchange the experience of running the most complex systems (that is, human organizations) efficiently.
Regis Hattori · 147 reviews · 12 followers
December 23, 2019
This book is divided into five parts: Introduction, Principles, Practices, Management, and Conclusions.

I see a lot of value in the first two parts for anyone involved in software development. It convinces us of the importance of the subject with very good arguments, no matter if you are a software engineer, a product manager or even a user. This part deserves 5 stars.

After some chapters of the Practices part, the conclusion I reached is that this part of the book may only be useful if you are facing a specific problem and are looking for some insights, not as an end-to-end read. Some examples are too specific to Google and similar companies; others don't have the same budget, skills, and prerequisites.

In general, 3 stars is fair, but I will rate it as 4 because I really liked the first 2 parts.
dantelk · 208 reviews · 20 followers
June 17, 2025
This one is hard to evaluate. Sometimes, especially in the first quarter, I was thinking "I will give this book only two stars." It was repetitive and monotonous, and focused on abstract or bureaucratic stuff that made me yawn a lot. The mid chapters are much more technical, and I took a lot of notes. The last sections were also not interesting for a software developer like me. Of course, when you smell the book, it is perfumed with a little bit of vanity. However, it also shows, from a high level, that the world's arguably most prestigious IT company executes its SRE business not very differently from many others.

Some concepts, such as Paxos and Raft, are already hard to understand, and they are explained not as well as in some other classic books on those topics (Designing Data-Intensive Applications).

I think that with a good editor, this book could have been delivered much better, in a more focused fashion, with fewer pages. I wouldn't say I hated the book, but there was much room to improve.

Being good engineers is something, and authoring a book is something else.

3-4 stars. I would recommend reading it. But don't worry about skipping chapters.
Romain · 910 reviews · 55 followers
July 30, 2021
This is the reference book in the field, the one that launched and gave its name (or so I believe) to the discipline of putting software engineering at the service of production and operations. Before that, there were developers in charge of designing applications, and the people we called administrators took care of deploying and supervising them in production. The problem with this model was that each side knew nothing (or almost nothing) about the other's work, and the result was at best chaotic and at worst gave rise to rather animated quarrels that quickly turned into trench warfare. It seems absurd, but that was the model, and it is this divide that the DevOps movement set out to close. It is hard to draw a clear line between the two movements, but I would say (this is only my point of view) that DevOps can be achieved by building mixed teams (developers plus administrators), whereas SREs have mixed skills (development and administration) and put one at the service of the other; Infrastructure as Code is the perfect example.

This book is a collection of articles by several authors describing various aspects of their discipline as practiced inside Google. For that reason, the book lacks a little consistency, and the articles are uneven in interest and format. Some are fundamental, like those devoted to monitoring, while others are worth more as accounts of experience. To understand a technology choice, it is more than interesting to understand the reasons that pushed Google to develop the Borg[1] tool, whose open-source descendant is none other than one of the most talked-about technologies of the moment, Kubernetes. Going back to the genesis of such a tool lets you understand its ultimate goal, being able to run code completely independently of the infrastructure, and that need seems perfectly legitimate when, like Google, you have a colossal number of servers that can sometimes be heterogeneous (I suspect that is at least the case across hardware generations). I found other articles less interesting, but that won't necessarily be the case for everyone; it depends on your experience, expectations, and interests. Not everyone is Google or has the same problems to solve, but it is always interesting to know how they were addressed, to try to understand the approach and perhaps, very modestly, draw inspiration from it. A reference book that you can (and should) read in small chunks, all the more so because it is available in full online; thank you, Google.

----

[1] Google’s Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines ([source](https://research.google/pubs/pub43438/)).

Also published on my blog.
Author · 2 books · 11 followers
December 25, 2017
This was a really hard read, in a bad sense. The first couple of dozen pages were really promising, but the book turned out to be an unnecessarily long, incredibly boring, repetitive and inconsistent gang bang of random blog posts and often trivial information. It has roughly 10% of valuable content, and would greatly benefit from being reduced to a 50-pager. In its current state it seems to have been a corporate collaborative ego trip, to show potential employees how cool Google SRE is, and how majestic their scale happens to be. After reading this book, I am absolutely sure I would never ever want to work for Google.
Chris · 45 reviews · 23 followers
December 7, 2016
There's a ton of great information here, and we refer to it regularly as we're trying to change the culture at work. I gave it a 4 instead of a 5 because it does suffer a little from the style – think collection of essays rather than a unified arc – but it's really worth reading even if it requires some care to transfer to more usual environments.
Mehdi Home · 51 reviews · 11 followers
August 20, 2023
A must-read for every software engineer whether interested in SRE and DevOps or not!
Bjoern Rochel · 398 reviews · 83 followers
August 27, 2019
A little disclaimer: my review here is more about the concepts and organizational parts than the pure technical aspects, mostly because I manage engineering teams nowadays and these areas are the more important ones for me. The book also contains a lot of technical information on how to implement SRE that I would highly recommend to interested software engineers.

One aspect I liked in particular about SRE is the Error Budget concept, Google's way to manage the age-old conflict between product and engineering over how to distribute development effort between non-functional requirements (and especially technical debt) on one side and new features on the other. The data-driven approach and the consequent depersonalization of this debate seem very sane and professional to me.

I also liked their emphasis on training, simulation and careful on-boarding for SREs. For me this is still an area where the majority of the industry has plenty room for improvement. Looking at what Google does here makes the rest of us look like f***ing amateurs.

Another thing that I’m almost guaranteed to steal is the idea of establishing a Production Readiness Review to ensure reliability of new products and features from multiple angles (design, security, capacity, etc.).

What I'm still trying to wrap my head around is whether having dedicated SRE teams is a good idea (in contrast to a you-build-it-you-run-it approach where every delivery team effectively owns the responsibility to reach the defined SLAs/SLOs). A principle that I like a lot is to give engineers a lot of freedom but also make them accountable for their decisions and the software they produce. Separating production fitness out into a separate group/team sounds like it goes in the opposite direction. I can imagine that several factors play into this (standardization, active tech/stack management, skill availability, etc.) and certainly Google has carefully evolved it to where it is now, but my initial reaction to this idea was negative.

Overall a very good resource that I will come back to
SeyedMostafa Meshkati · 65 reviews · 27 followers
October 12, 2021
I have to say, this book is my new Engineering Bible!
It has ups and downs in some chapters; some are relatable and sensible to one reader and not to another, depending on your experience and situation, but overall, it's a masterpiece.

In Persian (translated): This book has practically become my engineering bible. Depending on the chapter it has its ups and downs; some things will feel tangible to some readers and not to others, probably because of the situations we have been in and our experience, but overall this book is an outstanding work.
Anita · 11 reviews
March 30, 2024
If I needed to write a review in one sentence, it would be: How do you operate at scale!

Some things in the book could be more varied and exciting, but it's worth reading! It contains Google insider stories, lessons learned, and processes implemented! If you've been in the industry for a long time, many of the things in it are common sense, so you might end up nodding a lot while reading or simply finding it boring because it's nothing new.
Liviu Costea · 29 reviews · 2 followers
November 16, 2019
A lot of food for thought, a book that became a reference in the field. The only problem is the wide coverage: you might find some chapters very niche, as not everybody cares how to build a layer 4 load balancer.
Highly recommended if you are following devops approaches.
Vít Listík · 4 reviews · 3 followers
December 25, 2018
I like the fact that it is written by multiple authors. Everything stated in the book seems so obvious but it is so sad to read it because it is not yet an industry standard. A must read for every SRE.
Jonas Minelga · 25 reviews
December 21, 2021
A very long and detailed book. The information in it is extremely valuable, but I think Google is one of maybe 2-3 companies in the world where all of it can be used. I think for a broader audience it is too detailed in some parts, has duplicate info in others, and is slightly difficult to read. But overall, the book provides a lot of amazing insights and many ideas.
Ahmad hosseini · 320 reviews · 73 followers
April 3, 2017
What is SRE?
Site Reliability Engineering (SRE) is Google’s approach to service management.
An SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).
Typical SRE activities fall into the following approximate categories:
• Software engineering: Involves writing or modifying code, in addition to any associated design and documentation work.
• System engineering: Involves configuring production systems, modifying configurations, or documenting systems in a way that produces lasting improvements from a one-time effort.
• Toil: Work directly tied to running a service that is repetitive, manual, etc.
• Overhead: Administrative work not tied directly to running a service.

Quotes
“Be warned that being an expert is more than understanding how a system is supposed to work. Expertise is gained by investigating why a system doesn’t work.” – Brian Redman
“Ways in which things go right are special cases of the ways in which things go wrong.” – John Allspaw

About book
This book is a series of essays written by members and alumni of Google’s Site Reliability Engineering organization. It’s much more like conference proceedings than it is like a standard book by an author or a small number of authors. Each chapter is intended to be read as a part of a coherent whole, but a good deal can be gained by reading on whatever subject particularly interests you.
“Essential reading for anyone running highly available web services at scale.” – Adrian Cockcroft, Battery Ventures, former Netflix Cloud Architect
David · 92 reviews · 5 followers
July 16, 2018
The book seems largely to be a collection of essays written by disparate people within Google's SRE organization. It's as well-organized and coherent as that can be (and I think it's a good format for this -- far better than if they'd tried to create something with a more unified narrative). But it's very uneven: some chapters are terrific while some seem rather empty. I found the chapters on risk, load balancing, overload, distributed consensus, and (surprisingly) launches to be among the most useful. On the other hand, the chapter on simplicity was indeed simplistic, and the chapter on data integrity was (surprisingly) disappointing.

The good: there's a lot of excellent information in this book. It's a comprehensive, thoughtful overview for anybody entering the world of distributed systems, cloud infrastructure, or network services. Despite a few misgivings, I'm pretty much on board with Google's approach to SRE. It's a very thoughtful approach to the problems of operating production services, covering topics ranging from time management, prioritization, and onboarding to all the technical challenges in distributed systems.

The bad: The book gets religious (about Google) at times, and some of it's pretty smug. This isn't a big deal, but it's likely to turn off people who've seen from experience how frustrating and unproductive it can be when good ideas about building systems become religion.
17 reviews · 1 follower
March 11, 2024
Okay. It took me a while to get through this book. Good material all round that touches on base SRE principles. Just keep in mind that some parts might not apply if you are not at Google scale.
Luke Amdor · 8 reviews · 8 followers
October 16, 2017
Some really great chapters especially towards the beginning and the end. However, I feel like it could have been edited better. It meanders a lot.
Amr · 48 reviews · 13 followers
March 8, 2020
The book is great in terms of getting more understanding of google’s SRE culture. But I got to a place where it became irrelevant to me to continue the book so I decided to drop it.
490 reviews · 6 followers
October 4, 2018
A wonderful book to learn how to manage websites so that they are reliable.

Some good random extracts from the book.

Site Reliability Engineering
1. Operations personnel should spend 50% of their time in writing automation scripts and programs.
2. the decision to stop releases for the remainder of the quarter once an error budget is depleted
3. an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).
4. codified rules of engagement and principles for how SRE teams interact with their environment—not only the production environment, but also the product development teams, the testing teams, the users, and so on
5. operates under a blame-free postmortem culture, with the goal of exposing faults and applying engineering to fix these faults, rather than avoiding or minimizing them.
6. There are three kinds of valid monitoring output:
Alerts: Signify that a human needs to take action immediately in response to something that is either happening or about to happen, in order to improve the situation.
Tickets: Signify that a human needs to take action, but not immediately. The system cannot automatically handle the situation, but if a human takes action in a few days, no damage will result.
Logging: No one needs to look at this information, but it is recorded for diagnostic or forensic purposes. The expectation is that no one reads logs unless something else prompts them to do so.
7. Resource use is a function of demand (load), capacity, and software efficiency. SREs predict demand, provision capacity, and can modify the software. These three factors are a large part (though not the entirety) of a service’s efficiency.

SLI - Service Level Indicator - Indicators used to measure the health of a service. Used to determine the SLO and SLA.
SLO - Service Level Objective - The objective that must be met by the service.
SLA - Service Level Agreement - The Agreement with the client with respect to the services rendered to them.
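
To make the relationship between these terms concrete, here is a minimal, hypothetical sketch (in Python, with invented request counts and an assumed 99.9% SLO) of how an availability SLI can be compared against an SLO to track the error budget mentioned in the extracts above:

```python
# Hypothetical sketch of the SLI -> SLO -> error-budget arithmetic.
# The request counts and the 99.9% SLO are illustrative, not Google's figures.

def availability_sli(successful_requests: int, total_requests: int) -> float:
    """SLI: the fraction of requests served successfully."""
    return successful_requests / total_requests

def error_budget_remaining(sli: float, slo: float) -> float:
    """The error budget is the allowed unreliability (1 - SLO) minus what was spent."""
    allowed_failure = 1.0 - slo       # e.g. 0.001 for a 99.9% SLO
    observed_failure = 1.0 - sli
    return allowed_failure - observed_failure

sli = availability_sli(successful_requests=9_993_000, total_requests=10_000_000)
budget = error_budget_remaining(sli, slo=0.999)
if budget < 0:
    print("Error budget exhausted: freeze feature launches and focus on reliability.")
else:
    print(f"Budget left: {budget:.4%} of requests may still fail this period.")
```

When the budget goes negative, the policy from item 2 above (stop releases for the remainder of the quarter) would kick in.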

Don’t overachieve

Users build on the reality of what you offer, rather than what you say you’ll supply, particularly for infrastructure services. If your service’s actual performance is much better than its stated SLO, users will come to rely on its current performance. You can avoid over-dependence by deliberately taking the system offline occasionally (Google’s Chubby service introduced planned outages in response to being overly available),18 throttling some requests, or designing the system so that it isn’t faster under light loads.

"If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow."

Four Golden Signals of Monitoring
The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.

Latency: The time it takes to service a request. It’s important to distinguish between the latency of successful requests and the latency of failed requests. For example, an HTTP 500 error triggered due to loss of connection to a database or other critical backend might be served very quickly; however, as an HTTP 500 error indicates a failed request, factoring 500s into your overall latency might result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors.
Traffic: A measure of how much demand is being placed on your system, measured in a high-level system-specific metric. For a web service, this measurement is usually HTTP requests per second, perhaps broken out by the nature of the requests (e.g., static versus dynamic content). For an audio streaming system, this measurement might focus on network I/O rate or concurrent sessions. For a key-value storage system, this measurement might be transactions and retrievals per second.
Errors: The rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, "If you committed to one-second response times, any request over one second is an error"). Where protocol response codes are insufficient to express all failure conditions, secondary (internal) protocols may be necessary to track partial failure modes. Monitoring these cases can be drastically different: catching HTTP 500s at your load balancer can do a decent job of catching all completely failed requests, while only end-to-end system tests can detect that you’re serving the wrong content.
Saturation: How "full" your service is. A measure of your system fraction, emphasizing the resources that are most constrained (e.g., in a memory-constrained system, show memory; in an I/O-constrained system, show I/O). Note that many systems degrade in performance before they achieve 100% utilization, so having a utilization target is essential.
In complex systems, saturation can be supplemented with higher-level load measurement: can your service properly handle double the traffic, handle only 10% more traffic, or handle even less traffic than it currently receives? For very simple services that have no parameters that alter the complexity of the request (e.g., "Give me a nonce" or "I need a globally unique monotonic integer") that rarely change configuration, a static value from a load test might be adequate. As discussed in the previous paragraph, however, most services need to use indirect signals like CPU utilization or network bandwidth that have a known upper bound. Latency increases are often a leading indicator of saturation. Measuring your 99th percentile response time over some small window (e.g., one minute) can give a very early signal of saturation.
Finally, saturation is also concerned with predictions of impending saturation, such as "It looks like your database will fill its hard drive in 4 hours."

If you measure all four golden signals and page a human when one signal is problematic (or, in the case of saturation, nearly problematic), your service will be at least decently covered by monitoring.
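
As a rough illustration of that last point, here is a hedged, hypothetical sketch of paging on the golden signals; the field names and thresholds are invented and would need tuning per service:

```python
# Hypothetical sketch: page a human when a golden signal is (nearly) problematic.
# Thresholds are invented for illustration; traffic is carried along for context
# rather than paged on directly.

from dataclasses import dataclass

@dataclass
class GoldenSignals:
    latency_p99_ms: float   # latency of successful requests
    requests_per_s: float   # traffic
    error_rate: float       # fraction of failed requests
    saturation: float       # utilization of the most constrained resource

THRESHOLDS = {
    "latency_p99_ms": 800.0,
    "error_rate": 0.01,
    "saturation": 0.85,     # page before 100%, since systems degrade earlier
}

def should_page(s: GoldenSignals) -> list:
    problems = []
    if s.latency_p99_ms > THRESHOLDS["latency_p99_ms"]:
        problems.append("latency")
    if s.error_rate > THRESHOLDS["error_rate"]:
        problems.append("errors")
    if s.saturation > THRESHOLDS["saturation"]:
        problems.append("saturation")
    return problems

print(should_page(GoldenSignals(950.0, 1200.0, 0.002, 0.6)))  # ['latency']
```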

Why is it important to have control over the software that one is using? Why and when does it make sense to roll out one's own framework and/or platform?
Another argument in favor of automation, particularly in the case of Google, is our complicated yet surprisingly uniform production environment, described in The Production Environment at Google, from the Viewpoint of an SRE. While other organizations might have an important piece of equipment without a readily accessible API, software for which no source code is available, or another impediment to complete control over production operations, Google generally avoids such scenarios. We have built APIs for systems when no API was available from the vendor. Even though purchasing software for a particular task would have been much cheaper in the short term, we chose to write our own solutions, because doing so produced APIs with the potential for much greater long-term benefits. We spent a lot of time overcoming obstacles to automatic system management, and then resolutely developed that automatic system management itself. Given how Google manages its source code, the availability of that code for more or less any system that SRE touches also means that our mission to “own the product in production” is much easier because we control the entirety of the stack.
When developed in-house, the platform/framework can be designed to manage failures automatically; no external observer is required to manage this.
One of the negatives of automation is that humans forget how to do a task when it is eventually required of them, which is not always good.

Google Cherry Picks features for release. Should we do the same?
"All code is checked into the main branch of the source code tree (mainline). However, most major projects don’t release directly from the mainline. Instead, we branch from the mainline at a specific revision and never merge changes from the branch back into the mainline. Bug fixes are submitted to the mainline and then cherry picked into the branch for inclusion in the release. This practice avoids inadvertently picking up unrelated changes submitted to the mainline since the original build occurred. Using this branch and cherry pick method, we know the exact contents of each release."
Note that the cherry picks go into specific release branches; changes are never merged back from a branch into the mainline.

Surprises vs. boring
"Unlike just about everything else in life, "boring" is actually a positive attribute when it comes to software! We don’t want our programs to be spontaneous and interesting; we want them to stick to the script and predictably accomplish their business goals. In the words of Google engineer Robert Muth, "Unlike a detective story, the lack of excitement, suspense, and puzzles is actually a desirable property of source code." Surprises in production are the nemeses of SRE."

Commenting or flagging code
"Because engineers are human beings who often form an emotional attachment to their creations, confrontations over large-scale purges of the source tree are not uncommon. Some might protest, "What if we need that code later?" "Why don’t we just comment the code out so we can easily add it again later?" or "Why don’t we gate the code with a flag instead of deleting it?" These are all terrible suggestions. Source control systems make it easy to reverse changes, whereas hundreds of lines of commented code create distractions and confusion (especially as the source files continue to evolve), and code that is never executed, gated by a flag that is always disabled, is a metaphorical time bomb waiting to explode, as painfully experienced by Knight Capital, for example (see "Order In the Matter of Knight Capital Americas LLC" [Sec13])."

Writing blameless RCA
Pointing fingers: "We need to rewrite the entire complicated backend system! It’s been breaking weekly for the last three quarters and I’m sure we’re all tired of fixing things onesy-twosy. Seriously, if I get paged one more time I’ll rewrite it myself…"
Blameless: "An action item to rewrite the entire backend system might actually prevent these annoying pages from continuing to happen, and the maintenance manual for this version is quite long and really difficult to be fully trained up on. I’m sure our future on-callers will thank us!"

Establishing a strong testing culture
One way to establish a strong testing culture is to start documenting all reported bugs as test cases. If every bug is converted into a test, each test is supposed to initially fail because the bug hasn’t yet been fixed. As engineers fix the bugs, the software passes testing and you’re on the road to developing a comprehensive regression test suite.
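
A small, hypothetical illustration of that practice; the function and bug number are invented, and in a real codebase you would import the actual code under test:

```python
# "Every reported bug becomes a regression test." The test is written while the
# bug is open (so it fails first) and then permanently guards the fix.

def parse_duration_seconds(text: str) -> int:
    """Toy parser: '90s' -> 90. Hardened after a hypothetical bug report."""
    text = text.strip()
    if not text:              # bug #1234: empty input used to raise an exception
        return 0
    if text.endswith("s"):
        text = text[:-1]
    return int(text)

def test_bug_1234_empty_input_does_not_crash():
    assert parse_duration_seconds("") == 0

def test_bug_1234_whitespace_only_input():
    assert parse_duration_seconds("   ") == 0
```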

Project Vs. Support
Dedicated, noninterrupted, project work time is essential to any software development effort. Dedicated project time is necessary to enable progress on a project, because it’s nearly impossible to write code—much less to concentrate on larger, more impactful projects—when you’re thrashing between several tasks in the course of an hour. Therefore, the ability to work on a software project without interrupts is often an attractive reason for engineers to begin working on a development project. Such time must be aggressively defended.

Managing Loads
Round Robin Vs. Weighted Round Robin (Round Robin, but taking into consideration the number of tasks pending at the server)
Overload of the system has to be avoided through load testing. If, despite this, the system is overloaded, then any retries have to be well controlled: a retry at a higher level can cascade into retries at the lower level. Use jittered retries (retry at random intervals) and exponential backoff (exponentially increase the time between retries), and fail quickly, to prevent further load on an already overloaded system.
If queuing is used to prevent overloading the server, then FIFO may sometimes not be a good option, as the user waiting for the task at the head of the queue may have already left the system and no longer expects a response.
If a task is split into multiple pipelined tasks, it is good to check at each stage whether there is sufficient time to perform the rest of the work, based on the expected time the remaining stages in the pipeline will take. Implement deadline propagation.
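
A minimal sketch, under assumed names and numbers, of two of the ideas above: capped exponential backoff with jitter, and a propagated deadline so that retries (and downstream stages) give up once the caller's overall budget is spent:

```python
# Hypothetical retry helper: jittered, capped exponential backoff that never
# sleeps or retries past the deadline propagated by the caller.

import random
import time

def call_with_retries(do_request, deadline_s, base_delay_s=0.05, max_delay_s=2.0):
    """Retry a flaky call, failing fast once the deadline is exceeded."""
    deadline = time.monotonic() + deadline_s
    attempt = 0
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("deadline exceeded; fail fast instead of piling on retries")
        try:
            # pass the remaining budget downstream (deadline propagation)
            return do_request(timeout=remaining)
        except ConnectionError:
            attempt += 1
            # full jitter: sleep a random amount up to the capped exponential backoff
            backoff = min(max_delay_s, base_delay_s * (2 ** attempt))
            time.sleep(min(random.uniform(0, backoff),
                           max(deadline - time.monotonic(), 0)))
```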

Safeguarding the data
Three levels of guard against data loss
1. Soft Delete (Visible to user in the recycle bin)
2. Back up (incremental and full) before actual deletion and test ability to restore. Replicate live and backed up data.
3. Purge data (Can be recovered only from backup now)
Out of Band data validation to prevent surprising data loss.
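
A toy sketch of the soft-delete level, with invented names and a 30-day window; a real system would also replicate the data and take the incremental and full backups described above before purging anything:

```python
# Level 1 of the data-loss guard: soft delete into a user-visible recycle bin,
# with purge eligibility only after the recycle window has elapsed.

import time

RECYCLE_WINDOW_S = 30 * 24 * 3600   # keep soft-deleted items recoverable for 30 days

class Item:
    def __init__(self, data):
        self.data = data
        self.deleted_at = None          # None means the item is live

    def soft_delete(self):
        self.deleted_at = time.time()   # visible to the user in the recycle bin

    def restore(self):
        self.deleted_at = None

    def purgeable(self, now=None):
        # only items whose recycle window has passed (and which are already
        # captured in backups) may be physically removed
        now = now if now is not None else time.time()
        return self.deleted_at is not None and now - self.deleted_at > RECYCLE_WINDOW_S
```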

Important to
1. Continuously test the recovery process as part of your normal operations
2. Set up alerts that fire when a recovery process fails to provide a heartbeat indication of its success
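
One hedged way to implement item 2, with invented names: the recovery drill records a heartbeat on every successful restore, and a separate check pages when that heartbeat goes stale, so a silently failing recovery process still gets noticed:

```python
# Hypothetical heartbeat for the restore-test pipeline: alert when no successful
# restore has been recorded within the expected interval.

import time

HEARTBEAT_MAX_AGE_S = 26 * 3600   # expect at least one successful restore per day
_last_success = None              # timestamp of the last successful restore drill

def record_restore_success():
    global _last_success
    _last_success = time.time()   # called at the end of each successful restore test

def recovery_heartbeat_stale(now=None):
    now = now if now is not None else time.time()
    return _last_success is None or now - _last_success > HEARTBEAT_MAX_AGE_S

# A monitoring job would call recovery_heartbeat_stale() periodically and page
# (or file a ticket) when it returns True.
```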

Launch Coordination Checklist
This is Google’s original Launch Coordination Checklist, circa 2005, slightly abridged for brevity:

1. Architecture: Architecture sketch, types of servers, types of requests from clients
2. Programmatic client requests
3. Machines and datacenters
4. Machines and bandwidth, datacenters, N+2 redundancy, network QoS
5. New domain names, DNS load balancing
6. Volume estimates, capacity, and performance
7. HTTP traffic and bandwidth estimates, launch “spike,” traffic mix, 6 months out
8. Load test, end-to-end test, capacity per datacenter at max latency
9. Impact on other services we care most about
10. Storage capacity
11. System reliability and failover
What happens when:
Machine dies, rack fails, or cluster goes offline
Network fails between two datacenters
For each type of server that talks to other servers (its backends):
How to detect when backends die, and what to do when they die
How to terminate or restart without affecting clients or users
Load balancing, rate-limiting, timeout, retry and error handling behavior
Data backup/restore, disaster recovery
12. Monitoring and server management
Monitoring internal state, monitoring end-to-end behavior, managing alerts
Monitoring the monitoring
Financially important alerts and logs
Tips for running servers within cluster environment
Don’t crash mail servers by sending yourself email alerts in your own server code
13. Security
Security design review, security code audit, spam risk, authentication, SSL
Prelaunch visibility/access control, various types of blacklists
14. Automation and manual tasks
Methods and change control to update servers, data, and configs
Release process, repeatable builds, canaries under live traffic, staged rollouts
15. Growth issues
Spare capacity, 10x growth, growth alerts
Scalability bottlenecks, linear scaling, scaling with hardware, changes needed
Caching, data sharding/resharding
16. External dependencies
Third-party systems, monitoring, networking, traffic volume, launch spikes
Graceful degradation, how to avoid accidentally overrunning third-party services
Playing nice with syndicated partners, mail systems, services within Google
17. Schedule and rollout planning
Hard deadlines, external events, Mondays or Fridays
Standard operating procedures for this service, for other services

As mentioned, you might encounter responses such as "Why me?" This response is especially likely when a team believes that the postmortem process is retaliatory. This attitude comes from subscribing to the Bad Apple Theory: the system is working fine, and if we get rid of all the bad apples and their mistakes, the system will continue to be fine. The Bad Apple Theory is demonstrably false, as shown by evidence [Dek14] from several disciplines, including airline safety. You should point out this falsity. The most effective phrasing for a postmortem is to say, "Mistakes are inevitable in any system with multiple subtle interactions. You were on-call, and I trust you to make the right decisions with the right information. I'd like you to write down what you were thinking at each point in time, so that we can find out where the system misled you, and where the cognitive demands were too high."

"The best designs and the best implementations result from the joint concerns of production and the product being met in an atmosphere of mutual respect."

Postmortem Culture

Corrective and preventative action (CAPA) is a well-known concept for improving reliability that focuses on the systematic investigation of root causes of identified issues or risks in order to prevent recurrence. This principle is embodied by SRE's strong culture of blameless postmortems. When something goes wrong (and given the scale, complexity, and rapid rate of change at Google, something inevitably will go wrong), it's important to evaluate all of the following:

What happened
The effectiveness of the response
What we would do differently next time
What actions will be taken to make sure a particular incident doesn't happen again

This exercise is undertaken without pointing fingers at any individual. Instead of assigning blame, it is far more important to figure out what went wrong, and how, as an organization, we will rally to ensure it doesn't happen again. Dwelling on who might have caused the outage is counterproductive. Postmortems are conducted after incidents and published across SRE teams so that all can benefit from the lessons learned.

Decisions should be informed rather than prescriptive, and are made without deference to personal opinions, even that of the most senior person in the room, whom Eric Schmidt and Jonathan Rosenberg dub the "HiPPO," for "Highest-Paid Person's Opinion."
