>Site reliability engineering (SRE) is a set of principles and practices that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems. Site reliability engineering is closely related to DevOps, a set of practices that combine software development and IT operations, and SRE has also been described as a specific implementation of DevOps.
This is a thread for all discussion of site reliability engineering.
SRE is usually used at larger organizations that have multiple applications, services, etc. Where it gets confused with DevOps is that DevOps is the automation and streamlining of traditional sysadmin tasks related to integration and deployment pipelines, essentially programmatically managing traditionally manual processes like testing and deployment to various environments, whereas SRE is automating sysadmin tasks related to performance, observability, and reliability, as the name says. The job the DevOps engineer is replacing is the traditional sysadmin, whereas the job the SRE is replacing is what's known as a production support engineer or creation software engineer, less well known but essentially a SWE whose job it is to support applications. It's not support like helpdesk "install Windows" or whatever; it's support engineering of an application or set of applications: restarting scheduled jobs, database tasks, debugging Java code if heap utilization is too high, or whatever. Essentially providing L2 support for complex enterprise or microservice-type applications (L3 is handled by the SWEs).
SRE is automating those tasks and building software that automatically monitors ops and programmatically restarts services, etc. Essentially automating the job of a traditional production support engineer so that the system self-heals and automatically detects and solves common issues by itself. Google invented the SRE role so that it could create the outside appearance that the service is always up. The most extreme example is the chaos engineering/Chaos Monkey approach to SRE invented by Netflix, which had its reliability engineers build tools that automatically crash certain servers in production so that they could make sure the system keeps running no matter what.
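To make the self-healing plus chaos idea concrete, here is a rough toy sketch in Python. Everything in it is made up for illustration: the `Service` class, `chaos_kill`, `heal`, and the `max_restarts` cutoff are hypothetical names, and a real setup would talk to actual processes, containers, or a cloud API rather than flipping a flag in memory. It is not how Netflix's or Google's tooling works, just the general shape of the loop.

```python
import random
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical stand-in for a real process or container."""
    name: str
    healthy: bool = True
    restarts: int = 0

    def health_check(self) -> bool:
        # A real check would hit an HTTP endpoint, read a PID file, etc.
        return self.healthy

    def restart(self) -> None:
        # A real restart would call the init system or orchestrator.
        self.healthy = True
        self.restarts += 1


def chaos_kill(services, rng):
    """Crash one randomly chosen service, in the spirit of chaos testing."""
    victim = rng.choice(services)
    victim.healthy = False
    return victim


def heal(services, max_restarts=3):
    """Restart unhealthy services; return names that need a human instead."""
    escalate = []
    for svc in services:
        if svc.health_check():
            continue
        if svc.restarts >= max_restarts:
            # Stop blindly restarting a flapping service and page someone.
            escalate.append(svc.name)
        else:
            svc.restart()
    return escalate


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so runs are repeatable
    fleet = [Service("api"), Service("scheduler"), Service("db-proxy")]
    for _ in range(5):
        victim = chaos_kill(fleet, rng)
        paged = heal(fleet)
        print(f"killed {victim.name}; escalated: {paged}")
```

The restart cap is the interesting design choice here: automation handles the common case, and only a service that keeps falling over gets escalated to a person, which is roughly the L2/L3 split described above.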
So both SREs and DevOps engineers focus on automating traditional sysadmin tasks, so they overlap; however, the types of tasks they focus on are slightly different. SREs are usually either devs who transitioned to prod support/ops or system administrators who learned how to code. At FAANG-type companies, SRE interviews have SWE-type leetcode questions as well as a grilling on the detailed internals of Linux, the kernel, the OS, etc. At lesser companies it may simply be a retitled application support engineer.
SRE is the new hotness, and more and more organizations are retitling their prod support teams as SRE and teaching them Python/Golang, cloud, and observability (Splunk, etc.). Lots of companies are hiring "SRE" but the industry hasn't quite settled on a standard set of job duties. At FAANG, SREs are considered SWEs and are on the same pay scale. However, at other companies they may be considered ops. In general SREs have a higher level of pay, due to it being the new hotness, but we will see how long that lasts.
Sounds awesome. I am a tech boomer but I'd love to see some implementation of this for small scale industry. Anyways don't mean to derail but here's my bump.