>Site reliability engineering (SRE) is a set of principles and practices that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems. Site reliability engineering is closely related to DevOps, a set of practices that combine software development and IT operations, and SRE has also been described as a specific implementation of DevOps.This is a thread for all discussion of
Site reliability engineeringSRE is usually used at larger organizations which have multiple applications, services, etc. Where it gets confused with devops is that devops is automation and streamlining of traditional sysadmin tasks related to integration and deployment pipelines, essentially programmatically managing traditionally manual processes like testing and deployment to various environments. Whereas SRE is automating sysadmin tasks related to performance, observability, and reliability, as is in the name. The job devops engineer is replacing is the traditional sysadmin, wheras the job the SRE is replacing is whats known as a production support engineer or creation software engineer, less well known but essentially a SWE who's job it is to support applications. Its not support like helpdesk-install windows or whatever, its support engineering of an application, or set of applications, restarting scheduled jobs, database tasks, debugging java code, if heap utilization is too high or whatever. Essentially providing L2 support for complex enterprise or microservice type applications (L3 handled by the SWE's).
SRE is automating those tasks and building software that automatically monitors ops and programmatically restart services, etc. Essentially automating the job of a traditional production support engineer so that the system self heals and automatically detects and solves common issues by itself. Google invented SRE role so that it could create the outside appearance that the service is always up. The most extreme example is the chaos engineering/chaos monkey approach to SRE invented by netflix which had its reliability engineers invent tools that automatically crash certain servers in production so that they could make sure the system keeps running no matter what.
So both SRE's and DevOps engineers are focusing on automating traditional sysadmin tasks so they ove
Post too long. Click here to view the full text.