Reducing Operations Toil with Site Reliability Automation

If you have never heard of Site Reliability Engineering (SRE), you probably don’t know about operations “toil”. However, according to Gartner, “in our 2019 DevOps survey, 41% of respondents have already adopted certain elements of SRE, and an additional 42% plan to implement SRE practices by YE20”. [1] So we believe there is little doubt that you will become familiar with the concept of operations toil very soon, along with the challenges of adopting an SRE approach. It is just a matter of time.

What is SRE?
Site Reliability Engineering is not new; it is based on practices that Google started to put in place even before DevOps became mainstream. When employing SRE models, teams take a software engineering approach to IT Operations. SRE aims to handle a bigger volume of changes faster, and to accept the risk induced by change. That explains why DevOps and SRE approaches work well together, and why SRE is sometimes considered an extension of DevOps (even if SRE is older than DevOps).

DevOps                        | SRE
Focus on continuous delivery  | Focus on service management
Bridge organizational silos   | Leverage tooling across teams
Accept failure as ‘normal’    | Accept risk on service levels
Implement iterative changes   | Implement “atomic” changes
Automate delivery toolchain   | Automate standard operating procedures
Optimize TTM (Time-To-Market) | Optimize MTTR (Mean-Time-To-Repair)

The main challenges of SRE
As teams pursue SRE initiatives, the tools in place can offer significant benefit or pose a massive challenge. The reality is that many organizations looking to adopt SRE models employ loosely connected toolchains. A multitude of tools creates heterogeneity, which makes it harder for staff to manage the infrastructure efficiently and to expedite problem-solving. As a consequence, SRE adopters face two notable challenges:

  • Reducing operations toil. In Google’s definition, toil is not just “work I don’t like to do”. It is the kind of manual, repetitive, and mundane work that provides little value to operations. Reducing toil is essential if you want to deal with a faster pace and a greater volume of changes.
  • Reducing MTTR. The Mean-Time-To-Repair measures how long it takes operational teams to fix a problem, whether through a workaround, a rollback, or another action. Reducing MTTR has a significant impact on the overall customer experience. It is also critical for SRE teams: because they accept the risk of change, they need to recover fast to protect the service levels the business expects.
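To make the MTTR metric concrete, here is a minimal sketch that averages repair durations over a small incident log. The function name `mean_time_to_repair`, the tuple layout, and the timestamps are all hypothetical and chosen only for illustration; real teams may measure from a different starting point (for example, from alert acknowledgement rather than detection).

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average repair duration over (detected, resolved) timestamp pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    # Sum the individual repair durations, starting from a zero timedelta.
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incident log: (time detected, time service restored).
incidents = [
    (datetime(2020, 1, 6, 9, 0), datetime(2020, 1, 6, 9, 45)),     # 45 minutes
    (datetime(2020, 1, 8, 14, 30), datetime(2020, 1, 8, 15, 0)),   # 30 minutes
    (datetime(2020, 1, 9, 22, 10), datetime(2020, 1, 9, 23, 25)),  # 75 minutes
]

mttr = mean_time_to_repair(incidents)
print(mttr)  # 0:50:00
```

In this made-up log the three repairs take 45, 30, and 75 minutes, so the MTTR is 50 minutes. Shortening any single repair, for instance by automating a rollback, lowers the average directly.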

Because of the need to handle more changes, faster, while protecting service levels, many organizations view SRE as a monumental undertaking. As a result, they either stall initiatives or try to implement too many changes too quickly, often failing to deliver enough value to justify sustaining the effort.

Site Reliability Automation
Enhancing infrastructure monitoring with intelligent recommendations and auto-remediation capabilities can help organizations create more resilient production environments, streamlining their Site Reliability Engineering initiatives.

An integral part of BizOps from Broadcom and AIOps solutions powered by automation.ai, Broadcom’s Site Reliability Automation includes contextual automation that integrates root cause alarms with remedial workflows. This contextual awareness lets SRE teams easily automate standard operating procedures that can be reused across environments. While contextual automation contributes to reducing operations toil, it also enables teams to deal with a bigger volume of events. In parallel, a recommendation engine leverages cross-domain insight to assist staff in choosing the most effective course of action for issue remediation. Machine learning algorithms rank the most successful remediation workflows for a given context. That continuous learning helps resolve more issues faster and reduces MTTR.
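The idea of ranking remediation workflows by past success in a given context can be illustrated with a deliberately simple sketch. This is not Broadcom's recommendation engine, whose internals the article does not describe. It swaps in a plain smoothed success rate in place of machine learning, and the class `WorkflowRanker`, the context labels, and the workflow names are all hypothetical.

```python
from collections import defaultdict


class WorkflowRanker:
    """Ranks remediation workflows per alarm context by smoothed success rate."""

    def __init__(self):
        # (context, workflow) -> [successes, attempts]
        self._stats = defaultdict(lambda: [0, 0])

    def record(self, context, workflow, succeeded):
        """Store the outcome of one remediation attempt."""
        stats = self._stats[(context, workflow)]
        stats[0] += int(succeeded)
        stats[1] += 1

    def rank(self, context):
        """Return the workflows tried in this context, best first."""
        scored = []
        for (ctx, workflow), (successes, attempts) in self._stats.items():
            if ctx == context:
                # Laplace smoothing keeps a single lucky run from dominating.
                score = (successes + 1) / (attempts + 2)
                scored.append((score, workflow))
        # Highest score first; ties are broken alphabetically.
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [workflow for _, workflow in scored]


# Hypothetical history of remediation attempts.
ranker = WorkflowRanker()
for _ in range(3):
    ranker.record("disk_full", "purge_tmp_files", True)
ranker.record("disk_full", "extend_volume", True)
ranker.record("disk_full", "extend_volume", False)
ranker.record("disk_full", "restart_service", False)
ranker.record("high_cpu", "restart_service", True)

print(ranker.rank("disk_full"))
# ['purge_tmp_files', 'extend_volume', 'restart_service']
```

Each recorded outcome updates the statistics, so the ranking shifts as more remediations are attempted. That is the feedback loop the paragraph above refers to as continuous learning, reduced here to its simplest form.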

Site Reliability Automation addresses two major challenges of SRE teams by efficiently reducing operations toil and improving MTTR. Ultimately, Site Reliability Automation empowers DevOps initiatives by aligning infrastructure management with the pace of modern continuous delivery. In the very near future, SRE will become mainstream; automation will make the decisions and ensure consistent operations. Now is probably a good time to review your automation strategies.

Sources:

[1] Gartner, “DevOps Teams Must Use Site Reliability Engineering to Maximize Customer Value” (Gartner subscription required), George Spafford and Manjunath Bhat, published 10 January 2020.