As a kid I grew up reading a lot of science fiction. My forbearing parents let me check out the maximum number of books the library allowed each week (30, I still remember that number), and each week I would go back for more. Given this constant consumption of augury, you would think something I read would have prepared me for the future we now face within the Operations space.

While there are definitely some inklings in the science fiction canon about computer systems constructed at such scale that they would be hard for humans to understand, there is precious little attention paid to what it would take to operate them in production. Welcome to my world (and your reality, too, I bet).

At the AIOps Virtual Summit we discussed two separate approaches to handling this level of complexity and how they intersect. The first is the engineering discipline known as Site Reliability Engineering (SRE), which aims to engineer failure out of the system. The second, AIOps, is a newly coined term for the application of a class of advanced algorithms to the massive corpus of operational data we now accumulate as part of the ordinary day-to-day activity of running all of these systems and services.

One goal of the former is to construct a set of operational practices that allow us to navigate the tricky path between a desired feature velocity (iterating the software as fast as possible to provide the features a business needs to provide to its customer base) and a desired level of operational stability (keeping the system available for those customers). This is trickier than it sounds for at least three reasons:

  1. There are often completely different sets of people working on these problems.
  2. They have very different incentives around the work.
  3. Communication between these groups is often, shall we say, a little dicey.

SRE, like many other engineering disciplines, is a data-driven approach. It uses data (in ways we’ll talk about in the upcoming session) to make conversations more productive and decision making easier between these different groups.

AIOps similarly tries to use operational data to provide a big win for an organization. It attempts to address the hard problem of “we have all of this data on the operational status and performance of our infrastructure, what can we learn from it?”

Can the record of the past help us understand how things are working in the present or even help predict the future? Is there information in the data I already have that might provide some insight into how my systems are behaving? For example:

  • Is this just a spike in traffic or an indication my systems are about to experience a tailspin into failure?
  • Are there any difficult-to-see patterns in the load in my system that could help me optimally provision my resources so I don’t pay more than I need to?
  • Have we ever seen an outage like the one we are experiencing? (And how did we deal with it last time?)
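
To make the first question above a little more concrete, here is a minimal sketch of one simple way to flag a traffic spike: compare each new reading against the mean and standard deviation of a short trailing window. This is an illustration only, not a description of how any particular AIOps product works. The function name `flag_spikes`, the `window` and `threshold` parameters, and the sample numbers are all hypothetical.

```python
from statistics import mean, stdev


def flag_spikes(series, window=5, threshold=3.0):
    """Return the indices of readings that sit more than `threshold`
    standard deviations away from the mean of the preceding `window`
    readings. Names and defaults here are illustrative assumptions."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history gives no scale to compare against,
            # so this sketch simply skips the reading.
            continue
        if abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical requests-per-minute readings with one obvious spike.
requests_per_minute = [100, 102, 98, 101, 99, 100, 103, 250, 101, 99]
print(flag_spikes(requests_per_minute))  # prints [7]
```

A check like this only says that a reading is unusual. Telling a harmless burst apart from the start of a tailspin takes more context, such as error rates and latency alongside traffic, which is where the larger body of operational data comes in.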

Some of this is real today, some of it is easily imagined. There are definitely limits on what AIOps can offer our operations practices, but we surely haven’t taken it to its full potential yet.

Watch a replay of Todd Palino, a senior SRE at LinkedIn, and me discussing “How AI is Helping Site Reliability Engineers Automate Incident Response.” We discuss both approaches and their potential to bring a little bit of the future into your present.

Automation.ai as your trusted guide

What if you could empower SREs with the insights needed to drive improvements? What if, instead of the typical war rooms and on-call burnout, SREs had a trusted guide to quickly fix problems?

Broadcom provides a “just add water” approach that can help your IT teams automate incident response through our AI-driven, self-healing platform, automation.ai. Leveraging our deep domain expertise, we can help your SRE teams prevent alert fatigue by continuously triaging alerting rules. We do this with a combination of notification rules, process changes, dashboards and machine learning (ML) that proactively monitors the four SRE golden signals and measures what really matters for customer experience.