The Insights Manifesto

A missing paradigm in observability, and what to do about it.

Niall Murphy

It’s so much of a cliché that it’s a cliché to write it down: software ate the world, but data is having it for dessert. Yet in my slice of the world - cloud, ML, production software, SRE, DevOps - the dinner party is dealing with indigestion, not inspiration.

Why? It’s not as if we in the cloud world are new to data - arguably we’ve had operational data since the invention of the logfile in the 1970s. We also have a lot of data: in 2022, for example, an experiment with a single app on Kubernetes generated ~500Gb/month (and it hasn’t gotten smaller since). But again, it’s hard to find people in my world who think that this continually expanding data-swamp is worth it. At best it’s the reluctant cost of doing business. The lore and art connected to monitoring production systems - dashboard maintenance, system analysis, etc - is a hugely complex and demanding set of tasks. It’s generally viewed as necessary, but the expenditure of time and money to stay on top of everything is increasing. Often it seems we’re running to stand still.

So again I ask - why? My answer: I think we’re missing a higher-level construct. We have an incredible amount of data, we even have a lot of information, but we have hardly anything that summarises the situation effectively, ranks or prioritises, distills important facts, understands context, and keeps track of long-term trends without being explicitly told what to track. In other words, what we’re missing is an experienced SRE to look over your shoulder and provide guidance. But there aren’t enough of those to go around. So what do we do?

At Stanza, we think of this in terms of a framework we developed related to the OODA loop, very familiar to folks with a corporate decision-making or military background. In our case it started with something we called Signals, Comprehensions (later renamed Insights), and Actions. Broadly speaking, Signals are the data you use to make decisions - as we said, there are astonishing quantities of this available, and an almost equivalent number of tools to work with that data. Actions are roughly in the same boat - automation frameworks and support of many kinds are widely available. The options are perhaps not quite as consistent or applicable as in the observability sector, but there are definitely a huge number of them if you want to buy or use something external to make your situation better.

And then there is the Insights market. It is emerging, and I’ve spoken with a lot of people who see this gap, but there is a dearth of available tools.

Where is the tool that can look across your business and surface the things you will need to care about, but can’t see directly right now? Your customer behaviour data tells you vital things about your business, and it’s sitting right there in your server metrics, but that’s not often well understood or actively used. What software is it that can realise your storage consumption is going up, match it with your cloud bill, match that with your critical user journey mix, and understand that you need to push 7% of customers onto a lower-cost flow so the consumption/revenue curves will line up?

All of this is possible - indeed, it’s done every day in other domains, and BI is applied to lots of things successfully - but you’ll be sorely disappointed if you go looking for something to fix it. If you’re extraordinarily lucky, your team has a few good SQL queries on some data-lake or other, or maybe even something in BigQuery. Expecting even that is often a reach. Our time down the hyperscaler mines showed us that software can do a lot in this domain - I recall with particular fondness a capacity planning tool that literally made weeks’ worth of work every quarter simply vanish - but existing companies are actually more limited by what they imagine can be done than by what can be done.

Again, I ask why, and I know, in asking, many of my readers will understand why. It’s the same thing it always was. The view that operations is a dirty word; that operations is only ever a cost center; that features and revenue growth are the only things that matter. There are many problems with this reductive view, but one large one that I’m surprised more businesses don’t see is that the answer is right in front of them. The disdain for operational data flows from its supposed distance from revenue, but often it’s the data that’s closest to the customer. Declaring one portion of your business - one often directly customer-facing - to be much less important than the others is self-defeating.

Principles

Once you start seeing things through this lens it’s hard to let it go. So we’ve been thinking about what a system like this might look like, and how it might work.

In our view, an insight is text, based on data, with supporting information (indeed, graphs where relevant, but not graph-oriented), about a particular feature of your production or business. It tells you something about your systems or your customers or your business - something useful. Importantly though, it’s not a real-time alert. If you should page for something, page for it. An insight shouldn’t tell you you’re going to run out of disk space in the next 2 hours. It should tell you that your curve of storage costs versus your curve of income don’t match. Strategic, not tactical.
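The storage-versus-income comparison can be made concrete with a minimal sketch. It assumes monthly storage-cost and revenue figures are already available as plain lists; the function names, the threshold, and the sample numbers are all hypothetical and only illustrate the idea, not any particular product’s implementation.

```python
from statistics import mean


def monthly_growth(series):
    """Average month-over-month growth rate of a series of positive values."""
    rates = [(b - a) / a for a, b in zip(series, series[1:])]
    return mean(rates)


def curves_diverge(storage_cost, revenue, tolerance=0.02):
    """Return True when storage cost outgrows revenue by more than `tolerance`.

    `tolerance` is an illustrative threshold, not a recommended value.
    """
    return monthly_growth(storage_cost) - monthly_growth(revenue) > tolerance


# Hypothetical monthly figures, purely for illustration.
storage_cost = [100.0, 110.0, 121.0, 133.1]   # roughly 10% growth per month
revenue = [1000.0, 1020.0, 1040.4, 1061.208]  # roughly 2% growth per month

if curves_diverge(storage_cost, revenue):
    print("Storage costs are growing faster than revenue.")
```

Note that this looks at trends over months rather than at a threshold that is about to be crossed, which is the difference between an insight and a page.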

Those are definitional aspects of insights, but we should also outline the attributes a good insight has:

  • An insight isn’t an FYI. It should be actionable, or point out something that will be actionable.
  • A sufficiently large collection of insights is indistinguishable from yet another infinite to-do list. Therefore each insight needs to come with a priority - some sense of how important it is. We ultimately can’t decide on the actual importance for the end-user, so we should surface everything we see, but the user’s attention deserves our respect.
  • Insights need to be trustable. They shouldn’t be vague, or based on technologies that hallucinate. When you see one, you need to be confident it’s saying something true.
  • Insights should never presume background on the part of their reader, and should come with references to allow the reader to contextualise them. (Many of us have had alerts that said “System X is bad!” and nothing else. Insights shouldn’t be like that.)
  • If what an Insight surfaces would benefit from a remediation or a mitigation, the Insight should supply same. It should be generic where possible, because then it’s a valuable learning tool (“oh look, this technique/mitigation tool is applicable in more ways than I thought!”) and consequently also benefits from working with open interfaces.
  • Insights should provide a link to the chain of evidence or data sources that allowed the insight to be derived. They should not be issued ex cathedra, as if beyond question.

Overall, insights should be actionable, specific, relevant, and provide helpful context.
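One way to picture these attributes is as a small data structure. The sketch below is an assumption about how they might be encoded, not a description of an existing system: the `Insight` class, its field names, the priority scale, and the example URLs are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Insight:
    summary: str                      # plain-language text that presumes no background
    priority: int                     # 1 = most important (illustrative scale)
    evidence: List[str]               # links to the data the insight was derived from
    references: List[str] = field(default_factory=list)  # context for the reader
    remediation: Optional[str] = None  # suggested mitigation, if one applies

    def has_evidence(self):
        """An insight should always point at its chain of evidence."""
        return len(self.evidence) > 0


def ranked(insights):
    """Surface every insight, most important first."""
    return sorted(insights, key=lambda i: i.priority)


# Hypothetical examples; the URLs are placeholders.
insights = [
    Insight(
        summary="Log volume per request has doubled since the last release.",
        priority=2,
        evidence=["https://example.com/dashboards/log-volume"],
    ),
    Insight(
        summary="Storage costs are growing faster than revenue.",
        priority=1,
        evidence=["https://example.com/reports/storage-vs-revenue"],
        remediation="Move a share of customers onto a lower-cost flow.",
    ),
]

for insight in ranked(insights):
    print(insight.priority, insight.summary)
```

Nothing is filtered out here: every insight is surfaced, and the priority only decides the order in which the reader sees them.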

What should an insight do for you?

Having said all the above - what would an insight do for you, such that you would care?

It should help you make a stabler, safer, more scalable, more efficient system.

It should help you not to forget anything obvious, and suggest things that aren’t obvious.

It should be a support for teams, and a co-pilot for individuals.

The reality is, everyone needs credible, actionable, useful insights into what’s happening, and getting those today is much harder than it should be.

Email hello@stanza.systems today to see how we can help each other on this journey!

Niall Murphy

Niall Murphy is an award-winning, best-selling author, speaker, industry leader, and executive. He has been working in Internet infrastructure since the mid-90s, in a variety of roles from systems, network, and software engineering, and ranging across individual contributor to director scope.

He is best known for the Site Reliability Engineering book and associated works.

He lives in Dublin, Ireland with his wife and two children, and holds degrees in Computer Science, Mathematics, and Poetry Studies.