Outage Simulator

Niall Murphy

Hi everyone - we’re pleased to announce the BETA launch of a tool SREs might get a kick out of: an interactive outage simulator.

What can you use this for? Well:

Training: It’s useful for people to understand the dynamics of how time-of-day affects outages, the difference between lots of little outages and one big one, and how thinking about things in terms of critical user journeys (CUJs) can be useful.

Modelling: Want to make a high-risk change? Feed your traffic pattern in and test what happens if you have an outage at the scheduled time. Or divide your traffic into CUJs and see what happens if you can protect some of them. Or write some code to estimate bounce-back after outages.

Of course, all this behaviour is extendable and changeable. So if you have traffic shapes, income models, or user behaviour data that you would like to model, contact hello@stanza.systems and let us know!
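If you want a feel for what "write some code to estimate bounce-back" might look like, here is a minimal sketch. It is not the simulator's own code: the function name, the `return_fraction` and `half_life` parameters, and the idea that returning users arrive as a decaying surge right after the outage ends are all assumptions made for illustration.

```python
def with_bounce_back(traffic, start, duration, severity,
                     return_fraction=0.5, half_life=15.0):
    """Apply an outage to per-minute traffic, then add back some lost requests.

    traffic:         list of requests per minute (e.g. 1440 entries for a day)
    severity:        fraction of traffic lost during the outage (0.0 to 1.0)
    return_fraction: ASSUMED share of lost requests that come back afterwards
    half_life:       ASSUMED number of minutes for the returning surge to halve
    """
    end = min(start + duration, len(traffic))
    result = list(traffic)

    # Remove traffic during the outage and remember how much was lost.
    lost = 0.0
    for minute in range(start, end):
        dropped = traffic[minute] * severity
        result[minute] -= dropped
        lost += dropped

    # Spread the returning requests over the rest of the day using
    # geometrically decaying weights, normalised so they sum to 1.
    returning = lost * return_fraction
    decay = 0.5 ** (1.0 / half_life)
    weights = [decay ** i for i in range(len(traffic) - end)]
    total_weight = sum(weights)
    for i, weight in enumerate(weights):
        result[end + i] += returning * weight / total_weight
    return result


# Flat traffic of 100 requests/minute, one-hour full outage starting at 10:00.
flat_day = [100.0] * 1440
recovered = with_bounce_back(flat_day, start=600, duration=60, severity=1.0)
```

With these made-up numbers, 6,000 requests are lost during the outage and half of them reappear as a surge that fades over the following minutes. Swapping in your own traffic shape and return behaviour is the interesting part.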

More tactical instructions follow:

On launch, you’re presented with a window with a traffic shape on the left and controls on the right. If you move your mouse cursor over the traffic shape you should see a blue bar track it as it moves - that’s where the outage will hit when you click! Click away a bunch of times and you should see the traffic take a nose-dive. Use the scroll-wheel or the “Outage duration” UI element to control how long it lasts for - how wide the bar is, in other words. “Outage severity” affects how deep the outage is when it happens.
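To make the two controls concrete, here is a rough sketch of what a click could do to the underlying data. Everything in it is hypothetical (the `diurnal_traffic` shape, the `apply_outage` name, and treating severity as a fraction between 0 and 1); it only illustrates "duration is how wide, severity is how deep".

```python
import math

MINUTES_PER_DAY = 1440  # the graph covers one day, one point per minute


def diurnal_traffic(peak=1000.0, trough=200.0):
    """Hypothetical traffic shape: a smooth daily cycle, lowest at midnight."""
    mid = (peak + trough) / 2
    amplitude = (peak - trough) / 2
    return [mid - amplitude * math.cos(2 * math.pi * m / MINUTES_PER_DAY)
            for m in range(MINUTES_PER_DAY)]


def apply_outage(traffic, start, duration, severity):
    """Return a copy of traffic with minutes [start, start + duration) scaled down.

    severity is the fraction of traffic lost: 1.0 is a total outage,
    0.5 drops half of the requests.
    """
    if not 0.0 <= severity <= 1.0:
        raise ValueError("severity must be between 0 and 1")
    end = min(start + duration, len(traffic))
    result = list(traffic)
    for minute in range(max(start, 0), end):
        result[minute] *= 1.0 - severity
    return result


traffic = diurnal_traffic()
# One "click" at 10:00 with a 30-minute, half-depth outage.
after_click = apply_outage(traffic, start=600, duration=30, severity=0.5)
```

In this sketch, clicking the same spot repeatedly compounds the damage (two half-depth outages leave a quarter of the traffic). That is one plausible reading of the nose-dive, not a statement about how the tool itself combines overlapping outages.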

Currently the graph is hard-coded to show a day’s worth of traffic, one point per minute (1440 minutes).

If you look towards the bottom, you should also see our way of calculating how damaging your destructive impulses have been - the tool assigns an arbitrary revenue number (changeable with the “Income model” dropdown) to the area under the curve, and keeps track of how much each outage costs. You can also change the shape of the traffic graph with the “Traffic model” dropdown, in case your pattern looks different.

But the real magic happens when you change “CUJ model” to “E-commerce”. Then you’ll see the traffic graph update to show stacked tiers of traffic - these represent the famous critical user journeys, an emerging paradigm in observability. In our case, though, behind the scenes we also modify the outage generation routines to show how, if you protect your most important ones (checkout!), you get to survive outages much better.
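As a back-of-the-envelope illustration of both ideas (cost as area under the curve, and protecting the most valuable CUJ), here is a small sketch. The tier names, traffic levels, and per-request revenue figures are invented for the example, and the assumption that a protected CUJ is completely untouched by the outage is a simplification, not a description of the tool's internals.

```python
def outage_cost(traffic, start, duration, severity, revenue_per_request):
    """Revenue lost = area removed from under the curve times a flat rate."""
    end = min(start + duration, len(traffic))
    lost_requests = sum(traffic[m] * severity for m in range(start, end))
    return lost_requests * revenue_per_request


def cuj_outage_cost(tiers, start, duration, severity, protected=()):
    """Sum the outage cost over CUJ tiers, skipping protected ones.

    tiers maps a CUJ name to (per-minute traffic list, revenue per request).
    Protected CUJs are ASSUMED to be entirely unaffected by the outage.
    """
    total = 0.0
    for name, (traffic, rate) in tiers.items():
        if name in protected:
            continue
        total += outage_cost(traffic, start, duration, severity, rate)
    return total


# Hypothetical e-commerce tiers: flat traffic, made-up revenue per request.
tiers = {
    "browse":   ([600.0] * 1440, 0.001),
    "search":   ([300.0] * 1440, 0.002),
    "checkout": ([100.0] * 1440, 0.10),
}

# A one-hour total outage at noon, with and without checkout protected.
unprotected = cuj_outage_cost(tiers, start=720, duration=60, severity=1.0)
with_checkout_safe = cuj_outage_cost(tiers, start=720, duration=60,
                                     severity=1.0, protected=("checkout",))
```

With these invented figures the unprotected outage costs about 672 revenue units, while keeping checkout alive brings that down to about 72, even though checkout is only a tenth of the traffic. The exact numbers are arbitrary; the point is how lopsided the split can be.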

Niall Murphy

Niall Murphy is an award-winning, best-selling author, speaker, industry leader, and executive. He has been working in Internet infrastructure since the mid-90s, in a variety of roles spanning systems, network, and software engineering, at scopes ranging from individual contributor to director.

He is best known for the Site Reliability Engineering book and associated works.

He lives in Dublin, Ireland with his wife and two children, and holds degrees in Computer Science, Mathematics, and Poetry Studies.
