How many alerts are too many?

Recently I came across a Reddit post where the author wanted to know if their on-call situation was standard or if it was out of the ordinary. The post describes a tire fire, not a production-ready software system:

Is it normal to get alerted multiple times each and every hour of oncall, and then expected to go in to work?

Oncall is 12am to 8am. Alerted every 15min from 12-2am, and then every hour after that. I am then expected to come in from 10am-5pm.

This is the norm these days for our oncall shift. It’s one week long.

Is this normal??? This seems absurd and terrible for my health

Thankfully, the vast majority of the replies said that this is not a normal situation. And I wholeheartedly agree that it shouldn't be normal.

A sampling of the replies to the Reddit post:

No, that isn't normal at all. That sounds like a system that isn't close to being production quality or the good old "crying wolf" alerts that aren't actionable or useful.

and

No, that's insane.

When a system is constantly on fire all non-firefighting work should be stopped and all hands brought on-deck to address the root causes of the issue until the system is stabilized.

and

Being alerted more than once every two weeks or a month is a problem with the system. More than once a day is terrible; multiple times an hour? there’s no describing it; completely broken useless mess.

The post brought up old memories for me - I've had prior jobs with similar levels of on-call alerts. At the time, I also didn't know if this level of alerting was normal or not. Was I being unrealistic in wanting to generally sleep through the night when I was on-call? Or was a system noisier than a colicky newborn just normal? And at the time, there weren't many online resources on this topic.

So I wanted to write this post to:

  1. Add my voice to the choir saying that a high level of software alerts is not normal - and you don't have to tolerate it as an on-call engineer
  2. Share stories of how I helped burn down this problem on different teams in the past, in the hope that they might help others solve similar problems on their teams

Sleeping next to a laptop

My first role with an on-call rotation, years ago, was a rough introduction to the on-call world - along the lines of this Reddit post.

That system fired a page every time there was an exception. Any exception. And there were a lot of exceptions. Exceptions on service-to-service calls (with no retries), exceptions when a user asked for a record that didn't exist, and more.

In all, there were dozens of pages during each work day - and several each night.

During my first on-call shift, there were so many pages that I slept in my basement next to my laptop to avoid repeatedly waking up my wife as well. We had a newborn at home who was finally starting to sleep through the night - no need for both of us to wake up each time I got paged.

This situation wasn't great. Right away I started working through some of the issues that were causing all these alerts:

  • Adding retries between microservices to guard against the inevitable network hiccups in a distributed system (a sketch follows this list)
  • Removing alerts for client caller issues such as passing in bad input data, asking for records that didn't exist, etc.
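
As a rough sketch of what that first item looked like in practice - a hypothetical retry helper with exponential backoff, not the exact code we shipped:

```python
import time
import requests

def call_with_retries(url, attempts=3, backoff_seconds=0.5):
    """Call a downstream service, retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            response = requests.get(url, timeout=2)
            if response.status_code < 500:       # only retry server-side failures
                response.raise_for_status()      # 4xx means a caller error - don't retry
                return response.json()
        except (requests.ConnectionError, requests.Timeout):
            pass                                 # network hiccup - retry below
        if attempt < attempts - 1:
            time.sleep(backoff_seconds * 2 ** attempt)
    raise RuntimeError(f"{url} still failing after {attempts} attempts")
```

The point isn't the specific library - it's that a transient network blip between two services should never reach a pager.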

Also, this system was wholly reliant on a third-party service that was very flaky. We had an uptime SLA in place with this third party, but that SLA was continually violated, and there weren't any remedies with enough teeth to make them take action. So nothing changed. If your system relies on a third-party service with no alternatives in place, an SLA won't fix their reliability issues. This SLA wasn't worth the paper it was printed on.

Instead, we eventually built an in-house version of that capability so we were no longer dependent on this flaky third-party service. As we rolled out our in-house version, we first used it as a fallback whenever the third-party service failed. We used that fallback during the busiest time of the year, and it repeatedly kicked in when the third-party service inevitably failed. Our customers (and our pagers) were no longer affected by these third-party failures. (We also eventually switched to using only our in-house solution once we fully trusted it.)
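
In code, the fallback stage of that rollout looked roughly like this - a sketch with hypothetical third_party_client and in_house_client objects standing in for the real integrations:

```python
import logging

logger = logging.getLogger(__name__)

def fetch_capability(request, third_party_client, in_house_client):
    """Prefer the third-party service, but quietly fall back to the in-house
    version when it fails so customers (and pagers) don't see the failure."""
    try:
        return third_party_client.handle(request)
    except Exception:
        # Log for visibility, but don't page - the fallback keeps users unaffected.
        logger.warning("Third-party call failed; using in-house fallback", exc_info=True)
        return in_house_client.handle(request)
```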

With these (and other) changes, after a few months we went from dozens of pages a day down to days between pages. Not perfect, but much better than sleeping next to a laptop in the basement.

26% of every hour

Later, I worked at a company where, in my first week on the job, I got a concerning engineering-wide email. The email was about the results of the SRE team's investigation into the on-call load throughout the company. The results described a wide variance in the company's on-call burden, from manageable to out-of-control.

This wasn't a good sign. But maybe I'd be lucky and my new team would be on the manageable side, right? Aiming to find out, I got access to PagerDuty ASAP and checked the alert counts across services. Bad news - my team had the second-highest number of alerts across the entire company - with over 100 alerts in the prior week. Not good.

First, I tried to bring visibility into the problem. PagerDuty is great at the mechanics of paging people - setting up rotations, escalation policies, alert methods, etc. But I've found it has minimal reporting capabilities for understanding a team's on-call load.

You can get alert counts for a given time frame from PagerDuty - but alert counts alone don't tell the whole story. If there is a major incident, many alerts may fire in a short time - all related to the same issue. Twenty alerts during one incident at 2pm on a Tuesday are not as impactful to the on-call engineer as 20 alerts spread from 1-5am each night during the week. And at the time I couldn't find a good way to get more meaningful alert reporting data out of PagerDuty.

But PagerDuty does have ways to get the raw data of alerts - such as CSV export or API access - including the exact times that alerts fired. With that raw alert data, I built tooling to analyze the frequency and impact of alerts. Alerts during the work day aren't great, but alerts at 2am are much worse.
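
As an illustration of that tooling, here is a minimal sketch of the kind of analysis I mean, assuming a PagerDuty CSV export with a created_on timestamp column (the exact column names vary by export):

```python
import pandas as pd

alerts = pd.read_csv("pagerduty_export.csv", parse_dates=["created_on"])
timestamps = alerts["created_on"]

# Fraction of all hours in the export window that had at least one page.
hours_with_pages = timestamps.dt.floor("h").nunique()
total_hours = (timestamps.max() - timestamps.min()) / pd.Timedelta(hours=1)
print(f"{hours_with_pages / total_hours:.0%} of hours had at least one page")

# Nights with at least one page, where a "night" is 10pm-6am. Shifting by six
# hours groups 11pm Monday and 2am Tuesday into the same night.
night_pages = timestamps[(timestamps.dt.hour >= 22) | (timestamps.dt.hour < 6)]
nights = (night_pages - pd.Timedelta(hours=6)).dt.normalize().nunique()
weeks = total_hours / (24 * 7)
print(f"~{nights / weeks:.1f} nights per week had at least one page")
```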

Talking with folks about just the plain number of alerts on the team wasn't changing any hearts or minds. But once I analyzed the data and showed that over the prior quarter:

  • 26% of all hours (workday, nighttime, weekend, etc.) had at least one page, and
  • 5 out of 7 nights a week had at least one page

Those numbers made the alert burden much more palpable. A system that pages a person during 26% of all hours? That's closer to a Mechanical Turk machine powered by humans than actual automated software.

With this more visceral representation of the on-call burden, it was easier to get buy-in to fix the problem. Over the course of a couple of months, several team members and I analyzed the source of each alert and burned those alerts down until we had days between pages. Again, not perfect, but much better than 100 alerts a week.

Taming frequent alerts

How did my teams and I work through the issues causing our noisy pages? Unfortunately, in my experience there isn't one silver bullet that magically fixes out-of-control alerts. Once a system is in this noisy state, there are likely a few bigger issues plus a long tail of other issues that need taming. It will likely take a combination of:

  1. Solving legitimate issues in the system itself
  2. Making the system more resilient against failures and/or latency in dependent systems
  3. Tuning actionable alerts so they only fire when there is significant user impact and human intervention is required (a sketch follows this list)
  4. Removing unactionable alerts
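
On the third point, the shift is usually from paging on every individual exception to paging only when an error rate stays elevated for a sustained window. This is normally configured in your monitoring system, but here is a minimal sketch of the idea with made-up thresholds:

```python
import time
from collections import deque

class ErrorRatePager:
    """Page only when the error rate stays above a threshold across a window,
    instead of paging on every individual exception."""

    def __init__(self, threshold=0.05, window_seconds=600):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.events = deque()          # (timestamp, was_error) pairs

    def record(self, was_error):
        now = time.time()
        self.events.append((now, was_error))
        # Drop anything older than the window.
        while self.events and self.events[0][0] < now - self.window_seconds:
            self.events.popleft()

    def should_page(self):
        if not self.events:
            return False
        errors = sum(1 for _, was_error in self.events if was_error)
        return errors / len(self.events) > self.threshold
```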

Bring visibility into the problem

One of the first steps in taming an out-of-control alerting situation is to show the right people the severity of the problem. Again, the raw number of alerts may not be convincing enough for leadership to take action.

Depending on the scope of the alerting problem, I've used some of the following approaches to raise awareness of the alerting burden:

  • A weekly email of the business-hours and off-hours alerts for each team in an organization (including the delta from the prior week), sent to every manager and senior leadership - with links to investigate longer-term trends as well as dive into each alert itself (a sketch follows this list)
  • Tooling to analyze how many hours of the day had at least one alert, how many nights of the week had at least one alert that woke up the on-call engineer, etc. (as I mentioned earlier)
  • Analyzing and posting the signal-to-noise ratio of the past week's alerts. For example, a signal-to-noise ratio of 33% if there are 3 real alerts and 6 false-positives in the past week.
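
For that weekly summary, here is a sketch of the underlying analysis - again assuming a PagerDuty CSV export, with the service and created_on column names and the weekday 9am-6pm definition of business hours as assumptions:

```python
import pandas as pd

alerts = pd.read_csv("pagerduty_export.csv", parse_dates=["created_on"])

# Split each alert into business-hours vs. off-hours buckets.
is_business = (
    alerts["created_on"].dt.weekday.lt(5)            # Monday-Friday
    & alerts["created_on"].dt.hour.between(9, 17)    # 9am-6pm
)
alerts["bucket"] = is_business.map({True: "business_hours", False: "off_hours"})

# Per-team counts for the most recent week; run the same query over the prior
# week to compute the week-over-week delta for the email.
cutoff = alerts["created_on"].max() - pd.Timedelta(days=7)
last_week = alerts[alerts["created_on"] >= cutoff]
summary = last_week.groupby(["service", "bucket"]).size().unstack(fill_value=0)
print(summary)
```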

In addition to helping get buy-in that the on-call situation is not sustainable, these metrics are valuable to track over time to measure the impact of your team's work to burn down the alert burden - and to help ensure the problem doesn't get worse again in the future.

Review every alert and take action

Review every single alert that fires in the team and disposition it. Was it caused by a legitimate issue in the system that needs to be fixed? Great - create a ticket to fix it and ensure it is prioritized. Or was the alert just noise? Then tune or remove that alert.

If the system is starting from a point where there are many alerts, you may get pushback that it's too time-consuming to review every alert. But in those situations it's especially critical to spend the time to review each alert and figure out the best way to resolve it so it doesn't happen again. Otherwise, good luck fully solving the alerting problem.

Ensure every alert has a runbook

If an alert fires at 2am and there isn't a corresponding runbook detailing what steps the on-call engineer should take - what do you expect to happen? Especially if the on-call engineer isn't an expert on that part of the system?

Create a runbook with the steps for the on-call engineer to follow for each alert. And link directly to the specific runbook from the alert message so the on-call engineer doesn't have to go hunting for it when the alert fires - saving precious time and restoring service faster.
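
If you page through PagerDuty, one way to make that link unavoidable is to attach the runbook when the alert is triggered. A sketch using the Events API v2, with a placeholder routing key, service name, and URL:

```python
import requests

def page_with_runbook(routing_key, summary, runbook_url):
    """Trigger a PagerDuty incident that carries a direct link to its runbook."""
    event = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": "checkout-service",   # placeholder service name
            "severity": "critical",
        },
        # The link shows up on the incident, so the on-call engineer isn't
        # left hunting for the runbook at 2am.
        "links": [{"href": runbook_url, "text": "Runbook for this alert"}],
    }
    response = requests.post("https://events.pagerduty.com/v2/enqueue",
                             json=event, timeout=5)
    response.raise_for_status()
```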

The runbook doesn't have to be fancy, but at a minimum include:

  • The impact when this problem occurs. How severe is the issue? What business processes, other systems, etc. will be affected?
  • What steps to take to investigate the cause - links to observability queries, etc.
  • How to mitigate impact as soon as possible. Are there feature flags / control rods available to help, etc.?

If you can't write a runbook for a given alert, is it actionable? What do you expect the on-call engineer to do when it fires? Should this alert exist?

Conclusion

If you're an engineer in a situation with frequent pages like this, know that it is not normal and you don't have to accept it. Bring visibility to the problem and tackle the alert burden as a team.

If you're an engineering leader and you don't know the on-call load on your team, build that visibility right away. And if your team's on-call load looks anything like these stories - create space for the team to tackle these issues. Be a leader in action as well as title. Or risk your team's productivity being crushed under this on-call weight.