Many years ago, I arrived at a client's office one morning to discover the website down and the email server completely unresponsive. Naturally, we assumed that we’d been hacked. What else could take down two unrelated systems at the same time?
After several hours of trying to get these systems back online, we discovered that they weren’t as unrelated as we’d originally assumed. The problem hadn’t been hackers after all, but rather a naive assumption on the part of one of the programmers.
The programmer had put logic into the website so that if it was unable to connect to the database, it would send an email to the help desk. Overnight, the machine hosting the database had run out of disk space, making it impossible for the website to connect. So when someone tried loading the website, an email would be sent.
That doesn’t seem horrible on the surface. The problem was that an email was sent on EVERY failure. By the time we arrived in the office, thousands of emails had already been sent to the help desk and the email server was overwhelmed. The more we attempted to troubleshoot, the more emails were sent and the less responsive the system became.
It turns out that this is a well-understood problem in the monitoring space, and anyone using a proper monitoring system is quite safe from it. The basic pattern is that we log every error and the monitoring system notifies support once in total, not once per error.
It’s really a little more complicated than that, as there are escalations and reminders, but the point is that a small number of notifications are sent, regardless of the number of actual errors that happen. It’s a well-understood problem.
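To make the pattern concrete, here is a minimal Python sketch of "log every error, notify once". It is not how any particular monitoring product works; the class name `OncePerIncidentNotifier`, the `send` callback, and the `reminder_after` interval are all hypothetical names invented for this illustration. It assumes a single process and a single kind of incident.

```python
import time
from typing import Callable, List, Optional


class OncePerIncidentNotifier:
    """Hypothetical sketch: record every error, notify support once per incident."""

    def __init__(
        self,
        send: Callable[[str], None],                    # assumed callback, e.g. sends one email
        reminder_after: float = 3600.0,                 # assumed reminder interval in seconds
        clock: Callable[[], float] = time.monotonic,    # injectable so it can be tested
    ) -> None:
        self._send = send
        self._reminder_after = reminder_after
        self._clock = clock
        self._last_sent: Optional[float] = None         # None means no open incident
        self.error_log: List[str] = []                  # every error is still recorded

    def record_error(self, message: str) -> bool:
        """Log the error. Return True only if a notification was actually sent."""
        self.error_log.append(message)
        now = self._clock()
        recently_notified = (
            self._last_sent is not None
            and now - self._last_sent < self._reminder_after
        )
        if recently_notified:
            return False                                # logged, but no new notification
        self._last_sent = now
        self._send(f"{message} ({len(self.error_log)} errors logged so far)")
        return True

    def resolve(self) -> None:
        """Close the incident so the next error triggers a fresh notification."""
        self._last_sent = None
        self.error_log.clear()
```

With something like this in place of the "email on every failure" logic, a thousand failed database connections would produce a thousand log entries but only one message to the help desk, plus an occasional reminder if the incident stays open. Real monitoring systems add escalation, grouping, and persistence on top of this idea.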
Imagine my surprise, then, when I recently discovered a team that was making this very same mistake: sending a new email on every error. They had no idea of the problems this would cause, and likely nobody would have discovered it until well after it was in production.
The real lesson here isn’t that we should use monitoring tools[1], although we should. The lesson is that most of the problems we’re solving today have already been solved before, and we should be learning from the past rather than continually reinventing the wheel.
We have a tendency to take inexperienced people and have them work as fast as possible, all by themselves. This results in mistakes that could have easily been avoided and causes us to solve problems that have already been solved. Both dangerous and wasteful.
Real collaboration will help. Pairing up more senior people with less experienced ones will help. Taking the time to learn, instead of always focusing on delivery, will help.
Take the time and learn from the past.
[1] If the idea of monitoring really is new to you, then also check out observability, which is the next generation of it.