Everyone uses some sort of monitoring. Networking, service, host, we do it all. It's good. We need it. How else would we know if there was a problem?
I'm sure we can all agree that our monitoring could always be improved. Maybe we should tweak the alert escalations a bit. Maybe we should stop sending pages for a nagging flaky connection that doesn't really matter--whatever.
Key Performance Indicators and S/N Ratios
Are those modifications and tweaks really that important to the service level the end user experiences? Are they directly related to your Key Performance Indicators (KPIs)? Do they cause a quantifiable impact upon the traffic to your site? Even then, is that impact directly affecting your bottom line?
If not, should you still continue to monitor them? Maybe so. But if you do, it's best to ensure they don't decrease the signal-to-noise ratio of your alerting. Otherwise, the real problems end up buried under pages that nobody reads.
KPIs can be easily underestimated and misunderstood. A KPI is not how many hits your site gets over a period of time. While that may be important, it's not a KPI. KPIs are usually specific to your business--simple examples are how many conversions you get, or how many new users sign up from invitations.
These are the end-result metrics that should be watched like a hawk. If you monitor other aspects of your network (and you probably should), make sure you understand how these affect your KPIs.
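To make the distinction concrete, here's a minimal sketch (the function name and numbers are hypothetical, not from any real system) of why a raw counter like hits isn't a KPI, but a conversion rate built on top of it can be:

```python
# Hypothetical sketch: a raw metric (visits) vs. a KPI (conversion rate).
# All names and numbers here are illustrative assumptions.

def conversion_rate(signups: int, visits: int) -> float:
    """Signups per visit -- an end-result KPI, not a raw counter."""
    return signups / visits if visits else 0.0

# Raw traffic doubled, so a hits graph looks great...
last_week = conversion_rate(signups=120, visits=4_000)
this_week = conversion_rate(signups=150, visits=8_000)

# ...but the KPI actually fell, which is what you should be paged about.
assert this_week < last_week
```

The point: a dashboard full of rising counters can hide a falling KPI, so watch the ratio, not just the numerator.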
Alert Thresholds
A common pitfall of monitoring is misunderstanding how best to monitor rate-based metrics. If there is a need to monitor bandwidth usage, incoming requests, disk usage, etc., the typical approach is to define a static watermark and alert above/below it.
Is that truly the best method, though? How is knowing that your RAID volume is at 80% now going to help you six months from now? What you probably really want to know is the rate of increase in disk usage. And if you were Twitter monitoring the rate of new tweets, wouldn't you want to know if it suddenly dropped over the past five minutes?
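A rate-of-change check like the Twitter example can be sketched in a few lines. This is a hypothetical illustration, not any particular tool's implementation; the window size and drop threshold are assumptions you'd tune for your own traffic:

```python
# Hypothetical sketch: alert on a sudden drop in a rate, not a static watermark.
# Window size, drop fraction, and the sample numbers are illustrative assumptions.
from collections import deque

class RateWatcher:
    """Track per-interval counts and flag a sudden drop vs. the recent average."""

    def __init__(self, window: int = 5, drop_fraction: float = 0.5):
        self.history = deque(maxlen=window)  # e.g. the last five 1-minute counts
        self.drop_fraction = drop_fraction   # alert if below 50% of the average

    def observe(self, count: int) -> bool:
        """Record the latest count; return True if it warrants an alert."""
        alert = False
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            alert = count < baseline * self.drop_fraction
        self.history.append(count)
        return alert

w = RateWatcher()
for c in (100, 110, 95, 105, 100):   # steady traffic: no alerts
    assert not w.observe(c)
assert w.observe(30)                 # sudden drop below half the average: page!
```

A static watermark like "alert below 40 tweets/minute" would stay silent while a site that normally does 10,000 tweets/minute quietly falls to 100; the relative check above catches that immediately.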
Putting It All Together
Make sure you use these techniques to enhance your monitoring systems. Without monitoring your database load, or the number of threads in use on an app server, you won't know what's causing a sudden drop in activity on your site or app. A holistic approach is necessary to provide a complete view of the health of your network.
Think about how you can apply these ideas to your own monitoring. Applying principles like these will provide multiple benefits: Not only will you probably get paged less for non-critical problems, but you might actually wind up increasing the level of service you provide to your customers!
Friday, February 5, 2010
Hi Scott,
It's useful to track all of the monitors that can go off and compare them with the alerts that have actually gone off, as you can use that data to refine your alerts and also to find alerts that may not be working. This can help define KPIs as well by improving S/N.
Also, I've found it useful to trend, trend, trend the results from the monitors, because if I don't know where to set a threshold, I can look at a historical trend to see if I can figure out what is normal and what is an error condition.
Thanks for the post!
-Adam
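Adam's trending idea can be sketched simply: derive the threshold from history instead of guessing it. This is a hypothetical illustration; the three-sigma rule and the sample latencies are assumptions, and real traffic with daily cycles would need something smarter:

```python
# Hypothetical sketch: set an alert threshold from a historical trend
# instead of picking a static number. The 3-sigma rule and sample data
# are illustrative assumptions.
import statistics

def threshold_from_history(samples: list[float], sigmas: float = 3.0) -> float:
    """Upper alert threshold: mean of past samples plus a few standard deviations."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return mean + sigmas * stdev

# A week of daily response-time averages (ms); "normal" emerges from the data.
week_of_latencies_ms = [120, 135, 128, 140, 122, 131, 126]
limit = threshold_from_history(week_of_latencies_ms)
print(f"alert above {limit:.1f} ms")
```

The threshold then moves with the data: as normal behavior shifts, so does the line between normal and an error condition.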
Yeah, correlation for sure! Being able to graph a mashup of Nagios alerts with Munin/Ganglia/etc. would be pretty sweet. Hmm...