Everyone uses some sort of monitoring. Networking, service, host, we do it all. It's good. We need it. How else would we know if there was a problem?
I'm sure we could all agree that our monitoring could even be improved. Maybe we should tweak the alert escalations a bit. Maybe we should stop sending pages for a nagging bad connection that doesn't really matter--whatever.
Key Performance Indicators and S/N Ratios
Are those modifications and tweaks really that important to the service level the end user experiences? Are they directly related to your Key Performance Indicators (KPIs)? Do they cause a quantifiable impact upon the traffic to your site? Even then, is that impact directly affecting your bottom line?
If not, should you still continue to monitor them? Maybe so. But if you do, it's best to ensure they don't decrease the signal-to-noise ratio of your alerting. Otherwise you might end up with a situation like this:
KPIs can be easily underestimated and misunderstood. A KPI is not how many hits your site gets over a period of time. While that may be important, it's not a KPI. KPIs are usually subjective--simple examples might be the number of conversions, the number of new users who sign up from invitations, etc.
These are the end-result metrics that should be watched like a hawk. If you monitor other aspects of your network (and you probably should), make sure you understand how these affect your KPIs.
Alert Thresholds
One common monitoring pitfall is misunderstanding how best to monitor rate-based metrics. If there is a need for monitoring bandwidth usage, incoming requests, disk usage, etc., the typical approach is to define a static watermark and alert above/below that.
Is that truly the best method, though? How is knowing that your RAID volume is at 80% now going to help you six months from now? What you probably really want to know is the rate of increase in disk usage. If you were Twitter and you wanted to monitor the rate of new tweets, wouldn't you want to know if that suddenly decreased over the past five minutes?
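To make the difference concrete, here is a minimal Python sketch of rate-based checks. Everything in it is an assumption made for illustration: the Sample type, the function names, the 50% drop threshold, and the simple first-to-last rate calculation are hypothetical and not taken from any particular monitoring tool. A real system would likely smooth the rate over more samples.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

DAY = 86400  # seconds in a day


@dataclass
class Sample:
    timestamp: float  # seconds since some reference point
    value: float      # e.g. percent of disk used, or a cumulative event count


def rate_per_second(samples: Sequence[Sample]) -> float:
    """Average rate of change between the first and last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1].timestamp - samples[0].timestamp
    if elapsed <= 0:
        raise ValueError("samples must span a positive time interval")
    return (samples[-1].value - samples[0].value) / elapsed


def seconds_until_full(samples: Sequence[Sample],
                       capacity: float = 100.0) -> Optional[float]:
    """Project when usage reaches capacity, assuming linear growth.

    Returns None when usage is flat or shrinking.
    """
    rate = rate_per_second(samples)
    if rate <= 0:
        return None
    return (capacity - samples[-1].value) / rate


def dropped_suddenly(recent_rate: float, baseline_rate: float,
                     max_drop_fraction: float = 0.5) -> bool:
    """True when the recent rate fell more than max_drop_fraction below baseline."""
    if baseline_rate <= 0:
        return False  # no meaningful baseline to compare against
    return recent_rate < baseline_rate * (1 - max_drop_fraction)


if __name__ == "__main__":
    # Disk went from 70% to 80% over 30 days: how long until it is full?
    disk = [Sample(0, 70.0), Sample(30 * DAY, 80.0)]
    print(round(seconds_until_full(disk) / DAY), "days until full")

    # Events per minute: normally about 100, only 40 in the last five minutes.
    print(dropped_suddenly(recent_rate=40, baseline_rate=100))
```

A static watermark at 80% would fire the same alert whether the volume took three years or three days to get there. The projected time until full tells you how urgent it is, and the drop check catches a sudden decrease that never crosses a fixed threshold.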
Putting It All Together
Make sure you use these ideas to enhance your existing monitoring systems. Without monitoring your database load, without monitoring the number of threads used on an app server, you really won't know what's causing a sudden drop in activity on your site or app. A holistic approach is necessary to provide a complete view of the health of your network.
Think about how you can apply these ideas to your own monitoring. Applying principles like these will provide multiple benefits: Not only will you probably get paged less for non-critical problems, but you might actually wind up increasing the level of service you provide to your customers!