Tuesday, February 23, 2010

Release-ban: Leveraging Kanban to Manage Your Releases

The process of releasing your software to production can be quite heavy and arduous. Even with some of the concepts, tools and APIs available to us today, there are usually a number of steps required to simply copy it over to a set of hosts. You need to run tests, stage the release, etc. If you think about it, the release process itself resembles an assembly line.

What if we took that a step further and actually turned it into a real assembly line? Using Kanban, your organization could then easily visualize bottlenecks as a piece of software progresses through the stages of a release.

Having issues with a pre-release step such as integration testing? You'll know because a bunch of releases will be stuck in the Release Backlog.

Is it too tedious to actually push the software to production? You'll know because a bunch of releases will be stuck in the Staging state.

How does this work, exactly?

Let me show you with an example Kanban board designed for releases:


We have a complex environment consisting of four development teams that need to perform releases on their own schedule. The initial layout gives us a Release Backlog and three states in which a release can exist: Pre-release, Staging and Production. Pre-release covers any configuration prep, QA testing, etc. The other two should be self-explanatory.

Here's where Kanban really shines. Using WIP (Work In Progress) limits, we can impose constraints at specific points of the release process.

 

The WIP limit depicted above means that only three pieces of software can be in "Staging" at any given point. A team cannot begin staging a fourth release until at least one of the three has moved on to "Production". A similar limit can and should be implemented in the "Production" state.
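If it helps to see the rule in code, here is a minimal sketch of a board that enforces WIP limits. The class and method names (ReleaseBoard, move, finish, WipLimitExceeded) are made up for illustration and don't come from any real Kanban tool. The finish step that takes a release off the board is also my own assumption.

```python
class WipLimitExceeded(Exception):
    """Raised when a move would push a state past its WIP limit."""


class ReleaseBoard:
    # The backlog plus the three states from the example board.
    STATES = ["Release Backlog", "Pre-release", "Staging", "Production"]

    def __init__(self, wip_limits):
        # wip_limits maps a state name to the maximum number of releases
        # allowed in it; states that are not listed are unlimited.
        self.wip_limits = dict(wip_limits)
        self.columns = {state: [] for state in self.STATES}

    def add(self, release):
        """New releases always start in the backlog."""
        self.columns["Release Backlog"].append(release)

    def state_of(self, release):
        for state, items in self.columns.items():
            if release in items:
                return state
        raise KeyError(release)

    def move(self, release, target):
        """Move a release to another state, refusing if the target is full."""
        limit = self.wip_limits.get(target)
        if limit is not None and len(self.columns[target]) >= limit:
            raise WipLimitExceeded(f"{target} is at its WIP limit of {limit}")
        self.columns[self.state_of(release)].remove(release)
        self.columns[target].append(release)

    def finish(self, release):
        """Take a release off the board once it is done (assumed final step)."""
        self.columns[self.state_of(release)].remove(release)


# Hypothetical release names, one per team.
demo = ReleaseBoard({"Staging": 3, "Production": 1})
for name in ["team-a", "team-b", "team-c", "team-d"]:
    demo.add(name)
for name in ["team-a", "team-b", "team-c"]:
    demo.move(name, "Staging")
# Moving "team-d" to Staging now would raise WipLimitExceeded.
demo.move("team-a", "Production")  # frees a Staging slot
demo.move("team-d", "Staging")     # now allowed
```

The blocked move is the whole point: the fourth release stays visibly stuck in an earlier column until something downstream clears.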

If there are issues in the release process, they will cause a backup of items in the prior states and/or backlog itself. Visualizing the states of all current and upcoming releases on the Kanban board will make it more obvious where the bottleneck exists.

One of the great things about Kanban is that it allows a lot of flexibility. You can define a WIP limit to be virtually anything. For instance, you could weight releases based on the length of time they take. Or you could weight them based on the relative impact to your site if that software's release went badly.
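As a rough sketch of what a weighted limit could look like: instead of counting releases, each release carries a weight, and the state has a budget. The function name, the weights and the budget of 5 below are all invented for illustration. How you score duration or risk is up to your organization.

```python
def fits_weighted_limit(weights_in_state, candidate_weight, limit):
    """Return True if a release of candidate_weight fits within the limit.

    weights_in_state: weights of the releases already in the state.
    limit: total weight the state may hold (an assumed budget).
    """
    return sum(weights_in_state) + candidate_weight <= limit


# Assumed scoring: 1 = routine release, 3 = risky or slow release.
staging_now = [3, 1]  # one risky release and one routine release
routine_fits = fits_weighted_limit(staging_now, 1, limit=5)  # 3 + 1 + 1 = 5
risky_fits = fits_weighted_limit(staging_now, 3, limit=5)    # 3 + 1 + 3 = 7
```

With these numbers, a routine release still fits in Staging but a second risky one has to wait.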

There are some Kanban specifics that I didn't detail because they are a bit out of the scope of this blog post. If you find the concept interesting I urge you to look into it.

Friday, February 5, 2010

What Does Monitoring Mean To You?

Everyone uses some sort of monitoring. Networking, service, host, we do it all. It's good. We need it. How else would we know if there was a problem?

I'm sure we could all agree that our monitoring could even be improved. Maybe we should tweak the alert escalations a bit. Maybe we should stop sending pages for a nagging bad connection that doesn't really matter--whatever.

Key Performance Indicators and S/N Ratios

Are those modifications and tweaks really that important to the service level the end user experiences? Are they directly related to your Key Performance Indicators (KPIs)? Do they cause a quantifiable impact upon the traffic to your site? Even then, is that impact directly affecting your bottom line?

If not, should you still continue to monitor them? Maybe so. But if you do, it's best to ensure they don't decrease the signal-to-noise ratio of your alerting. Otherwise you might end up with a situation like this:

 

KPIs can be easily underestimated and misunderstood. A KPI is not how many hits your site gets over a period of time. While that may be important, it's not a KPI. KPIs are usually specific to your business--simple examples may be how many conversions you get, how many new users sign up from invitations, etc.

These are the end-result metrics that should be watched like a hawk. If you monitor other aspects of your network (and you probably should), make sure you understand how these affect your KPIs.

Alert Thresholds

One common pitfall of monitoring is misunderstanding how best to monitor rate-based metrics. If there is a need for monitoring bandwidth usage, incoming requests, disk usage, etc., the typical approach is to define a static watermark and alert above/below that.

Is that truly the best method, though? How is knowing that your RAID volume is at 80% now going to help you six months from now? What you probably really want to know is the rate of increase in disk usage. If you were Twitter and you wanted to monitor the rate of new tweets, wouldn't you want to know if that suddenly decreased over the past five minutes?
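Here is one way to sketch rate-based checks, as an illustration rather than a recipe for any particular monitoring tool. It covers both cases: projecting when a disk fills up from its growth rate, and flagging a sudden drop in an activity rate. The function names, the (timestamp, value) sample format and the 50% drop threshold are all assumptions of mine.

```python
def growth_rate(samples):
    """Average change per second between the first and last sample.

    samples: list of (timestamp_seconds, value) pairs, oldest first.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive amount of time")
    return (v1 - v0) / (t1 - t0)


def seconds_until_full(samples, capacity):
    """Project when usage reaches capacity at the current growth rate.

    Returns None if usage is flat or shrinking.
    """
    rate = growth_rate(samples)
    if rate <= 0:
        return None
    return (capacity - samples[-1][1]) / rate


def rate_dropped(current_rate, baseline_rate, max_drop_fraction=0.5):
    """True if the current rate fell more than max_drop_fraction below baseline."""
    if baseline_rate <= 0:
        return False
    return current_rate < baseline_rate * (1 - max_drop_fraction)


# Disk usage in GB sampled one day apart on a 1000 GB volume (made-up numbers).
disk_samples = [(0, 700), (86400, 710)]
days_left = seconds_until_full(disk_samples, capacity=1000) / 86400

# New items per minute: normal baseline vs. the last five minutes (made-up numbers).
alert = rate_dropped(current_rate=40, baseline_rate=100)
```

A volume at 71% growing 10 GB a day has about 29 days left, which is far more actionable than a static "above 80%" alert.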

Putting It All Together

Make sure you use these ideas to enhance your existing monitoring, not to replace it. Without monitoring your database load, without monitoring the number of threads used on an app server, you really won't know what's causing a sudden drop in activity on your site or app. A holistic approach is necessary to provide a complete view of the health of your network.

Think about how you can apply these ideas to your own monitoring. Applying principles like these will provide multiple benefits: Not only will you probably get paged less for non-critical problems, but you might actually wind up increasing the level of service you provide to your customers!