Wednesday, May 5, 2010

Managing Users With Puppet from the Command Line

While in the process of deploying Puppet, I had a sudden need to manage some users' passwords. Since I already had the software installed, I thought ralsh (Puppet's resource shell) would be the easiest way to handle the job. I could have scripted a bunch of ssh/awk/sed/etc. commands, but that just didn't seem very robust. Sure, it'd have worked fine back in 1999, but this is Two Thousand Freaking Ten.

Here's the script I wrote.
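
At its core the job comes down to ralsh setting the password attribute on a user resource. Here's a minimal sketch of that kind of invocation, assuming the crypted hash has already been generated (the username and hash below are placeholders, and on Linux the user provider typically needs the ruby-shadow library to manage passwords):

# Placeholder username and hash -- substitute your own crypted value
ralsh user jdoe ensure=present password='$1$examplesalt$examplehash'

# Double-check what Puppet now sees for that user
ralsh user jdoe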

WARNING: Since the password hashes end up on the command line, you'll need to disable history or otherwise remove the shell history file after logging back in.

Easily force PXE boot in Linux without IPMI

Sometimes I need to PXE boot a box for reinstall and don't have easy access to the BMC. Sometimes I don't want to try to race the BIOS prompt for PXE booting. What can I say, I have a short attention span, er, other things to do while the BIOS posts.

One option I could use is ipmitool, but I like to just overwrite the MBR with dd.
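
For the record, when the BMC is reachable, the ipmitool route looks roughly like this (the BMC hostname and credentials are placeholders):

# Request PXE on the next boot, then power-cycle the box
ipmitool -I lanplus -H bmc.example.com -U admin -P secret chassis bootdev pxe
ipmitool -I lanplus -H bmc.example.com -U admin -P secret power cycle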

Here's the command I use. With no valid boot record left on the disk, the next reboot falls through the BIOS boot order to the network (assuming PXE boot is enabled):

dd if=/dev/zero of=/dev/sda bs=512 count=1

Tuesday, April 27, 2010

How to Calculate CIDR, Netmask, etc. For Reals.

I've always been kind of bad with math (yeah I know). I mean, I'm good at stuff like 22*15. (22*10)+((20*5)+(2*5))=330. I can do that in my head.

But when I was starting out in my career, binary math seemed so overwhelming. So I didn't learn it. By the time I got to a point where I needed to calculate CIDR addressing, I had a wonderful tool called the Internet search engine.

So I've been a sysadmin for a long time now, I guess. A sysadmin who could not perform binary math.

No more. Tonight I finally sat my ass down and figured it out. I tried searching for explanations online but found very little practical math. A lot of it seemed really complicated and still overwhelming. After banging my head and pulling my hair (simultaneously, too--what an accomplishment in itself!), I finally had that epiphany and sorted it out in like five minutes.

Because interview questions always seem to "base" (Hurrrr, I made a pun.) on the network bits, that's all I am going to address (Look Ma, another one!) in this post.

Here's how to do it:

Given a block /N, find the number of usable addresses and the netmask.

Each octet of an IPv4 address consists of two sets of 4 bits, so the full 32-bit address looks like this:

0000 0000 0000 0000 0000 0000 0000 0000

so take N/4 to find how many of those 4-bit sets belong to the network portion of the address:

/23 => 23/4=5.75

Fill in 5.75 of those sets with 1s (five full sets, then the first three bits of the sixth):

1111 1111 1111 1111 1111 1110 0000 0000

To find the number of usable addresses, calculate 2^(number of host bits) - 2. The 2 you subtract accounts for the network and broadcast addresses, which can't be assigned to hosts.

2^9-2=510 usable addresses

Now let's find the netmask. This is the trickiest "bit": section up the octets and find the one that is neither all 1s nor all 0s--in this case it's the 3rd octet. Add up the values of its host (0) bits, subtract that from 255, and you have that octet of the netmask. All prior octets are 255 (1111 1111) and any subsequent octets are 0 (0000 0000).

255-(2^0)=254

Netmask: 255.255.254.0

Obviously this is the second easiest possible netmask to calculate (the easiest being 0). Let's try a harder block: /17.

/17 => 17/4=4.25

1111 1111 1111 1111 1000 0000 0000 0000

2^15-2=32766 usable addresses

3rd octet is "1000 0000", so add up the host-bit values from right to left:
255-(2^0+2^1+2^2+2^3+2^4+2^5+2^6) = 128

Netmask: 255.255.128.0
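
If you'd rather let the computer check your work, bash arithmetic can do the same calculation. A quick sketch (just change the prefix):

# Usable host addresses for a /prefix block
prefix=17
echo "usable addresses: $(( 2 ** (32 - prefix) - 2 ))"

# Build the 32-bit mask and print it as dotted quads
mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
printf 'netmask: %d.%d.%d.%d\n' \
  $(( (mask >> 24) & 255 )) $(( (mask >> 16) & 255 )) \
  $(( (mask >> 8) & 255 )) $(( mask & 255 ))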

I hope you find this useful. I feel it explains things in a much more straightforward way than most of the resources I've found online.

Monday, March 1, 2010

Dev is a Production Environment Too!

(Note: I originally made this post on my employer's tech blog.)

"Prod is important--we have to keep it up!"

Yes, it’s of utmost importance. Is it the only important environment you support as a (Web) Operations team? How much attention do you devote to your Development and Staging/QA environments? What is the value you place in them?

The Benefits of Well Maintained Environments

Giving equal attention to the quality of your ancillary environments can be quite advantageous. Even with a solid process for promoting software upgrades from Dev through Staging to Production, you can still run into unnecessary, costly issues.

Instead of having to debug oddities caused by misconfigured software, engineers could have additional time to focus on ensuring the operability of their apps. Wouldn't it be great if you could increase the productivity of your users (developers), who then use that increase as a catalyst to give you better tools and APIs? Suddenly both groups exist in a symbiotic love-bubble. Amazing.

Living in an Agile World

It can sometimes be difficult for Ops folk to find their way in a scene where everyone around them is pushing for faster turnaround. We're trying to keep production as stable as possible while keeping up with requests for multiple releases throughout the day. Engineers need a bunch of new software installed for the next version of their Whiz-Bang app, but how can we ever find the time to get it all installed in multiple environments and test its security and stability?

Fortunately, there are a few possible solutions. The most obvious is using a configuration management tool such as Puppet (there are others, but Puppet is a personal favorite). Puppet is the perfect tool for an Agile Sysadmin living in an Agile World. It allows you to define specific resources on a node (host) in a declarative manner. Need to upgrade Ruby or the JVM on a box? Easy, just push a new manifest.
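
As a quick sketch of what that declarative style looks like (the package names and versions below are placeholders, and in practice the change would live in a versioned manifest rather than a one-off command), Puppet's resource shell can apply the same idea straight from the command line:

# Declare the desired state; Puppet figures out how to get there
ralsh package ruby ensure=latest

# Or pin an exact version (placeholder version string)
ralsh package jdk ensure='1.6.0_20'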

Another option is to require that supporting software be bundled directly with the in-house application. This is pretty ingenious, because it shifts the burden of configuration to the software engineer. It also enforces a strict version dependency throughout the development life cycle; without it, the affinity between the in-house app and its supporting software versions can easily get lost as the app gets pushed from Dev to QA and Staging to Prod.

There are a couple of caveats to this method, however. For example, in relinquishing some control of configuration to Engineering, the two groups must communicate more diligently to ensure the software is functioning properly. Possible security vulnerabilities in the supporting software must also be considered: who is responsible for upgrading it if it's checked into the source control repository instead of being installed directly on the host as part of the provisioning and/or release process?

Regardless of the method you choose, you will end up with a more reliable, easier-to-maintain set of environments. This will be a big win for everyone -- Engineering, QA, Ops, the whole company.

Tuesday, February 23, 2010

Release-ban: Leveraging Kanban to Manage Your Releases

The process of releasing software to production can be quite heavy and arduous. Even with the concepts, tools, and APIs available to us today, there are usually a number of steps required just to get a build copied out to a set of hosts: you need to run tests, stage the release, and so on. If you think about it, the release process itself resembles an assembly line.

What if we took that a step further and actually turned it into a real assembly line? Using Kanban, your organization could then easily visualize bottlenecks as a piece of software progresses through the stages of a release.

Having issues with a pre-release step such as integration testing? You'll know because a bunch of releases will be stuck in the Release Backlog.

Is it too tedious to actually push the software to production? You'll know because a bunch of releases will be stuck in the Staging state.

How does this work, exactly?

Let me walk through an example Kanban board designed for releases.


We have a complex environment consisting of four development teams that need to perform releases on their own schedule. The initial layout gives us a Release Backlog and three states in which a release can exist: Pre-release, Staging, and Production. Pre-release covers any configuration prep, QA testing, etc.; the other two should be self-explanatory.

Here's where Kanban really shines. Using WIP (Work In Progress) limits, we can enact constraints at specific points of the release process.

Say we put a WIP limit of three on the "Staging" state: only three pieces of software can be in Staging at any given point, and a team cannot begin staging a fourth release until at least one of the three has moved on. A similar limit can and should be implemented in the "Production" state.

If there are issues in the release process, they will cause a backup of items in the prior states and/or the backlog itself. Visualizing the states of all current and upcoming releases on the Kanban board makes it much more obvious where the bottleneck exists.

One of the great things about Kanban is that it allows a lot of flexibility. You can define a WIP limit to be virtually anything. For instance, you could weight releases based on the length of time they take. Or you could weight them based on the relative impact to your site if that software's release went badly.

There are some Kanban specifics that I didn't detail because they are a bit out of the scope of this blog post. If you find the concept interesting I urge you to look into it.

Friday, February 5, 2010

What Does Monitoring Mean To You?

Everyone uses some sort of monitoring. Networking, service, host, we do it all. It's good. We need it. How else would we know if there was a problem?

I'm sure we could all agree that our monitoring could always be improved. Maybe we should tweak the alert escalations a bit. Maybe we should stop sending pages for a nagging bad connection that doesn't really matter--whatever.

Key Performance Indicators and S/N Ratios

Are those modifications and tweaks really that important to the service level the end user experiences? Are they directly related to your Key Performance Indicators (KPIs)? Do they cause a quantifiable impact upon the traffic to your site? Even then, is that impact directly affecting your bottom line?

If not, should you still continue to monitor them? Maybe so. But if you do, it's best to ensure they don't decrease the signal-to-noise ratio of your alerting. Otherwise, the alerts that actually matter end up buried in the noise.

KPIs can easily be underestimated and misunderstood. A KPI is not how many hits your site gets over a period of time; while that may be important, it's not a KPI. KPIs are usually subjective--simple examples might be how many conversions you get, or how many new users sign up from invitations.

These are the end-result metrics that should be watched like a hawk. If you monitor other aspects of your network (and you probably should), make sure you understand how these affect your KPIs.

Alert Thresholds

One common monitoring pitfall is misunderstanding how best to monitor rate-based metrics. If there is a need to monitor bandwidth usage, incoming requests, disk usage, etc., the typical approach is to define a static watermark and alert above/below it.

Is that truly the best method, though? How is knowing that your RAID volume is at 80% now going to help you six months from now? What you probably really want to know is the rate of increase in disk usage. If you were Twitter and you wanted to monitor the rate of new tweets, wouldn't you want to know if it suddenly decreased over the past five minutes?
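
As a rough illustration of the rate idea, here's a hedged sketch of a cron-able check that alerts on disk-usage growth rather than a static watermark (the mount point, state file, and threshold are all placeholders):

#!/bin/bash
# Sketch: page on the rate of disk growth, not an absolute percentage
STATE=/var/tmp/disk_usage.last      # where the previous sample lives
THRESHOLD_GB_PER_HOUR=5             # placeholder alert threshold

used_kb=$(df -Pk /data | awk 'NR==2 {print $3}')
now=$(date +%s)

if [ -f "$STATE" ]; then
  read -r prev_kb prev_ts < "$STATE"
  elapsed=$(( now - prev_ts ))
  if [ "$elapsed" -gt 0 ]; then
    # growth since the last sample, extrapolated to GB per hour
    rate_gb=$(( (used_kb - prev_kb) * 3600 / elapsed / 1024 / 1024 ))
    if [ "$rate_gb" -ge "$THRESHOLD_GB_PER_HOUR" ]; then
      echo "WARNING: /data growing at ~${rate_gb} GB/hour"
    fi
  fi
fi

echo "$used_kb $now" > "$STATE"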

Putting It All Together

Make sure you use these ideas to enhance your monitoring systems. Without monitoring your database load, or the number of threads used on an app server, you really won't know what's causing a sudden drop in activity on your site or app. A holistic approach is necessary to provide a complete view of the health of your network.

Think about how you can apply these ideas to your own monitoring. Applying principles like these will provide multiple benefits: Not only will you probably get paged less for non-critical problems, but you might actually wind up increasing the level of service you provide to your customers!

Thursday, January 14, 2010

Operations Anti-Patterns

An anti-pattern, as defined by Wikipedia, is a "design pattern that may be commonly used but is ineffective and/or counterproductive in practice." The term comes up most often in software engineering, but it can be applied in other areas as well.

In an attempt to come up with ways to evolve some of the tasks we practice as everyday sysadmins, I thought it might be good to start by defining some Operations anti-patterns.

(Obviously this is nowhere near a comprehensive list. I hope to post more soon.)
  1. Information overload: When an admin creates a cronjob to perform an automated task but doesn't take steps to ensure unnecessary output is discarded (see the sketch after this list).
  2. The rat's nest: One could draw a link between a sysadmin's tidiness in the cage and the way they maintain their systems, both software and hardware. A rat's nest almost guarantees that other aspects of their work are just as disorderly, causing lost productivity (or, in some cases, extended downtime).
  3. Set it and forget it: Setting up a new piece of software or hardware without proper documentation of both its implementation (how and why it exists) and its operational aspects (maintenance and support).
  4. Non-communicado: Ancillary to "set it and forget it", this happens when an admin sets up new system monitors without telling the rest of their team. It could also refer to cron jobs without comments describing them, etc.
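
For the first item, the fix is usually a one-line change in the crontab. A hedged sketch (the script path and schedule are placeholders):

# Noisy: every run mails its routine output to the cron owner
0 3 * * * /usr/local/bin/nightly-cleanup.sh

# Quieter: discard routine stdout, but let errors on stderr still generate mail
0 3 * * * /usr/local/bin/nightly-cleanup.sh > /dev/null

# Silent: discard everything (only do this if failures are caught elsewhere)
0 3 * * * /usr/local/bin/nightly-cleanup.sh > /dev/null 2>&1
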
Just as with software development, evidence of these anti-patterns does not necessarily characterize a bad sysadmin. Good sysadmins display them as well. But the best, in my opinion, strive to overcome them in their daily work.