Thursday, June 5, 2008

Why IT monitoring is costing the industry money

I was giving a presentation to an analyst today, describing how an optimized IT infrastructure is inherently energy efficient.

And then it occurred to me: The entire IT monitoring and reporting sector (those guys who write software that pages you when something goes wrong in your data center) is perpetuating waste.

The software assumes that there's a problem only when service level agreements (SLAs) are too low -- but never when they are too high. So alert storms get triggered when you're under-provisioned. But being over-provisioned is bad too: capital is wasted delivering a service level that's better than the SLA requires. This scenario is probably replayed during every off-peak hour a data center operates.

What you don't measure, you can't manage. And therein lies the waste being perpetuated by IT: it's been implicitly assumed that too much infrastructure is OK.

Actually, what we need is a monitoring and control system that maintains an optimal service level -- not too high, not too low -- and that, when demand changes, automatically adjusts resources to re-optimize the SLAs. That adjustment might include re-allocating or de-allocating hardware, or re-provisioning servers on the fly.
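To make the idea concrete, here is a minimal sketch of a two-sided check, assuming the service level is measured as response time in milliseconds and that capacity is adjusted one server at a time. The names `ServiceBand` and `evaluate`, the slack parameter, and the one-server step are all hypothetical illustrations, not any monitoring product's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceBand:
    """Hypothetical target band for a service-level metric (response time, ms)."""
    target_ms: float  # response time the SLA calls for
    slack_ms: float   # how far below target counts as "better than needed"


def evaluate(band, observed_ms, active_servers, min_servers=1):
    """Return (alert, new_server_count) for one monitoring interval."""
    if observed_ms > band.target_ms:
        # Conventional alarm: service level is too low, so add capacity.
        return "under-provisioned: SLA at risk", active_servers + 1
    if observed_ms < band.target_ms - band.slack_ms and active_servers > min_servers:
        # The missing alarm: service level is better than needed, so shed capacity.
        return "over-provisioned: wasting power", active_servers - 1
    # Inside the band: no alert, no change.
    return None, active_servers


if __name__ == "__main__":
    band = ServiceBand(target_ms=200.0, slack_ms=100.0)
    for observed in (250.0, 150.0, 40.0):
        print(observed, evaluate(band, observed, active_servers=10))
```

The point of the slack band is to avoid flapping: capacity is removed only when the service level is well above what is needed, and never below an assumed minimum server count.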

Just once, I'd like one of my IT friends to get an alarm delivered to his pager that reads "system critically over-provisioned: wasting power."

2 comments:

James Urquhart said...

Ken,

Beautiful. I love both the simplicity and the exactness of your argument. "Always on" alone is a waste, but "Always on at peak capacity" is many times worse. There is nothing more green than optimal data center management.

Looking forward to more insights like this one.

James

Unknown said...

Can't think of a better example of the "if you have a hammer, all you see is nails" phenomenon.
IT monitoring is the cause of energy waste? What else, rising food prices perhaps? I'm sure there is some way we can tie that to IT monitoring. Give me a frigging break!

How wrong can you get? Let's see:
- "The 'entire' IT monitoring sector reports only when SLAs are too low and not when they are too high."
No. Many monitoring tools report what the availability level is, etc., often as a function of time. The alarms mentioned are only a tiny piece of what monitoring tools do, so that corrective action can be taken. Weekly and monthly reports show much more detailed information. Maybe you should ask a monitoring guy what IT monitoring tools do before putting the blame for the next world war on them.
- SLAs too high? Do you know anybody who has one? IT monitoring tools do not determine what the SLA level should be. The business (customers/users) decides that. If you think their service is "too available," take it up with them. Let's see if they agree with you. By the way, which service that you use has too much availability? Better tell them so they can take it down once in a while or make it slower for you. In any case, monitoring tools just "report" what the availability and performance are; they don't make judgments on what they should be. They are the messenger.
- Further, IT monitoring tools actually "enable" energy savings by reporting the utilization of various resources and making it possible to take action to turn systems on or off.

Maybe all you're trying to say is that we should turn off systems that are not in use (as Cassatt does?). Fine suggestion. Going from there to "IT monitoring is costing the industry money" is annoying (removed several other stronger terms), to say the least. IT monitoring may be costing the industry money in some ways, but this ain't it.