Tuesday, November 27, 2007

Postcards from the 2007 Gartner Data Center Conference

I attended Day #1 of the Gartner Data Center conference here in Las Vegas today - after making the strategic error of being dropped off at the MGM Grand lobby, and having to walk what must have been 3/4 mile to the conference center...

Thomas Bittman opened the AM with a keynote on the Future of Data Center Operations. It gave pretty broad coverage of the state of DC Ops today. He had at least one memorable interjection -- what seemed like a warning to equipment vendors who have strangle-holds over customers, strongly urging customers to reject platform-specific IT technologies. He also predicted the emergence of the "meta-O/S" and the "cloud-O/S," which (I think) is a re-packaging of Gartner's Real Time Infrastructure (RTI) story -- and he said that the meta-O/S had to be platform- and vendor-neutral. This was the first time I've heard Gartner pay specific attention to the emergence & legitimacy of cloud computing (and the "O/S" to run it).

Next, Donna Scott gave an equally broad-ranging talk on IT Operations Management. Once again, she conducted her now 5+ year-old survey of IT's biggest pressures, and once again "high rate of change," "cost containment" and "maintaining availability" took top honors as the largest ulcer-producing pressures facing CIOs. Also true to form, she reiterated that a shared infrastructure (RTI) is inevitable, breaking down the islands of technology in large data centers.


There were also some interesting vendor break-out sessions; take, for example, a session on managing power and cooling from Emerson Network Power's Greg Ratcliff. The trend here is also toward an intelligent monitoring infrastructure. He spoke of the localized cooling (even within the rack) needed as rack power density increases, with repeated references to "adaptive cooling" and "adaptive power" -- again implying that efficiencies in large data centers can only be achieved through better use of technology, rather than by throwing raw horsepower at the heat/power problem.


Finally, one last surprising (to me) datapoint: the general audience was asked who was using virtualization in production - and 1/2 to 2/3 of the audience raised their hands. This definitely drove home the point that VMs are (and will be) everywhere. However, I combine this observation with the earlier point that data centers will need a management layer, an "O/S", which is vendor-neutral. At the moment, I don't see any of the existing large vendors stepping up to fill this virtualization management need any time soon.









Thursday, November 15, 2007

Assessing the New Data Center Metrics

I've been reading up on work that the Green Grid, Uptime Institute and others have been doing to define metrics around data center efficiency. The work is good but, in my mind, misses the mark slightly. All of the metrics I've seen thus far are static -- that is, they assume some steady-state aspect of the data center: steady compute loads, steady quantity of servers, etc. But that's not how the world works.

Even Detroit knows that autos get different efficiencies based on how & where they're driven... so the metric called "mileage" is actually measured & documented twice -- once for City, once for Highway. Data centers need something akin to this as well.

Why? Because IT departments operate at greatly different levels: peak (maybe during the day) as well as off-peak (perhaps nights/weekends). Ideally, the data center should know how to adapt to these conditions: re-purposing "live" machines during peak hours; retiring and temporarily shutting down idle servers during off-peak; removing power conditioning equipment when not needed; turning off specific CRAC units and chillers when not required (i.e. cold days and/or off-peak hours). We need an efficiency metric that indicates how data centers operate dynamically.

Anyway, here's a quick survey course in what metrics I did find, and what I'd like to see:

The Green Grid on metrics:
  1. Data Center Infrastructure Efficiency,
    DCiE = (IT equipment power)/(total facility power).

    This is supposed to be a quick ratio showing how much power gets to servers, versus how much else is consumed by power distribution, cooling, lighting, etc. Driving this ratio up means you have less overhead wasting Watts. This wouldn't be too bad a metric if it were used and monitored 24x7, i.e. peak and off-peak.
  2. Power Usage Effectiveness,
    PUE = 1/DCiE (just the boring reciprocal)
  3. Data Center Productivity, (a metric to be adopted in the future)
    DCP = (useful computing work)/(total facility power)

    In theory, this is a great metric: it's like asking "how many MIPS per Watt can you produce?" (BTW, the human brain, the most powerful of all computers, consumes something like 25W.) Anyway, DCP is a contentious metric, because each computing vendor wants to define "useful computing work" in its own (preferential) way. Frankly, this is most useful for measuring efficiency at the server level.
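To make the peak/off-peak point concrete, here's a minimal sketch of the Green Grid ratios computed at two times of day. The power readings are invented for illustration; the formulas are the ones defined above.

```python
# Illustrative DCiE/PUE calculation -- the kW readings below are made up,
# but they show why a single static measurement hides the off-peak story.

def dcie(it_power_kw, total_facility_kw):
    """Data Center Infrastructure Efficiency: fraction of total facility
    power that actually reaches the IT equipment."""
    return it_power_kw / total_facility_kw

def pue(it_power_kw, total_facility_kw):
    """Power Usage Effectiveness: just the reciprocal of DCiE."""
    return total_facility_kw / it_power_kw

# Hypothetical readings. Overhead (cooling, power conditioning, lighting)
# stays at 500 kW around the clock while the IT load drops at night,
# so the ratio quietly worsens off-peak:
peak_dcie = dcie(500.0, 1000.0)      # 0.5  -> PUE of 2.0
off_peak_dcie = dcie(250.0, 750.0)   # ~0.33 -> PUE of 3.0
```

Measured once, at peak, this facility looks like a respectable PUE 2.0 shop; monitored 24x7, it's a PUE 3.0 shop every night.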
The Uptime Institute

In an excellent paper, "Four Metrics Define Data Center Greenness," the Uptime Institute discusses these:
  1. Site Infrastructure Energy Efficiency Ratio:
    SI-EER, which the Institute is currently working to re-cast in more intuitive and technically accurate terms. I suspect this is much like the Green Grid's DCiE, above.
  2. Site Infrastructure Power Overhead Multiplier:
    SI-POM = (data center power consumption at the meter)/(total power consumption at the plug for IT equipment)

    This is essentially the same metric as the Green Grid's PUE, above.
  3. Deployed Hardware Utilization Ratio:
    DH-UR = (qty of servers running live applications)/(total number of servers actually deployed)

    This speaks to the real-time utilization of hardware, and IMHO is one of the best metrics for a dynamic data center. It points to how many deployed servers are actually doing work, vs. those that are sitting "comatose". A very promising metric if it's used in conjunction with equipment that constantly optimizes how many servers are "on", shutting down idled servers and constantly driving this ratio toward 1.
  4. Deployed Hardware Utilization Efficiency
    DH-UE = (minimum qty of servers needed to handle peak load)/(total number of servers deployed)

    This is another great metric - it speaks to the capital efficiency of hardware - how many need to be provisioned and on the floor, relative to how many are being used actively.
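The two Uptime hardware-utilization metrics are simple ratios, so a small sketch makes them concrete. The server counts below are hypothetical, chosen only to show what each ratio says about a data center floor.

```python
# Hypothetical illustration of the Uptime Institute's DH-UR and DH-UE
# ratios as defined above. Counts are invented for the example.

def dh_ur(servers_running_live_apps, total_servers_deployed):
    """Deployed Hardware Utilization Ratio: what share of deployed
    servers is actually doing work (vs. sitting 'comatose')."""
    return servers_running_live_apps / total_servers_deployed

def dh_ue(min_servers_for_peak_load, total_servers_deployed):
    """Deployed Hardware Utilization Efficiency: capital efficiency --
    how many servers peak load really needs vs. how many are on the floor."""
    return min_servers_for_peak_load / total_servers_deployed

# A floor with 1,000 deployed servers, 700 running live apps,
# where 800 would be enough to cover peak load:
print(dh_ur(700, 1000))  # 0.7 -- 30% of the floor is comatose
print(dh_ue(800, 1000))  # 0.8 -- 20% of the capital is overprovisioned
```

Both ratios are best when driven toward 1; the gap below 1 is, respectively, wasted Watts and wasted capital.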
In my ideal world, I'd like to see two things to get us to a "City/Highway" style approach:
  • A DH-UR that changes dynamically, constantly being driven toward 1. This implies that only required servers are actually powered up and active.
  • An SI-POM that is always driven toward a constant ratio, regardless of compute demand. This implies that, as compute demand falls, servers are retired and other support equipment (power handling, cooling) also shuts down, keeping the efficiency ratio balanced.
I look forward to conversations with the Green Grid, Uptime Institute and EPA to consider these tweaks to their already fine work.




Wednesday, November 7, 2007

CIO Dialogue - Notes from the Real World

I felt the need to share the following conversation I recently had with the VP of Enterprise Operations for a major healthcare provider. On one hand, the conversation sounded like every stereotype I've heard in trade rags... except it's true. So read this, but be sure to get to the punch line at the end.

He's been in his job for 18 months, and is just now getting his hands around turning the battleship -- one which, I might add, "owns one of every imaginable platform and software type" and has perhaps 3,000+ apps on 12,000-15,000 servers, of which maybe 30%-40% is development. He's got lots of AIX and lots of Sun, but ultimately a mix of other vendors too.

When asked exactly what he owns, he says he doesn't really know... but they're planning a CMDB project soon. Also, they're quickly running out of data center space, and are pushing 95% of maximum UPS power in most locations. He's thrown down the gauntlet and halted all new server purchases -- in favor of initiating a virtualization project (which, I might add, is getting upwards of 20:1 consolidation, although he knows that high ratio won't last). He's a risk-taker because he has to be.

So I asked him point-blank: what does he need to make this work? Without a flinch (or a smile) he said "Process and Automation." Process, a la ITIL, and automation -- both of the Run-Book style, as well as the operational style. "If I could have the automation vision that IBM was hawking a few years ago, I'd be thrilled. But it's still vapor."

The good news is that he's closely teamed with his Facilities manager to help him cope with power, real estate and cooling. The bad news is that the Facilities guy is also at wits' end.

The punchline: This real-life vignette tells me that the traditional IT model is really broken. How come IT -- with all of its computers -- is actually the least automated and efficient arm of the company? I recently read a report from the Uptime Institute which talked about the Economic Meltdown of Moore's Law -- literally, for every $1 of compute asset, it currently costs $1.80 to operate it; by 2009, electricity alone will cost triple ($3) what the box costs. What's wrong with this picture?

I know that my VP friend is not alone. But when will the treadmill of IT-being-slave-to-the-hardware end? I'd like to think that automation, active asset management, and the drive toward greater environmental efficiency will begin to influence vendors and managers alike.


Saturday, November 3, 2007

What the Green Data Center Can Learn from the Prius

With the announcement of Active Power Management, and now the Cassatt Active Response product line, I hope that data center operations will now nudge a little closer to the 21st century.

Here's analogy #1: you're driving and come to a red light; you stop the car, but the engine keeps running. It's wasteful and inefficient, but because it's generally considered too inconvenient to start & stop your engine every time you hit a red light, nobody does it. Enter the Prius: come to a red light, and the engine automatically stops; hit the accelerator, and it starts again. Simple. Automatic. Efficient.

That's the analogy Cassatt is bringing to servers -- if they sit idle, even for an hour a day, they're automatically shut off and re-started when they're needed. For production environments, this might only apply to a few scale-out architectures that are provisioned for busy times of day, but for Development/Test, there are *always* machines that go unused for periods of time. Cassatt's Active Power Management takes care of this automatically. Simple. Automatic. Efficient.
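The red-light policy described above can be sketched in a few lines. To be clear, this is a toy illustration of the general idea -- poll utilization, power a server down after a sustained idle stretch, power it back up on demand -- not Cassatt's actual product logic; the threshold and sample counts are assumptions.

```python
# Toy sketch of an idle-shutdown power policy -- NOT Cassatt's implementation,
# just the "Prius at a red light" idea applied to a server.

IDLE_THRESHOLD = 0.05       # below 5% CPU counts as idle (assumed)
IDLE_PERIODS_TO_SLEEP = 12  # e.g. 12 five-minute samples = 1 idle hour

class Server:
    def __init__(self, name):
        self.name = name
        self.powered_on = True
        self.idle_periods = 0

    def observe(self, cpu_util, demand_pending):
        """Called once per monitoring interval with the latest reading."""
        if not self.powered_on:
            if demand_pending:           # "hit the accelerator"
                self.powered_on = True
                self.idle_periods = 0
            return
        if cpu_util < IDLE_THRESHOLD:
            self.idle_periods += 1
            if self.idle_periods >= IDLE_PERIODS_TO_SLEEP:
                self.powered_on = False  # "engine stops at the red light"
        else:
            self.idle_periods = 0        # real work resets the idle clock

# A dev/test box that sits at ~1% CPU all night gets powered off after
# an hour, then springs back the moment demand shows up:
box = Server("dev-01")  # hypothetical machine name
for _ in range(12):
    box.observe(0.01, demand_pending=False)
# box.powered_on is now False; the next pending request restarts it.
box.observe(0.0, demand_pending=True)
```

A real controller would also handle graceful OS shutdown, wake-on-LAN or IPMI power control, and hysteresis so machines don't flap; the sketch only shows the decision loop.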

Don't just believe me. On Aug. 2, the EPA published a Report to Congress on Server and Data Center Efficiency. A core tenet in the report -- "implement Power Management on 100% of applicable servers" -- was key to the "improved operation" of US data centers.

Oh - and here's analogy #2 (and believe it or not, it's from Detroit as well as Japan): it's called Cylinder Shutdown. Turns out that when you don't need all the engine's horsepower, cylinders within the engine are dynamically shut down. Check out the future Northstar XV12 Caddy engine, as well as the engine in the 2008 Honda Accord.

Turns out, Cassatt technology can do this with IT servers/blades as well! If you have a farm of servers and a few are sitting idle, they're turned off and kept as "bare metal" until some application needs their horsepower. Then they're dynamically re-purposed for whatever application needs them. That's the ultimate in capital efficiency.

Can this really work? Among the customers we've spoken to -- some with development environments pushing 4,000 servers -- actively controlling server power & repurposing can save nearly 50% (that's fifty) of operational costs.

Think of all the cars idling at this very moment, and the amount of fuel they're burning. Now, think of all of the servers in your data centers & labs just sitting there waiting to do something. And think of all the Watts they're chewing.