Thursday, November 15, 2007

Assessing the New Data Center Metrics

I've been reading up on the work that the Green Grid, the Uptime Institute and others have been doing to define metrics around data center efficiency. The work is good but, in my mind, misses the mark slightly. All of the metrics I've seen thus far are static - that is, they assume some steady-state aspect of the data center... steady compute loads, a steady quantity of servers, etc. But that's not how the world works.

Even Detroit knows that autos get different efficiencies depending on how & where they're driven... so the metric called "mileage" is actually measured & documented twice -- once for city, once for highway. Data centers need something akin to this as well.

Why? Because IT departments operate at greatly different levels: peak (maybe during the day) as well as off-peak (perhaps nights/weekends). Ideally, the data center should know how to adapt to these conditions: re-purposing "live" machines during peak hours; retiring and temporarily shutting down idle servers during off-peak; removing power conditioning equipment when it's not needed; turning off specific CRAC units and chillers when they're not required (e.g. cold days and/or off-peak hours). We need an efficiency metric that indicates how data centers operate dynamically.

Anyway, here's a quick survey of the metrics I did find, and what I'd like to see:

The Green Grid on metrics:
  1. Data Center Infrastructure Efficiency,
    DCiE = (IT equipment power)/(total facility power).

    This is supposed to be a quick ratio showing how much power gets to the servers, versus how much is consumed elsewhere by power distribution, cooling, lighting, etc. Driving this ratio up means you have less overhead wasting watts. This wouldn't be too bad a metric if it were used and monitored 24x7, i.e. at peak and off-peak.
  2. Power Usage Effectiveness,
    PUE = 1/DCiE (just the boring reciprocal)
  3. Data Center Productivity, (a metric to be adopted in the future)
    DCP = (useful computing work)/(total facility power)

    In theory, this is a great metric: it's like asking "how many MIPS per watt can you produce?" (BTW, the human brain, the most powerful of all computers, consumes something like 25W.) Anyway, DCP is a contentious metric... because each computing vendor wants to define "useful computing work" in a way that favors its own hardware. Frankly, this is most useful for measuring efficiency at the server level.
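As a quick illustration of why a single static DCiE number can mislead, here's a small Python sketch that computes DCiE and PUE at both peak and off-peak. All the meter readings are hypothetical, made up purely for illustration:

```python
# Green Grid metrics, sampled twice -- a "city/highway" style reading.
# All kW figures below are hypothetical.

def dcie(it_power_kw, total_facility_power_kw):
    """DCiE = (IT equipment power) / (total facility power)."""
    return it_power_kw / total_facility_power_kw

def pue(it_power_kw, total_facility_power_kw):
    """PUE = (total facility power) / (IT equipment power) -- reciprocal of DCiE."""
    return total_facility_power_kw / it_power_kw

# Hypothetical meter readings, in kW.
readings = {
    "peak":     {"it": 800.0, "total": 1400.0},
    "off_peak": {"it": 300.0, "total":  900.0},  # overhead barely scales down
}

for period, r in readings.items():
    print(f"{period:8s}: DCiE = {dcie(r['it'], r['total']):.2f}, "
          f"PUE = {pue(r['it'], r['total']):.2f}")
```

With these made-up numbers, DCiE falls from ~0.57 at peak to ~0.33 off-peak, because the cooling and power-handling overhead doesn't shrink along with the IT load -- exactly the behavior a once-a-year static measurement would hide.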
The Uptime Institute

In an excellent paper, "Four Metrics Define Data Center 'Greenness'," the Uptime Institute discusses these:
  1. Site Infrastructure Energy Efficiency Ratio:
    SI-EER, which the Institute is currently working to re-cast in more intuitive and technically accurate terms. I suspect this is much like the Green Grid's DCiE, above.
  2. Site Infrastructure Power Overhead Multiplier, essentially the same metric as the Green Grid's PUE, above:
    SI-POM = (data center power consumption at the meter)/(total power consumption at the plug for IT equipment)
  3. Deployed Hardware Utilization Ratio:
    DH-UR = (qty of servers running live applications)/(total number of servers actually deployed)

    This speaks to the real-time utilization of hardware, and IMHO is one of the best metrics for a dynamic data center. It points to how many deployed servers are actually doing work, vs. those that are sitting "comatose". A very promising metric if it's used in conjunction with equipment that constantly optimizes how many servers are "on", shutting down idled servers and driving this ratio toward 1.
  4. Deployed Hardware Utilization Efficiency
    DH-UE = (minimum qty of servers needed to handle peak load)/(total number of servers deployed)

    This is another great metric - it speaks to the capital efficiency of hardware: how many servers need to be provisioned and on the floor, relative to how many are actively being used.
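Both ratios are simple arithmetic; here's a sketch with hypothetical server counts to show what the numbers mean in practice:

```python
# Uptime Institute hardware-utilization metrics. Counts are hypothetical.

def dh_ur(servers_running_live_apps, total_servers_deployed):
    """DH-UR: fraction of deployed servers doing real work (1.0 = no comatose servers)."""
    return servers_running_live_apps / total_servers_deployed

def dh_ue(min_servers_for_peak, total_servers_deployed):
    """DH-UE: capital efficiency -- servers needed at peak vs. servers on the floor."""
    return min_servers_for_peak / total_servers_deployed

# Hypothetical data center: 1000 servers deployed, 700 running live
# applications, and only 600 would be needed to cover the observed peak.
print(f"DH-UR = {dh_ur(700, 1000):.2f}")  # 0.70 -- 300 servers sit comatose
print(f"DH-UE = {dh_ue(600, 1000):.2f}")  # 0.60 -- 40% more capacity than peak requires
```

A DH-UR of 0.70 says 30% of the powered floor is doing nothing; a DH-UE of 0.60 says the floor is over-provisioned by 40% even against peak demand.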
In my ideal world, I'd like to see two things to get us to a "City/Highway" style approach:
  • A DH-UR that changes dynamically, constantly driven toward 1. This implies that only the required servers are actually powered up and active.
  • An SI-POM that is always driven toward a constant ratio, regardless of compute demand. This implies that, as compute demand falls, servers are retired and other support equipment (power handling, cooling) shuts down as well, keeping the efficiency ratio balanced.
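The control loop those two bullets imply can be sketched in a few lines. Everything here is hypothetical -- server capacity, per-server draw, and the overhead factor are invented to show the shape of the idea, not to model a real facility:

```python
import math

# Hypothetical facility parameters.
SERVER_CAP_UNITS = 10   # work units one server can handle
SERVER_KW = 0.4         # per-server power draw, kW
OVERHEAD_FACTOR = 0.6   # support (cooling, power handling) that tracks IT load
DEPLOYED = 100          # servers on the floor

def plan_hour(demand_units):
    """Power only the servers this hour's demand requires; overhead follows suit."""
    active = min(DEPLOYED, math.ceil(demand_units / SERVER_CAP_UNITS))
    it_kw = active * SERVER_KW
    total_kw = it_kw * (1 + OVERHEAD_FACTOR)  # support gear scales with IT load
    return active, total_kw / it_kw           # SI-POM stays flat

for demand in (850, 400, 120):                # hypothetical hourly work units
    active, si_pom = plan_hour(demand)
    print(f"demand={demand:4d}  servers on={active:3d}  SI-POM={si_pom:.2f}")
```

Because the sketch retires idle servers and assumes the support equipment scales down with them, SI-POM holds at a constant 1.60 whether demand is high or low -- the "constant ratio regardless of compute demand" behavior described above. A real facility's overhead has fixed components, which is exactly why this takes coordinated control of servers, CRACs and chillers rather than falling out for free.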
I look forward to conversations with the Green Grid, the Uptime Institute and the EPA to consider these tweaks to their already fine work.




3 comments:

bcinque said...

Ken,

Your views on DH-UR are interesting. I find them analogous to MAID in the storage world. MAID is getting some traction, slowly via solutions from Copan and others.

From the systems perspective - how do you think we can get a balance of optimization vs. RAS? Is it feasible in the future to predict failures and shift load dynamically to another idle node, all within the realm of the applications' SLAs? Or better yet, seamlessly to the app or even the end user/consumer?

Interesting blog. Thanks!

brian

Dave Ohara said...

Ken, I like what you wrote, and I incorporated it in http://www.greenm3.com/2007/12/dynamic-power-u.html . I plan on writing on exactly this subject of dynamic PUE. You can contact me at dave(a)greenm3.com if you are interested in an early draft.

Simon Rohrich said...

Thank you for this. I have been trying to communicate these metrics to the pencil pushers. This helps. We have a unique data center that is 80% recyclable, small footprint and extremely energy efficient.