Monday, February 23, 2009

Correcting computing's wrongs - road to recovery?

As brilliant as the first microcomputer architects were, a few of their early design decisions have, through the law of unintended consequences, seriously hamstrung enterprise computing for years. But the industry is about to get out from under them in a big way.

We're about to hear lots about "Infrastructure Orchestration," by virtue of Cisco's anticipated entry into the blade market with its "unified computing" strategy. The principle has been known as the "computing fabric," first conceived by Vern Brownell, then-CTO of Goldman Sachs, and later productized by Egenera.

Fundamentally, the concept abstracts away a server's I/O, disk, storage connectivity, and out-of-band controls, making the server a stateless entity. The result is a server with considerably more flexibility (e.g. the ability to be re-purposed) and a significant simplification in how groups of these servers are managed.
Just wait 'til this catches on.

A bit of history: How did we get here?
In the early era of the PC, a number of new technologies arose: in particular, the IP network, which allowed the CPU to talk to other machines, and external/networked storage, which externalized (or removed) the dedicated hard drive. Both of these technologies immediately meant additional hardware on the motherboard: the Network Interface Card (NIC) for connection to Ethernet and the like, and the Host Bus Adaptor (HBA) for connection to storage. Later came another bit of hardware, the on-board controller, which helped monitor and control "out-of-band" aspects of the server such as power, temperature and performance; this, too, had its own equivalent of a NIC. These pieces of hardware were sometimes incorporated into the motherboard itself, and sometimes added as plug-in cards.

But each new technology came at an (unwitting) price: each became tightly bound to the hardware and software. Each had a software driver, usually tied to the O/S. And each usually had its own form of addressing -- an IP and MAC address for the NIC, and usually a Worldwide Name for the HBA. Often, the NIC and the controller were literally part of the motherboard itself.

The result: servers, their O/S, and sometimes even their applications were tightly tied to their I/O. Changes to the network or storage meant changing I/O configurations. Changes to the server meant re-defining addresses as well. Every time a physical server had to be configured (or re-configured), the NIC, the HBA and even the controller's IP address had to be configured too. (And if the server was on a separate network, external switches had to be configured as well.)
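
To make the coupling concrete, here's a rough sketch (in Python, with purely made-up names and addresses) of the state that was effectively welded to a single physical box under this model:

from dataclasses import dataclass

@dataclass
class TraditionalServer:
    # Everything below is bound to one specific piece of hardware.
    hostname: str
    nic_mac: str          # burned into the NIC on the motherboard
    nic_ip: str           # assigned and tracked by the networking group
    hba_wwn: str          # Worldwide Name tied to the physical HBA
    controller_ip: str    # out-of-band management controller address
    switch_ports: list    # external LAN/SAN switch ports that must match the above

db_server = TraditionalServer(
    hostname="erp-db-01",
    nic_mac="00:1a:2b:3c:4d:5e",
    nic_ip="10.0.12.21",
    hba_wwn="50:06:01:60:3b:20:1f:9a",
    controller_ip="10.0.99.21",
    switch_ports=["eth-sw3/port12", "fc-sw1/port4"],
)
# Re-purposing or replacing this box means re-doing every one of these fields,
# usually across several different operational teams.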

This all made for an operations nightmare. Application owners had to work with the O/S owners, who in turn needed a process to work with the storage and networking groups. No wonder operational spending is rising.

An alternative model.
Vern Brownell (and others) recognized the source of this complexity and asked whether the compute resources (CPU, memory, etc.) could be completely dissociated, or abstracted, from the I/O.

In essence, the compute resource would be stateless -- agnostic to the software it ran, and agnostic to the I/O it was connected to. The I/O would be "virtualized" into a logical (rather than physical) connection... which meant that addressing and naming could be provisioned or changed in software.
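
Here's a rough sketch of what that decoupling might look like -- hypothetical names only, not any vendor's actual API -- where the addressing lives in a logical profile rather than in the hardware, and binding a profile to a blade is purely a software operation:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LogicalServerProfile:
    # The "identity" of a server, provisioned entirely in software.
    name: str
    virtual_mac: str              # follows the profile, not the NIC
    virtual_wwn: str              # so SAN zoning never has to change
    boot_lun: str                 # the profile boots from shared storage
    networks: list = field(default_factory=list)

@dataclass
class StatelessBlade:
    # A bare compute resource: CPU and memory, no identity of its own.
    slot: int
    cpu_type: str
    memory_gb: int
    assigned_profile: Optional[LogicalServerProfile] = None

def bind(profile: LogicalServerProfile, blade: StatelessBlade) -> None:
    """Attach a logical server to a physical blade -- no cabling,
    no re-addressing, just a software operation."""
    blade.assigned_profile = profile

The fabric's switches then present the profile's virtual MAC and WWN to the LAN and SAN on whichever blade the profile happens to be bound to.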

Further, the physical I/O and network could be collapsed and unified: a single wire could carry all signals, and a set of switches could create custom (or private) connections between servers, or from servers out to an external network and storage. Hence the term "computing fabric" was born.

This concept was first productized around 2001 by Egenera, in the form of its BladeFrame hardware and PAN Manager (short for Processing Area Network) software, and recently extended to Dell hardware as well. The analogy to a SAN is clear: an abstracted, centrally-managed set of CPUs rather than an abstracted set of disks. Just as LUNs are mapped to physical drives, logical nodes are mapped to physical (or virtual) CPUs.


Properties of the "compute fabric," a.k.a. Infrastructure Orchestration (a.k.a. unified computing)
Once a set of servers is part of this compute fabric, a number of very elegant properties arise. Chief among them: any CPU can be re-purposed to handle just about any workload (assuming the CPU is compatible and the memory is sufficient). Issues having to do with I/O, storage connectivity, etc. simply evaporate.

So, for example, if a server running a native O/S were to fail, another "bare metal" server could instantly be assigned all of the properties of the original server, connected to the failed machine's network, and then connected to the failed machine's shared storage. Presto - instant High Availability (HA).
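
In terms of the hypothetical sketch above, the HA case is nothing more than re-binding the failed server's logical profile to a spare blade:

def fail_over(failed: StatelessBlade, spare: StatelessBlade) -> None:
    """Move a logical server from a failed blade to a bare-metal spare."""
    profile = failed.assigned_profile
    failed.assigned_profile = None      # fence off the failed hardware
    spare.assigned_profile = profile    # the spare boots as the original server
    # The spare now answers on the same virtual MAC and WWN and boots from the
    # same shared-storage LUN; the O/S and applications see "their" server.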

Next, extend this example to a group of servers (plus their networks and switches). Should they all fail, such as in a disaster, the entire configuration -- down to each server's I/O, networks, VLANs, etc. -- can be re-created in a separate location on "cold," bare (unprovisioned) hardware. Presto - instant Disaster Recovery (DR). All this assumes mirrored SAN storage, of course.
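
And DR, continuing the same hypothetical sketch, is just that operation replayed across the whole configuration onto a cold pool of blades at the remote site (again assuming the SAN is mirrored there):

def recover_site(profiles: list, cold_pool: list) -> None:
    """Re-create every logical server on unprovisioned hardware at a DR site."""
    for profile, blade in zip(profiles, cold_pool):
        blade.assigned_profile = profile   # same MACs, WWNs, VLANs, boot LUNs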

So what, you might still say? Well, consider if the "native O/S" in the example above were really a VMware ESX host. That means an entire host configuration (down to the VMs) could be re-created elsewhere without having to re-provision the hosts themselves onto a physical piece of hardware. Neat, especially if you currently find yourself having to duplicate hosts, hardware configurations and networks for your virtual failover sites. Not very "virtual," are they?

Now, finally, consider a mixed environment -- with native O/Ss as well as VM hosts (e.g. an SAP installation where some servers are virtual, but the databases run natively). Complete HA and DR could be provided to the entire environment. At once. Cool.


Where we're headed

So if you think about it: if the original CPU motherboards and servers *hadn't* been equipped with stateful peripherals like NICs and HBAs, much of the complexity we deal with in data centers would be obviated. Instead, we would take for granted that just about any workload could run just about anywhere, with the assurance that other hardware could pick up if the original failed. We would have "virtual hardware" the same way we have virtual software.

And there's the point: fabric computing - infrastructure orchestration, unified computing - is actually the ideal complement to any virtual (or physical) infrastructure.

No wonder we'll see and hear more about this in the near future. Hardware vendors (Egenera, HP, IBM, Dell) are already doing it, and Cisco is about to. And what of VMware or Citrix?
