Resiliency of Mega Datacenters

Resiliency is Key

Mega datacenters are resilient. Resiliency is not just about functionality and reliability; it is also about being able to continue operating in the face of failures and adversarial actions.

Live and Dynamic

Resilient systems need to be live and dynamic. This means the system understands what is going on at the moment (live) and can respond to such events in an automatic, policy-driven manner (dynamic). Building a perfectly resilient system is impossible. You need to consider the implications of boundary failures and accept those conditions in order to build a practical resilient system.

Software-Defined Fabric Controllers

A mega datacenter's resiliency comes primarily from intelligent orchestration software. Every service provider probably has its own name for such software, since most of these highly scalable and resilient systems are custom built. I will call this generic class of software Software-Defined Fabric Controllers (or SDFC for people who love acronyms).

Analogy with the Human Body

In my last mega datacenter post, I provided a pointer to a free ebook that compares it to a warehouse-sized computer. Today I want to offer another analogy, one that will make sense even to people who do not understand how computers work internally: the human body.

You can think of SDFC as the heart and the brain of the mega datacenter. It is usually well protected, running apart from the other servers inside its own Paxos cluster (for availability). The other vital organs of the mega datacenter include the thousands of servers running application workloads, scores of network leaf and spine switches, and a few other hardware and software services that perform actions important to the operation of the mega datacenter. Power and cooling infrastructure is like the blood vessels that carry the fuel the “body” needs, while network cables and connections are like the nerves that deliver important information across the datacenter.

SDFC creates the Fabric

SDFC is where it all starts. It takes bare-metal servers and lifeless network wires and programs them to behave like a fabric: a logical abstraction of the datacenter on top of which applications can be deployed to run. Getting to a homogeneous fabric abstraction requires a lot of heavy lifting: learning about the hardware, pushing the right software down to different pieces of hardware, getting the right policies and configurations onto them, and making them part of a unified fabric.
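To make that concrete, here is a minimal sketch of what such a fabric-building loop might look like, written in Go. Everything in it (the BareMetalServer and FabricNode types and the discoverHardware, pushImage, and applyPolicy helpers) is a hypothetical placeholder for illustration, not any provider's actual API.

```go
package main

import (
	"fmt"
	"log"
)

// BareMetalServer is a hypothetical stand-in for an unprovisioned machine.
type BareMetalServer struct {
	ID       string
	Model    string
	Firmware string
}

// FabricNode represents a server after it has been folded into the fabric.
type FabricNode struct {
	ServerID string
	Role     string // e.g. "compute", "storage"
	Ready    bool
}

// buildFabric walks every bare-metal server through the same pipeline:
// learn about the hardware, push the right software, apply policy,
// and only then admit it into the unified fabric.
func buildFabric(servers []BareMetalServer) []FabricNode {
	var fabric []FabricNode
	for _, s := range servers {
		role := discoverHardware(s) // learn what we are dealing with
		if err := pushImage(s, role); err != nil {
			log.Printf("skipping %s: %v", s.ID, err)
			continue // a node that fails provisioning never joins the fabric
		}
		applyPolicy(s, role) // network, security, and monitoring configuration
		fabric = append(fabric, FabricNode{ServerID: s.ID, Role: role, Ready: true})
	}
	return fabric
}

// The helpers below are stubs; a real controller would talk to management
// controllers, PXE-boot images, and configuration systems.
func discoverHardware(s BareMetalServer) string       { return "compute" }
func pushImage(s BareMetalServer, role string) error  { return nil }
func applyPolicy(s BareMetalServer, role string)      {}

func main() {
	servers := []BareMetalServer{{ID: "rack1-u01"}, {ID: "rack1-u02"}}
	fabric := buildFabric(servers)
	fmt.Printf("fabric ready with %d nodes\n", len(fabric))
}
```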

SDFC is Live

After the fabric is created, SDFC continuously monitors the different pieces of hardware and the operating environments that are part of the fabric. On every “heartbeat”, it collects sufficient information from every nook and cranny of the mega datacenter.
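As a rough illustration, here is a tiny heartbeat-collection loop in the same spirit. The Heartbeat fields and the health thresholds are invented for this sketch; a real SDFC would gather far richer telemetry from agents on every server, switch, and power and cooling unit.

```go
package main

import (
	"fmt"
	"time"
)

// Heartbeat is a hypothetical report a node sends on every beat.
type Heartbeat struct {
	NodeID   string
	SentAt   time.Time
	CPUTempC float64
	DiskOK   bool
}

// monitor drains the heartbeat channel and keeps a live view of the fabric:
// the last time each node was heard from, plus any obviously unhealthy state.
func monitor(beats <-chan Heartbeat, lastSeen map[string]time.Time) {
	for hb := range beats {
		lastSeen[hb.NodeID] = hb.SentAt
		if !hb.DiskOK || hb.CPUTempC > 90 {
			fmt.Printf("warning: %s reported unhealthy state\n", hb.NodeID)
		}
	}
}

func main() {
	beats := make(chan Heartbeat, 8)
	lastSeen := make(map[string]time.Time)

	// Simulate a couple of beats; in reality these would stream in
	// continuously from every corner of the datacenter.
	beats <- Heartbeat{NodeID: "rack1-u01", SentAt: time.Now(), CPUTempC: 55, DiskOK: true}
	beats <- Heartbeat{NodeID: "rack1-u02", SentAt: time.Now(), CPUTempC: 95, DiskOK: true}
	close(beats)

	monitor(beats, lastSeen)
	fmt.Printf("tracking %d nodes\n", len(lastSeen))
}
```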

SDFC is Dynamic

The law of large numbers means that low-probability hardware failures (accidental or adversarial) become the norm in a fabric comprising thousands of servers. SDFC must not only notice these failures but also respond to them. It must therefore continuously run algorithms to understand what the heartbeat data means, and craft and deliver a response in short order, before applications are affected.
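Continuing the hypothetical heartbeat view from the previous sketch, the response side might look something like this. The thresholds and the Reboot and DrainAndRepair actions are made up; the point is only that the reaction is automatic, policy driven, and graded by how long a node has been silent.

```go
package main

import (
	"fmt"
	"time"
)

// Action is a hypothetical remediation the controller can take.
type Action string

const (
	Reboot         Action = "reboot"
	DrainAndRepair Action = "drain-and-repair"
)

// respond inspects the live heartbeat view and crafts a policy-driven plan:
// nodes that are a little late get a reboot, nodes that have been silent for
// much longer are drained so their workloads move elsewhere before
// applications notice anything.
func respond(lastSeen map[string]time.Time, now time.Time) map[string]Action {
	plan := make(map[string]Action)
	for node, seen := range lastSeen {
		silence := now.Sub(seen)
		switch {
		case silence > 5*time.Minute:
			plan[node] = DrainAndRepair
		case silence > 30*time.Second:
			plan[node] = Reboot
		}
	}
	return plan
}

func main() {
	now := time.Now()
	lastSeen := map[string]time.Time{
		"rack1-u01": now.Add(-10 * time.Second), // healthy, no action
		"rack1-u02": now.Add(-2 * time.Minute),  // late: try a reboot
		"rack2-u07": now.Add(-20 * time.Minute), // silent: drain and repair
	}
	for node, action := range respond(lastSeen, now) {
		fmt.Printf("%s -> %s\n", node, action)
	}
}
```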

Delivering the Response

The ability of SDFC to deliver automatic updates, in a manner that is seamless to the applications and the users of the mega datacenter, is key to making it resilient. You might remember the news from a few days ago that Verizon was willing to bring down its cloud for “forty something” hours to deploy its seamless servicing capability, such that “…virtually all maintenance and upgrades to Verizon Cloud will now happen in the background with no impact to customers”. That is a testament to how important this ability is in SDFC.
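One common way to achieve that kind of seamless servicing is a rolling, batch-at-a-time rollout; the sketch below shows the idea. It is not how Verizon or any particular provider actually does it, and the batchSize knob and the update callback are assumptions for illustration.

```go
package main

import "fmt"

// rollingUpdate upgrades the fleet a small batch at a time, so the vast
// majority of nodes keep serving traffic while each batch is drained,
// updated, and readmitted. Batch size is a policy knob: the smaller it is,
// the less capacity is ever offline at once.
func rollingUpdate(nodes []string, batchSize int, update func(string) error) {
	for start := 0; start < len(nodes); start += batchSize {
		end := start + batchSize
		if end > len(nodes) {
			end = len(nodes)
		}
		for _, n := range nodes[start:end] {
			// In a real controller, workloads would be live-migrated off
			// the node before the update and back onto it afterwards.
			if err := update(n); err != nil {
				fmt.Printf("halting rollout: %s failed: %v\n", n, err)
				return // stop early rather than degrade the whole fleet
			}
		}
		fmt.Printf("batch %d-%d updated; rest of fleet still serving\n", start, end-1)
	}
}

func main() {
	fleet := []string{"u01", "u02", "u03", "u04", "u05"}
	rollingUpdate(fleet, 2, func(node string) error {
		fmt.Println("updating", node)
		return nil
	})
}
```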

Summary

Building a modern datacenter is not just about virtualization, as many people are led to believe; it is primarily about building intelligent orchestration systems that are resilient enough to extend the failure boundaries to a level that is acceptable to the business. When designed well, these systems scale and stand the test of time. When not, every day is a firefight and every quarter is a massive manual software update.