A Tale of Two Data Centres
“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness…” Charles Dickens, A Tale of Two Cities.
With apologies to Charles Dickens.
The data centre manager is responsible for keeping their own, or their clients’, essential systems and processes running 24/7.
Power delivery is therefore critical: power protection systems must be available every second of every day, so maximising system availability must be the overriding objective of any installation.
Availability can be defined as the probability that an item will operate satisfactorily at a given point in time. Crucially, it accounts for both preventive and corrective maintenance downtime. It is most often expressed as the percentage of system uptime achieved in a year, and is calculated as the mean time between failures (MTBF) divided by the sum of MTBF and the mean time to repair (MTTR). MTBF can be improved through overall system design, i.e. by removing single points of failure, and MTTR can be reduced through product design. Over the years, many improvements have been made to UPS technology and configurations to increase availability.
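As a rough illustration of that relationship, the short Python sketch below computes availability as MTBF divided by MTBF plus MTTR. The function name and the MTBF and MTTR figures are hypothetical, chosen only to show how a shorter repair time raises availability when MTBF stays the same.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), returned as a fraction of 1."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures for illustration only, not taken from any product.
slow_repair = availability(mtbf_hours=100_000, mttr_hours=6.0)  # engineer must travel
fast_repair = availability(mtbf_hours=100_000, mttr_hours=0.1)  # hot-swap on site

print(f"Slow repair: {slow_repair:.6%}")
print(f"Fast repair: {fast_repair:.6%}")
```

With the MTBF held constant, the only difference between the two results is the time taken to repair, which is why repair time matters as much as failure rate.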
Data centre managers are naturally risk-averse people, as going ‘off line’ even for a few seconds can incur significant financial penalties under service level agreements. Downtime can result in loss of clients and loss of reputation, plus the incalculable cost of revenue missed from potential clients shopping for a more reliable alternative. A pretty stressful occupation!
The Human Element
So why, in the age of wisdom, do we still see headlines about power failures at large data centres? Even if the most advanced technology is employed to create a resilient and highly available UPS system, there is still room for human error, and there are many published statistics indicating the percentage of failures it causes. Of course, problems caused by lack of training are a completely separate issue and no-one can guard against wanton malice. However, it still appears that most of the high-profile data centre power outages have been linked to human intervention, accidental or otherwise.
Secure access to control rooms limits the chance of outside interference, and thorough training and procedures, including the two-man rule, reduce the risk of mistakes being made. Data centre managers put procedures and training in place to mitigate these risks as far as humanly possible, but how can technology help?
From a technological point of view, building redundancy into the UPS system reduces the risk of the system going off-line and increases availability.
As data centres have evolved from using a single UPS to parallel systems, availability has increased. The higher the availability, the lower the downtime. The introduction of redundancy, together with the low MTTR of rapid hot-swap modular designs, now means that six-nines (99.9999%) availability is possible with some of the UPSs on the market. This equates to some 32 seconds of downtime over a year: a relatively small amount of time, but to a data centre it is an eternity. So how can we push availability even higher?
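The arithmetic behind that downtime figure can be checked with a few lines of Python. This is a minimal sketch that assumes a 365-day year and treats "n nines" as an availability of 1 minus 10 to the power of minus n, so six nines is 99.9999%. The function and constant names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds, ignoring leap years


def annual_downtime_seconds(nines: int) -> float:
    """Expected downtime per year for an availability of 'nines' nines."""
    unavailability = 10.0 ** (-nines)  # e.g. six nines -> 0.000001
    return SECONDS_PER_YEAR * unavailability


for nines in (3, 6, 9):
    print(f"{nines} nines: {annual_downtime_seconds(nines):.6f} seconds per year")
```

Six nines works out at roughly 31.5 seconds a year, and each additional nine divides that figure by ten.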
Distributed Active Redundant Architecture
Following extensive failure analysis research and insights gathered from 25 years of field experience working with a large number of data centres and other critical environments, CENTIEL’s power protection solutions are reaching nine-nines levels of availability, reducing downtime risk and avoiding costly errors.
Distributed Active Redundant Architecture (DARA) is a concept introduced by CENTIEL into its 4th generation UPS. This active-redundant technology, alongside the elimination of potential single points of failure and true modular hot-swap capability, allows CENTIEL’s CumulusPower to deliver an industry-leading availability of nine nines (99.9999999%) to fulfil the needs of the most critical power applications. CumulusPower takes downtime from seconds to the milliseconds level.
A Tale of Two Data Centres
Imagine Dave managing a large data centre in a remote location, selected specifically for the low cost of real estate and the prevailing cooler ambient temperatures, which help to reduce the cost of cooling. A modern modular UPS has been installed to provide critical power protection and ensure the availability of the data for numerous high-profile, household-name clients.
Dave understood the limitation of choosing a standalone-type UPS in which the main component parts (rectifier, inverter and static switch) are modular, i.e. can be easily removed and inserted. It means that if there is a problem with, say, the rectifier, it can be swapped easily. However, if any one of these components fails, the whole UPS functionality goes down with it.
So Dave chose a modular system in which the rectifier and inverter are contained within individual power modules. However, one day the UPS display panel indicated an alarm associated with the single centralised static switch, and Dave immediately called the service provider to attend and investigate. The switch should only have taken a few moments to swap out but, due to the data centre’s location, it took the maintenance engineer several hours to reach the site. During that time the system lost its ability to transfer to static bypass. Dave felt very exposed sitting there looking at the alarm panels and red alarm LED, waiting for the engineer to arrive. Having this job is sometimes not the best of times.
Jim too manages a big data centre in another remote location. Jim understands the concept of decentralised architecture and how it increases system availability. He worked with his trusted advisors at CENTIEL to select a power protection system with the highest level of availability and installed their true modular UPS with DARA.
With Jim’s UPS, all the elements of rectifier, inverter and static switch are contained within each individual module. He knows that if a static switch fails in one module, he can still transfer to static bypass via the rest of the modules in the UPS frame.
One thing that was always at the back of his mind was the communications between modules. Surely duplication and redundancy of UPS components must also apply to this aspect of the system design? The simplest communications bus is a single cable. If this breaks or becomes disconnected, the entire system could potentially be compromised. For this reason, the ring circuit was introduced. If the circuit breaks, the signals can simply travel the other way around the ring.
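The difference between the two topologies can be shown with a small Python sketch. It models the single cable as a daisy chain of four modules (an assumption made for illustration, as are the module names) and checks whether every module can still reach every other after one link is broken.

```python
from collections import deque


def all_reachable(nodes, links):
    """Return True if every node can reach every other over the given links."""
    adjacency = {node: set() for node in nodes}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)

    # Breadth-first search from the first node.
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(nodes)


modules = ["M1", "M2", "M3", "M4"]  # hypothetical module names
single_cable = [("M1", "M2"), ("M2", "M3"), ("M3", "M4")]
ring = single_cable + [("M4", "M1")]  # the ring closes the loop

broken = ("M2", "M3")  # simulate one broken or disconnected link
bus_ok = all_reachable(modules, [link for link in single_cable if link != broken])
ring_ok = all_reachable(modules, [link for link in ring if link != broken])

print("Single cable survives one break:", bus_ok)
print("Ring survives one break:", ring_ok)
```

In this toy model the single cable is split in two by the break, while the ring stays connected because traffic can go the other way round. A second break in the same ring would still partition it, which is the gap that further redundancy is meant to close.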
But Jim, being the naturally risk-averse person that he is, wanted even more assurance and wanted to see how the designer was addressing this. CENTIEL’s Triple Mode communications bus was the answer. As its name suggests, there are three paths of communication between UPS modules and parallel frames, with three separate ring circuits and three brains in each module communicating with the three brains in all the other modules. It’s the belt, braces and buttons approach.
Jim likes to picture the difference in terms of a tightrope walker. If a tightrope breaks, the consequences will be dramatic and far-reaching. In the same way, a single communications bus is far more precarious than a Triple Mode ring connection, which is more like a bridge with multiple supports. Here, potential single points of failure are removed. Even if one or several bridge struts fail, the others will support the load.
While we all understand what the D and R mean in DARA (distributed and redundant, through decentralised, parallel, independent UPS modules with triple communications), what does the A stand for?
A is the automated democratic decision-making process, which is another real differentiator in CENTIEL’s 4th generation true modular UPS. The sum of the decisions determines the total system action or reaction to any issue.
In Dave’s UPS system in our first data centre example, if five modules share a load and one has a problem, it may signal all the modules to go to static bypass. With Jim’s system, democratic decision making recognises a fault in one module, and the other four remain online while the problematic module is switched off automatically, allowing for replacement or repair while the load is still protected. No single component takes decisions for the whole system. The automated process removes some of the human element which has led to the majority of data centre power failures in recent years.
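The general idea of a group decision can be illustrated with a toy majority vote in Python. This is not CENTIEL’s algorithm, whose details are not described here. It is a sketch under the assumption that each module reports whether it considers every other module healthy, and that a module is isolated only when a strict majority of its peers judge it faulty. All names are hypothetical.

```python
def modules_to_isolate(votes):
    """votes[observer][target] is True if 'observer' considers 'target' healthy.

    A module is isolated only when a strict majority of its peers report it as
    unhealthy, so no single module can decide for the whole system.
    """
    modules = list(votes)
    isolate = set()
    for target in modules:
        peers = [m for m in modules if m != target]
        unhealthy_votes = sum(1 for peer in peers if not votes[peer][target])
        if unhealthy_votes > len(peers) / 2:
            isolate.add(target)
    return isolate


modules = ["M1", "M2", "M3", "M4", "M5"]  # five modules sharing a load

# Healthy modules all agree that M3 is the only faulty module.
votes = {obs: {tgt: tgt != "M3" for tgt in modules} for obs in modules}
# The faulty module itself wrongly claims that every other module has failed.
votes["M3"] = {tgt: tgt == "M3" for tgt in modules}

isolated = modules_to_isolate(votes)
online = [m for m in modules if m not in isolated]

print("Isolated:", sorted(isolated))
print("Still online:", online)
```

In this model the faulty module’s claim that its peers have failed is outvoted, so only that module is switched off and the other four stay online.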
A static switch in a module goes down. Jim is alerted to the single module fault as his critical facilities continue to be maintained by the other UPS modules. Jim phones the engineer so it can be replaced while he grabs a quick coffee. Having this job is the best of times.
Naturally, cost often comes into the decision-making process when purchasing a UPS. However, the purpose of a UPS system must be to protect critical loads with the highest level of availability. There must be no potential single points of failure. Therefore, it is important to check the configuration and the definition of a modular system carefully and to seek expert advice before purchasing.
At CENTIEL, our design team has been working with data centres for many years at the forefront of technological development. We are the trusted advisors to some of the world’s leading institutions in this field. For this reason, we have developed our pioneering 4th generation true modular UPS system, CumulusPower, which offers industry-leading availability of 99.9999999% (nine nines), with low total cost of ownership (TCO) through its Maximum Efficiency Management (MEM) and low energy losses.
Article originally featured in DCNN November 2018