Power problems are the main cause of data centre downtime

Sheila Zabeu -

April 12, 2023

The saying goes that prevention is better than cure, and datacenters are no exception. Effective preventive maintenance of these environments starts with knowing the most common causes of downtime, so that they can be avoided.

According to a recent study by the Uptime Institute, data centre downtime rates have been gradually falling in recent years. While most facilities have experienced periods of operational downtime in the past three years, only a small proportion of outages have been considered serious or severe. Serious cases result in service and/or operational downtime with financial losses, compliance violations, reputational damage and security issues. Severe cases have consequences of the same type, but to a higher degree.

Uptime Institute Global Survey of IT and Data Center Managers, 2019-2022

However, the low frequency of serious and severe cases is not cause for celebration when considering the broader picture. When they do occur, outages are increasingly costly because of the growing reliance on digital services: a quarter of respondents said their most recent outage cost more than US$1 million in direct and indirect costs. Another 45% reported costs between US$100,000 and US$1 million.

Diving into the details of the study, one can deduce that the frequency of outages remains relatively high, despite technological improvements and physical redundancy. However, there is no evidence that the number of outages in datacenters is increasing relative to general IT, and it may be slowly falling. The Uptime Institute is researching this further.

Understanding the causes

Power problems remain the main cause of significant outages by a wide margin. The other causes are much less common. However, three of them stand out as particularly problematic: cooling failures, IT software/system errors and network problems. The frequency of problems with outsourced providers, e.g. software as a service (SaaS), hosting and cloud services, is increasing.

On-site power problems remain the biggest cause of significant site outages by a large margin

Downtime related to power problems can affect entire facilities and bring service delivery to an immediate halt. Diagnosing and restoring power can be done quickly, but restarting IT systems and synchronizing databases can take many hours. In addition, power failures can damage equipment and leave datacenters out of commission for long periods.

The biggest cause of power outages is the failure of UPS (ironically, Uninterruptible Power Supply) systems. Generator and transfer switch failures were experienced by just over a quarter of operators in a 2023 datacenter resiliency survey.

The breakdown of causes of these third-party outages, as perceived by the customers affected

Engineers at the Uptime Institute explain that static UPS models fail for several reasons:

  • Fans often fail because they are generally cheap and in constant operation.
  • Snubber (damping) capacitors can fail due to wear and tear.
  • Batteries fail because of prolonged use and a lack of close monitoring.
  • Inverters fail less frequently.

The study estimates that human error plays a role in two-thirds to four-fifths of all outages in datacenters. Such errors are mainly caused by employees not following procedures or by the procedures themselves being incorrect.

The current edition of the Uptime Institute’s annual survey used three primary sources: the Global Data Center Survey 2022, conducted in April and May 2022, with about 830 operators; the Data Center Resiliency Survey 2023, conducted in January and February 2023, with 739 respondents; and public information reported or tracked by the Uptime Institute in 2022.

Problems with fire

The use of lithium-ion batteries is growing among datacenters. According to Frost & Sullivan, this category represented 15% of the battery market for these environments in 2020, but the percentage is expected to reach 38.5% by 2025.

However, the Uptime Institute warns that lithium-ion batteries present a greater risk of fire than valve-regulated lead-acid batteries, regardless of their specific chemistry and construction, a position endorsed by the US National Fire Protection Association and other regulatory bodies. Because cell breakdown in lithium-ion batteries releases combustible gases as well as oxygen, and the fire can spread uncontrollably, such fires are notoriously difficult to fight.

Many datacenter operators nevertheless find the risk-to-benefit ratio of lithium-ion batteries acceptable. According to a 2021 Uptime Institute survey, almost half of operators have adopted this technology in their facilities, an increase from around a quarter in the previous three years. The 2022 edition found even higher levels of adoption.