Frequency and severity of datacenter outages are falling

Sheila Zabeu

May 09, 2024

As the number and size of datacenters expand to meet the demands of new Artificial Intelligence (AI) applications, one might expect the number of outages in these facilities to increase as well. However, data from a report by the Uptime Institute reveals a consistent downward trend in the frequency and severity of outages over several years.

More than half (55%) of the datacenter operators interviewed in 2023 reported having experienced outages in the last three years, a decrease compared to the 60% share in 2022 and 69% in 2021. In addition, only one in 10 outages in 2023 was classified as serious or severe, which represents an improvement of four percentage points on the 2022 responses and 10 percentage points compared to 2021.
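
To make the comparison concrete, the short Python sketch below separates the change in percentage points from the relative change for the survey figures quoted above. The 2022 and 2021 shares of serious or severe outages (14% and 20%) are not stated in the text; they are back-calculated here from the reported improvements of four and 10 percentage points, so treat them as an assumption. The function names are illustrative only.

```python
# Share of operators reporting an outage in the previous three years (%),
# as quoted from the Uptime Institute survey results above.
outage_share = {2021: 69, 2022: 60, 2023: 55}

# Share of outages rated serious or severe (%). 2023 is "one in 10";
# the 2022 and 2021 values are ASSUMED, back-calculated from the stated
# improvements of 4 and 10 percentage points.
severe_share = {2021: 20, 2022: 14, 2023: 10}


def point_change(series, start, end):
    """Absolute change in percentage points between two survey years."""
    return series[end] - series[start]


def relative_change(series, start, end):
    """Change relative to the starting year's value, in percent."""
    return 100 * (series[end] - series[start]) / series[start]


for label, series in (("any outage", outage_share),
                      ("serious or severe", severe_share)):
    pts = point_change(series, 2021, 2023)
    rel = relative_change(series, 2021, 2023)
    print(f"{label}: {pts:+d} points, {rel:+.1f}% relative (2021 to 2023)")
```

Under these assumptions, the share of operators reporting any outage fell by 14 percentage points between 2021 and 2023, while the share of outages rated serious or severe halved in relative terms.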

Although the frequency of outages has decreased, the Uptime Institute warns that there is no room for complacency, as the rates are still worrying. The high financial costs and the reputational damage that result from datacenter outages remain a major source of concern and a strong driver for investment. Furthermore, migration to public clouds does not necessarily mean that there will be fewer outages.

Another concern is the instability of electricity grids. There is evidence that the global shift to more dynamic, renewable-based grids will reduce their reliability, the Uptime Institute points out. If so, datacenters could suffer more outages, often because uninterruptible power supplies (UPS) or generators fail to respond adequately when grid power is lost.

Extreme weather events such as high temperatures, windstorms and floods have also been associated with datacenter outages in recent years. And this trend is likely to intensify, increasing the risk of these facilities being paralysed.

And who would have thought that the adoption of new technologies to increase the resilience and energy performance of datacenters could also bring increased risks of disruption? According to the Uptime Institute report, the use of software-based distributed resilience to dynamically move traffic and workloads can reduce outage risks over time, but increase them during an introductory period. Another example is the adoption of liquid cooling systems, which can mitigate some thermal risks but increase the risks of downtime in the event of component failure.

Despite the increase in risk factors, the Uptime Institute’s 2023 report suggests that the rate of outages per facility is decreasing. One possible explanation is that most organisations are investing more in redundant physical infrastructure year on year.

For the Uptime Institute, this trend contradicts expectations that multi-site approaches undermine physical site redundancy strategies. While the industry may indeed be moving towards software-based distributed resilience models, maintaining and raising the redundancy of on-site facilities remains a priority for most datacenter operators.

Causes of interruptions

The task of identifying the main causes of outages in datacenters can often be challenging due to the multifaceted nature of incidents.

The Uptime Institute’s annual surveys have consistently shown that interruptions in on-site power distribution are the most common cause. This is not surprising, given that IT hardware is highly susceptible to variations in its power supply, such as voltage fluctuations lasting only fractions of a second. Failures or poor performance of cooling equipment, by contrast, can generally be tolerated for longer periods.