How Confident are you in the Thermal Stability of your Data Center?
What is the thermal stability of the cooling system in a data center?
In a data center, the cooling system plays a crucial role in maintaining a stable temperature to protect the IT hardware from overheating. The cooling system removes heat from the white space to the outside atmosphere.
What would happen if the cooling system stops working?
The IT equipment would continue to produce heat even without proper cooling, leading to the accumulation of heat in the white space. As a result, the hot air would circulate through the IT equipment, causing the overall temperature to rise. Ultimately, the IT devices would activate their thermal protection and shut down one by one until none are left on.
What happens to cooling equipment, such as a chiller or CRAC, when the power goes out and comes back on?
Without a UPS to back it up, your cooling equipment stops when the power does, and you face the same situation described above: cooling is interrupted and the risk of overheating rises.
If you have not faced any cooling system issues so far, even though you are not following the best practices for continuous cooling, you may simply have been lucky. It is possible that the current IT load and equipment density are low enough to avoid any risks, and that you have not experienced any major power failures. However, this does not guarantee safety in the future.
Sensible heat transfer is based on the rule that heat always moves from hotter to cooler areas. In hot climates, the "refrigeration cycle" principle has been used for over two centuries to move heat in the opposite direction. The refrigeration cycle pumps heat from a low temperature level to a high temperature level, which is why it is also called a "heat pump".
This method involves changing the state of the coolant from gas to liquid and back again, which requires a compressor in the system. However, the compressor can only work with gas and may malfunction if it tries to work with liquid. Therefore, all cooling devices with a cooling cycle system have a safety feature that requires them to wait a few minutes after a power outage to resume working. This means that even if the power goes off for a short time, it will take about 3 to 6 minutes for modern cooling devices to start working again.
It's important to ensure that your cooling system can maintain a stable temperature in your data center room even in the worst-case scenario. This means that the cold air reserve and the cold coolant supply accumulator should be able to keep the temperature within limits until cooling resumes. As the IT load and heat density increase, the cooling system may no longer be able to maintain stable temperatures.
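To get a feel for how small the cold air reserve is, here is a minimal Python sketch of an energy balance. It assumes the only thermal buffer is the air itself, with textbook values for air density and specific heat; the function names, the air volume per rack and the start temperature are hypothetical inputs, not values from this article. Because it ignores the thermal mass of the equipment, the building and any chilled water, it is a pessimistic rough estimate and does not reproduce a proper simulation.

```python
# Rough air-only energy balance: how long until the air reaches a limit
# temperature when the IT load keeps running and cooling has stopped.
# Assumption: approximate textbook properties of air at room conditions.
AIR_DENSITY_KG_M3 = 1.2
AIR_SPECIFIC_HEAT_J_KG_K = 1005.0


def seconds_to_limit(it_load_w, air_volume_m3, start_c, limit_c):
    """Time in seconds for the air to warm from start_c to limit_c."""
    heat_capacity_j_per_k = (
        AIR_DENSITY_KG_M3 * AIR_SPECIFIC_HEAT_J_KG_K * air_volume_m3
    )
    return heat_capacity_j_per_k * (limit_c - start_c) / it_load_w


def rides_through(it_load_w, air_volume_m3, start_c, limit_c, restart_delay_s):
    """True if the air reserve alone outlasts the cooling restart delay."""
    return seconds_to_limit(it_load_w, air_volume_m3, start_c, limit_c) > restart_delay_s


if __name__ == "__main__":
    # Hypothetical example: one 5 kW rack sharing 10 m^3 of air,
    # compared with a 3-minute (180 s) compressor restart delay.
    t = seconds_to_limit(5000, 10.0, 25.0, 63.0)
    print(f"Air-only reserve: {t:.0f} s")
    print("Rides through restart delay:", rides_through(5000, 10.0, 25.0, 63.0, 180))
```

Even with generous assumptions, the air alone buys far less time than the 3 to 6 minutes a compressor needs to restart, which is why the coolant accumulator and continuous cooling matter.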
Thermal Stability and Tier Classification
According to the Uptime Institute, Tier IV facilities must ensure "continuous cooling" to maintain thermal stability. This is a mandatory requirement for this level of standard compliance. However, for Tier I to Tier III facilities, thermal stability is only an optional recommendation and not a compulsory criterion.
If some of the servers shut down on thermal protection during every power outage, the Tier level of the facility means little.
The ASHRAE standard also sets guidelines for thermal stability
It specifies strict limits to ensure the reliability of disk and tape storage devices when the temperature changes. For example, the minimum temperature is 15°C, the maximum temperature is 32°C, and the rate of change of temperature is less than 5°C/h.
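As a minimal sketch, the three limits quoted above can be checked against a series of recorded inlet temperatures. The Python below assumes samples are (time in seconds, temperature in °C) pairs sorted by time, and it interprets the rate limit as "the temperature may not change by 5 °C or more within any one-hour window", which is a simplification of how the guideline defines rate of change. The function and constant names are illustrative.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (time in seconds, temperature in °C)

# Limits as quoted in the article text.
T_MIN_C = 15.0
T_MAX_C = 32.0
MAX_CHANGE_C_PER_HOUR = 5.0


def check_limits(samples: List[Sample]) -> List[str]:
    """Return a list of violation messages; an empty list means all clear.

    Assumes samples are sorted by time.
    """
    violations = []
    for t, temp in samples:
        if temp < T_MIN_C:
            violations.append(f"t={t:.0f}s: {temp:.1f} C is below the minimum")
        elif temp > T_MAX_C:
            violations.append(f"t={t:.0f}s: {temp:.1f} C is above the maximum")

    # Largest temperature swing between any two samples at most one hour apart.
    worst_swing = 0.0
    for i, (t_i, temp_i) in enumerate(samples):
        for t_j, temp_j in samples[i + 1:]:
            if t_j - t_i > 3600:
                break
            worst_swing = max(worst_swing, abs(temp_j - temp_i))
    if worst_swing >= MAX_CHANGE_C_PER_HOUR:
        violations.append(f"swing of {worst_swing:.1f} C within one hour")
    return violations


if __name__ == "__main__":
    readings = [(0, 22.0), (600, 24.0), (1200, 28.0)]
    for message in check_limits(readings):
        print(message)
```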
The table below displays simulation results for various rack densities. A white space with a rack density of 5 kW reaches thermal shutdown in 4 minutes, while one with a rack density of 8 kW reaches it in 2 minutes. The shutdown temperatures are assumed to be between 63 and 75 °C.
It is better to face the facts before taking any risks, as calculated scenarios are based on ideal conditions and may not reflect the actual risks involved.
What Are the Best Practices for Controlling Thermal Stability?
If a professionally designed management system exists in your data center, you are likely monitoring the temperatures of the cold corridors in the white space. It is important to note that these temperatures should be measured at intervals of less than one minute and sufficient samples should be taken before being recorded for trend analysis.
One way to evaluate the thermal stability of the white space cold corridors is to measure the changes in temperature during and after power outages. By tracking these changes over time, one can estimate how the thermal stability will vary with the growth of IT load and density.
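A minimal sketch of such an evaluation in Python: given cold-corridor readings around one outage, it reports the baseline temperature, the peak, the total rise and the average rate of rise. The function name, the returned field names and the sample data are hypothetical; readings are assumed to be (time in seconds, temperature in °C) pairs sorted by time.

```python
from typing import Dict, List, Tuple

Sample = Tuple[float, float]  # (time in seconds, temperature in °C)


def summarize_outage(samples: List[Sample], outage_start_s: float) -> Dict[str, float]:
    """Summarize how the temperature responded to one power outage."""
    before = [temp for t, temp in samples if t <= outage_start_s]
    after = [(t, temp) for t, temp in samples if t > outage_start_s]
    if not before or not after:
        raise ValueError("need readings both before and after the outage start")

    baseline_c = before[-1]  # last reading before cooling was lost
    peak_time_s, peak_c = max(after, key=lambda s: s[1])
    rise_c = peak_c - baseline_c
    minutes_to_peak = (peak_time_s - outage_start_s) / 60.0
    return {
        "baseline_c": baseline_c,
        "peak_c": peak_c,
        "rise_c": rise_c,
        "minutes_to_peak": minutes_to_peak,
        "rise_rate_c_per_min": rise_c / minutes_to_peak,
    }


if __name__ == "__main__":
    # Hypothetical readings taken every 30 seconds; power fails at t = 30 s.
    readings = [(0, 24.0), (30, 24.0), (60, 25.0), (90, 27.0), (120, 29.0), (150, 28.0)]
    print(summarize_outage(readings, outage_start_s=30))
```

Storing one such summary per outage, together with the IT load at the time, gives a simple trend of rise rate against load.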
The distribution of power density across the white space affects where sensors should be placed and how many are needed. By adjusting sensor placement and count to match this distribution, the tests can be made more sensitive.
To accurately evaluate the thermal resilience of a data center, it is important to monitor the fluctuations in inlet air temperature during power failures and restoration, as well as to keep an eye on the sensors that provide essential information about the status of the entire cooling system. It may be necessary to upgrade the infrastructure management system so that it can collect and record all of these values.
It's important to test the thermal stability of your data center as soon as possible, especially if you're not using a continuous cooling system. Repeating these tests regularly is highly recommended, including in data centers that do use continuous cooling. This will help confirm that the equipment is functioning correctly.