A data center is made up of many components, all of which are expected to operate reliably 24/7. Arguably the most important are those that supply power to the IT equipment and to the supporting systems that keep the data center cool, secure, and operational.
Preventative maintenance and a good operations team are vital to ensuring that power equipment does not fail. Another key, yet often overlooked, aspect is giving the IT team information on how to spot degradation and the early warning signs that precede the failure of major power equipment, such as UPS systems, PDUs, and generators. To do this, the IT and other teams in the data center should become familiar with the power equipment so they can recognize when something is amiss, whether by sight, sound, or even smell. Daily equipment inspection walkthroughs can be led by IT team members, which also empowers them to raise questions about how the equipment is operating and what to expect from updates and changes.
In addition to regular service and inspections, infrared scans should be conducted routinely to find hot spots within units. Those hot spots are indicators of loose connections, bad contacts, and other issues. Newer equipment is often fitted with scanning windows in locations that reveal the most about the interior without needing to open the equipment. For UPS systems, cooling fans that are noisy or spinning at maximum speed generally indicate overheating and that a failure may be imminent.
Knowing the load on each piece of equipment helps ensure that its rating is not exceeded, especially when adding more equipment; the typical goal is to stay below 90-95% of the rating. Operating beyond the nameplate rating, while possible to do for long periods, stresses the equipment and will lead to an earlier failure. Rotary UPS systems should be checked using vibration analysis to assess the condition of the bearings and whether they are nearing failure. Old batteries, loose connections, aging capacitors, and circuit breakers that aren’t exercised regularly are all potential issues.
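The load check described above can be expressed as simple arithmetic. The following is a minimal sketch, not a sizing tool: the function names, the kW figures, and the default 90% target are illustrative assumptions, and real ratings should come from the equipment nameplate and the manufacturer's guidance.

```python
def load_fraction(load_kw: float, rated_kw: float) -> float:
    """Return the load as a fraction of the nameplate rating."""
    if rated_kw <= 0:
        raise ValueError("rated_kw must be positive")
    return load_kw / rated_kw


def can_add_load(current_kw: float, added_kw: float, rated_kw: float,
                 target_fraction: float = 0.90) -> bool:
    """Check whether adding equipment keeps the unit at or below the target.

    target_fraction is an assumed policy value; the article cites a
    typical goal of staying below 90-95% of the rating.
    """
    return load_fraction(current_kw + added_kw, rated_kw) <= target_fraction


def headroom_kw(current_kw: float, rated_kw: float,
                target_fraction: float = 0.90) -> float:
    """Return how much load can still be added before reaching the target."""
    return target_fraction * rated_kw - current_kw
```

For a hypothetical 500 kW UPS carrying 400 kW, `can_add_load(400, 40, 500)` passes because the result is 88% of the rating, while adding 60 kW would push the unit to 92% and fail the 90% target.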
Batteries are a common failure point for UPS systems, and a battery failure can lead to a catastrophic failure that affects everything downstream. Larger UPS systems, and almost all of those built for data centers, include monitoring of the health of their associated batteries. Often, though, a well-maintained UPS system fails not because of its batteries but from other causes, such as overloading, shorting, or loading with equipment that draws a large surge current when energized.
PDU issues are often wiring-related and lead to sparking and overheating. Poorly fitting plugs can also cause sparking at the sockets. Switchgear, meters, and gauges should be checked for changes in performance or visual appearance. Even though it may seem like a stretch, the internal data center operations teams should also stay current with the backup generators and how increasing demand affects their performance, such as running hotter than normal.
Knowing the equipment well enough to understand what is typical takes experience gained by walking the spaces. Over time, IT and operations personnel become adept at spotting trouble because they know their equipment and notice changes from day to day.
Once an issue is observed, the proper solution should be identified and applied. This usually means going beyond applying a ‘patch’ and waiting weeks or months until budget or personnel are available for a proper fix. If a UPS has failed and the cause can’t be identified, it is better to replace it in its entirety than to replace parts and hope that the issue is solved. Burnt-out PDUs and other damaged distribution equipment should be removed, without skimping on the replacements.
Environmental monitoring is a good way to understand the data center from another perspective. Although site walks and teams in the data center might spot anomalies, sensors can identify hot spots and air conditions that may otherwise go unnoticed.
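One way sensor data can surface hot spots is by comparing each reading against a known baseline for that location. The sketch below assumes hypothetical sensor names, temperatures in degrees Celsius, and an arbitrary 5 °C allowed rise; none of these values come from a particular monitoring product, and the thresholds would need to be set for the actual site.

```python
def flag_hot_spots(readings_c: dict[str, float],
                   baseline_c: dict[str, float],
                   max_rise_c: float = 5.0) -> list[str]:
    """Return the sensors whose reading exceeds its baseline by more than max_rise_c.

    readings_c and baseline_c map a sensor name to a temperature in Celsius.
    Sensors without a baseline are skipped rather than guessed at.
    """
    flagged = []
    for sensor, temperature in readings_c.items():
        baseline = baseline_c.get(sensor)
        if baseline is None:
            continue  # no reference value, so nothing to compare against
        if temperature - baseline > max_rise_c:
            flagged.append(sensor)
    return sorted(flagged)
```

Comparing against a per-sensor baseline rather than a single room-wide limit mirrors the point made earlier about knowing what is typical for each piece of equipment: a location that normally runs warm is flagged only when it departs from its own norm.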