While the rule of ever-increasing processing power known as Moore's law has undoubtedly provided high-performance computing with capacity that continues to push the boundaries, nothing comes without its price. And that price is literally the cost of the energy bills received by data centers where, on average, 30 to 50 percent of the power being drawn into a site is siphoned off to deal with the heat generated by the innumerable transistors crammed onto motherboards in today's powerful server racks. This is true despite the new world of green, energy-conscious HPC, where virtually every hardware vendor is touting energy-efficient designs. "If you're looking at HPC-type applications and the power that's being put into individual racks, that's increased from about 3 to 4 kilowatts per rack to maybe 30 kilowatts in about 10 years, so there's this dramatic increase in the number of processors people can put into a rack," says Yogendra Joshi, a professor at Georgia Tech's Woodruff School of Mechanical Engineering. "You're having to supply a lot of cooling energy to remove that heat, and ... that extra power ... is going to cooling rather than computing."
A team of researchers led by Joshi is using a 1,100-square-foot simulated data center to study heating and cooling strategies for future centers. The simulated data center runs at roughly 500 watts per square foot and comprises two large banks of racks, one of which is a mock-up of a real rack. The mock-up consists of heaters, carefully calibrated to mimic the heat emanating from a real working rack unit. Along with infrared sensors measuring the temperature coming off the racks and on the motherboards inside, the data center is also equipped with airflow sensors — combined with fog generators and lasers — to paint a picture of how the air is moving through the room and how effective the cooling fans are at circulation. The mock data center is also equipped with partitions so that Joshi and his team can experiment with different equipment placements to obtain maximal heat management and cooling airflow. Joshi is also working with computer scientists at Georgia Tech to develop algorithms that will monitor hotspots in a configuration and reassign a computationally heavy workload to another, cooler area of the data center. Alongside these modeling algorithms is virtualization technology, in which virtual machines running on top of a computer's operating system can be moved to other physical machines. "If you could monitor the temperatures in the facility [in] real time, then you could allocate computer moves as a function of time. So let's say a data center has certain load profiles, maybe you have maximum load that occurs during the middle of the day and before or after," Joshi says. "Through virtualization, you could allocate that load amongst the machines whichever way you please and if you can do that in an energy-efficient fashion and allocate the load to the coolest part of the data center, then you would not need to spend as much cooling power as if you were doing it blindly and equally distributed among the machines."
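The thermally aware allocation Joshi describes can be sketched as a simple greedy policy: given live temperature readings per rack, place each workload on the coolest rack that still has power headroom. This is only an illustrative sketch of the idea, not Georgia Tech's actual algorithm; the rack names, capacity figures, and temperature readings below are invented.

```python
def place_workloads(rack_temps, rack_capacity_kw, workloads_kw):
    """Greedily assign each workload to the coolest rack with spare capacity.

    rack_temps       -- dict of rack id -> current inlet temperature (deg C)
    rack_capacity_kw -- power budget per rack (kW)
    workloads_kw     -- list of workload power draws (kW)
    Returns a dict of rack id -> list of assigned workload indices.
    """
    load = {rack: 0.0 for rack in rack_temps}
    placement = {rack: [] for rack in rack_temps}
    for i, kw in enumerate(workloads_kw):
        # Only racks that can absorb this workload without exceeding budget.
        candidates = [r for r in rack_temps if load[r] + kw <= rack_capacity_kw]
        if not candidates:
            raise RuntimeError("no rack has spare capacity")
        # Prefer the coolest eligible rack.
        coolest = min(candidates, key=lambda r: rack_temps[r])
        placement[coolest].append(i)
        load[coolest] += kw
    return placement

# Hypothetical readings: rack-B is coolest, so it fills up first.
temps = {"rack-A": 24.5, "rack-B": 19.0, "rack-C": 27.8}
print(place_workloads(temps, rack_capacity_kw=15.0, workloads_kw=[5.0, 5.0, 5.0, 5.0]))
```

In practice the reassignment would happen continuously, using the real-time sensor feed Joshi mentions, and migration costs would also factor into the decision.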
Metrics, metrics, metrics
How can one get a grip on heating and cooling issues in a data center? Joshi says it is all about metrics. "You have to be able to measure things; if you don't measure, then you don't know how good or bad you are," Joshi says. "Most data centers are not doing a lot of measurements, but it should be relatively simple to put temperature sensors at the inlet and exit of each server, and similar temperature sensors can be deployed to other locations in a facility as well." For a helpful way to measure energy efficiency — or lack thereof — in a data center, Joshi recommends the Power Usage Effectiveness metric by The Green Grid, a global consortium of IT companies and professionals seeking to improve energy efficiency in data centers. This PUE score compares the amount of energy a site draws from the local power grid with the amount of power that the site actually uses to keep things up and running. For example, a bad PUE score would be two, meaning that for every two watts a server room draws from the grid, only one watt is used for actual computing, due to inefficiencies in either the infrastructure of the data center or a lack of energy-efficient hardware.
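As The Green Grid defines it, PUE is simply total facility power divided by the power reaching the IT equipment, so the article's example works out like this:

```python
def pue(total_facility_kw, it_equipment_kw):
    """Power Usage Effectiveness: total facility power / IT equipment power.

    A PUE of 1.0 would mean every watt drawn from the grid goes to
    computing; everything above 1.0 is overhead such as cooling.
    """
    return total_facility_kw / it_equipment_kw

# The article's example: two watts drawn for every watt of computing.
print(pue(2.0, 1.0))  # -> 2.0
```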
Joshi and many other researchers on the cutting edge of HPC heating and cooling research are looking toward liquid cooling technology as the ultimate way to usher heat out of the data center and keep energy bills down. Liquid cooling technologies are getting closer to the processor itself, which might make some IT folks nervous at first, but anyone who knows a bit of HPC history will recall that liquid cooling was first seen in big mainframe systems more than 20 years ago, and worked rather well. As a coolant, water is capable of capturing and transporting heat about 4,000 times more efficiently than air. Even so, the initial reaction to the idea of water being anywhere near IT equipment is usually skepticism. "The normal reaction of a data center user is that he is still afraid of having water in his systems, but our argument is that we already had water in bipolar mainframes some 20 years back and had good experience with these systems, and there's already a system now available with liquid cooling called cold water cooling," says Bruno Michel, manager of Advanced Thermal Packaging at IBM's Zurich Research Laboratory. "The use of liquid in computers is getting more and more straightforward as time goes on."
Michel helps to lead a team of researchers at the Swiss Federal Institute of Technology in Zurich that is working in conjunction with IBM to incorporate liquid cooling into a supercomputer's architecture with a newly designed, water-cooled, 10-teraflop supercomputer called Aquasar. The system uses a cooling approach known as chip-level or chip-attached water cooling, where each server blade is equipped with a microscale high-performance liquid cooler for each processor. The other liquid cooling method is called direct-chip backside cooling, where the silicon is in direct contact with the fluid. (Aquasar will eventually use this method because it is more effective.) Aquasar requires about 10 liters of water, circulated by a pump at approximately 30 liters per minute around the system in a closed-circuit setup. The water, continually heated by the chips as it circulates, passes through a passive heat exchanger that removes the waste heat and delivers it to the university's heating systems, where it is used for hot water and space heating during the year, reducing the campus' carbon footprint by some 30 tons.
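A back-of-the-envelope calculation shows how much heat a loop like Aquasar's can carry, using Q = ṁ·c·ΔT. The 30-liters-per-minute flow rate comes from the article; the temperature rise across the servers is an assumed figure for illustration, not a published Aquasar number.

```python
C_P_WATER = 4186.0   # specific heat of water, J/(kg*K)
DENSITY_WATER = 1.0  # kg per liter (approximate, near room temperature)

def heat_removed_watts(flow_l_per_min, delta_t_kelvin):
    """Heat power absorbed by the water: Q = mass flow * c_p * temperature rise."""
    mass_flow_kg_s = flow_l_per_min * DENSITY_WATER / 60.0
    return mass_flow_kg_s * C_P_WATER * delta_t_kelvin

# 30 L/min with an assumed 10 K rise across the racks: roughly 21 kW.
print(round(heat_removed_watts(30.0, 10.0)))
```

This is why modest water flows can cool racks that would need enormous volumes of air: the same heat load at the same temperature rise would require orders of magnitude more air by mass.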
When it comes to maintenance issues, Michel says the only real challenge is ensuring that the water pumps are working properly. "The weakest element is the liquid pump, so we have two pumps in the systems for redundancy. If one fails, the other takes over and sends an alert," Michel says. "Because we have microfluidic cooling, we have very small channels and we need to make sure that we have a filter and we also have a chemical additive in the fluid to prevent electrochemical corrosion — this is a known composition that is contained in all systems that contain copper tubing and heat exchangers. It is hermetically sealed, so there is a very small risk of failure, and we are also detecting the internal pressure of the system and [detecting] very small losses of liquid."
Air before liquid?
According to Georgia Tech's Joshi, liquid cooling will certainly play more and more of a starring role in the battle to keep data centers cool, but it will be just one of several options. Ultimately, it all comes down to the size of the data center, and the watts each rack is eating up throughout its workday. "You want to eke out as much as you can from air cooling. ... Fifteen kilowatt racks are good with air, but anything more than that and you're probably going to be moving towards liquid cooling," he says.