This story has been updated from a previous version to note that MareNostrum 5 will be made available to a consortium of four member countries: Croatia, Portugal, Spain, and Turkey.
NEW YORK – Three new pre-exascale supercomputers are expected to come online in Europe within the next year, providing a computational boost that users said will impact genomics data sharing efforts and drive the creation of machine-learning tools and associated diagnostics.
Pre-exascale supercomputers are capable of more than 150 petaflops, or 150 million billion calculations per second, many times the capacity of the current generation of installed supercomputers. Three sites — in Finland, Italy, and Spain — were selected in 2019 through the European Commission's European High Performance Computing Joint Undertaking initiative to host the new supercomputers. An additional five sites in Bulgaria, the Czech Republic, Luxembourg, Portugal, and Slovenia were selected to host supportive petascale computing machines. The total HPC upgrade has a budget of €840 million (about $1 billion) and is part of the EU's strategy to improve computational capabilities throughout the region.
Two of the three new pre-exascale computers will be designed specifically to resolve biological research questions and will also be adjacent to the Spanish and Finnish nodes for ELIXIR, the European Life Sciences Infrastructure for Biological Information, a European initiative that enables laboratories across the region to share and store data.
One of these pre-exascale computers, called LUMI, will be based in Kajaani, Finland, and is expected to be operational by the end of 2021. The second, called MareNostrum 5, will be housed at the Barcelona Supercomputing Center in Spain and should be operational by year-end. The third, called Leonardo, is based in Bologna, Italy, and should also come online within the next year.
Tommi Nyrönen, head of ELIXIR Finland, said that LUMI will offer new opportunities not only for ELIXIR but for other data sharing efforts, such as the 1+ Million Genomes Initiative, a European project to make more than a million genomes accessible to researchers in the region. Finland is also hosting FinnGen, an effort to genotype about 500,000 people in its biobanks. In turn, there is an opportunity to apply policy and technical standards set by initiatives like the Global Alliance for Genomics and Health (GA4GH) to these caches of more widely accessible data.
"It's my goal to ensure that resources like LUMI will be more available for life sciences computing problems, and we are pushing that ELIXIR data and biobank data [to] be computed in LUMI," said Nyrönen. "I think that in a few years it will be a reality that we will have structured datasets near massive supercomputers." He stressed that it is particularly important for the European Commission to have European computing capacities near General Data Protection Regulation-compliant data.
Europe's computational infrastructure was largely built to serve other data-rich research fields such as physics or astronomy, with a focus on carrying out simulations. While some in the genomics field have harnessed these same resources for their own simulations, Nyrönen said that making data available in batches or silos that can be harnessed for analyses, such as for developing personalized medicine algorithms using genomics data, would be an improvement.
"There could be a way to execute containerized loads, perhaps employing GA4GH architectures when designing those interfaces, and to make the computational loads of the life sciences more compatible with existing large-scale e-infrastructures," said Nyrönen.
There is also the issue of whole-genome genotyping and next-generation sequencing data, the volume of which has grown exponentially, particularly over the past five years, Nyrönen noted. Data loads continue to increase, a problem that the new supercomputers could help alleviate.
"Over the past five years, sequencing prices have plummeted but data management costs have stayed the same," Nyrönen said. "So we have the same number of people handling more data." This in turn limits access for scientists and overwhelms data stewards, who cannot manage such data loads. Resources like LUMI, he said, can help resolve that challenge.
LUMI, which means "snow" in Finnish, is based far in the north of Finland at an old, repurposed paper mill, a location that lent itself well to being converted into a supercomputing center.
"Supercomputers require a similar concrete slab as a paper machine so they can't vibrate," noted Nyrönen. "Also, paper making requires a lot of electricity, so the wiring is already in place," he said. Excess heat from LUMI will feed into Kajaani's heating system and provide about 20 percent of the heat in the city, which has a population of about 37,000.
The supercomputer is expected to take up more than 150 square meters of space, making it about the size of a tennis court, and the system will weigh nearly 150,000 kilograms. Developers claim that the computer, which costs €200 million, will have a peak performance of 550 petaflops, rivaling the world's fastest computer, Fugaku in Japan, which has a peak performance of 513 petaflops, although an upgraded version of Fugaku is also expected to come online in the coming months.
Hewlett Packard Enterprise is supplying the supercomputer, an HPE Cray EX, with next-generation central processing units and graphics processing units. The computer is also being designed to support artificial intelligence, combining traditional large-scale simulations with massive-scale data analytics in order to solve research problems, capabilities that should lend themselves to biological research.
"Overall, we see great potential for diagnostic AI development leveraging data from biomedical sciences supported by computing infrastructures hosted by the ELIXIR nodes," commented Nyrönen.
Ten countries — Belgium, the Czech Republic, Denmark, Estonia, Finland, Iceland, Norway, Poland, Sweden, and Switzerland — are participating in the LUMI consortium and will therefore have shares of the infrastructure allocated to them once the computer is online. Others that wish to access LUMI will have to go through a resource allocation process via the European Commission, Nyrönen said.
A challenge will be deciding how to securely manage sensitive data, such as whole-genome data from biobank participants, using the capacity provided by LUMI and other supercomputers. Nyrönen is co-leader of the infrastructure working group within the 1+ Million Genomes Initiative and also leads a work stream within GA4GH focused on data use and researcher identity. Tools and mechanisms to address these challenges are being developed, but they will also require adoption.
"My goal is that the nine EU countries participating in LUMI will see more development of sensitive data-processing technology and policy stack on [high-performance computing] through these and other projects in the foreseeable future," said Nyrönen.
MareNostrum 5 and Leonardo
The second of Europe's new pre-exascale supercomputers being designed with biological research in mind is MareNostrum 5, which will be housed at the Barcelona Supercomputing Center and made available to a consortium of four member countries: Croatia, Portugal, Spain, and Turkey.
The computer, which should also be operational by year-end, should deliver a peak performance of around 200 petaflops, roughly seven times the computational power of the current MareNostrum 4 computer, which was installed in 2017. MareNostrum 4 has been hosted in an old chapel at the Polytechnic University of Catalonia in Barcelona, but given the size of MareNostrum 5, several racks will also be housed at the BSC's new corporate headquarters nearby. The overall project cost, including purchase price, installation, and five years of operation, is about €223 million.
Alfonso Valencia, director of the life sciences department at BSC and head of ELIXIR Spain, said the investments in pre-exascale supercomputers show the EU's commitment to making the region competitive with Japan, China, and the US, which are similarly gearing up for improvements to exascale supercomputing. He also noted that it will represent a shift for the genomics community, one made possible by recent improvements in computing technology.
"It's not a secret that high-performance computing has really not been accessible to biology, because the computers themselves have not been particularly adequate until now for genomics," said Valencia. He noted that computing centers have historically been designed to serve physics and engineering research, which revolve around simulation exercises, not biology.
"Biology is an area that is increasingly important in terms of science and applications and I think there is a good understanding from the side of computer centers that they have to link better with biology," said Valencia. "But this takes time."
Machine learning is also of great importance to developers of the new pre-exascale supercomputers. Valencia noted that he is an editor of the journal Bioinformatics and that about 70 percent of the papers reviewed these days feature machine learning tools.
"Artificial intelligence is impacting everything, from social science to economics, as well as the work of physicists," said Valencia. "AI is very obvious and is very serious in every area of computational biology, as well. It offers us the possibility to train systems with the previous history of a patient, given conditions, [and] genomic signatures, and calculate the risk of a heart attack, for example. It is very interesting from a medical point of view, but also a biological point of view, to understand the mechanistic reasons. As scientists we are interested in the process, in the model, [and] the reasons behind the predictions."
The capacity of MareNostrum 5, he said, could be harnessed to build new machine learning tools. However, it will take time to do so, as it takes significant investments to create AI, he said. The payoff will be later, when those same tools are used in diagnostics, Valencia said.
Like Nyrönen in Finland, Valencia said that ELIXIR will take advantage of the boosted capacity offered by MareNostrum 5 to renew its focus on interoperability between large datasets and core data resources, and on linking databases. Data standardization, as well as improved quality assessment of datasets, will also be at the fore of ELIXIR's efforts, he said.
Altogether, he said the genomics community might see the first impact from the new supercomputers as early as next year.
"Seeing the benefit of these computers will take some time, as the data has to be stored, and we have to find a way of interacting, running projects, and developing working models," Valencia said, adding that the first publications to describe the benefits might appear sometime in 2022.
The real change will be in the way that researchers pose questions, Valencia underscored. Given new computational power, they can rethink how they design studies and what their goals will be.
"The more obvious impact is in the way of thinking," said Valencia. "Now one can start thinking of projects that will require different kinds of computation."
Leonardo is the third pre-exascale computer expected to become available within the next year. Cineca, Italy's largest computing center, will host Leonardo at its headquarters in Bologna. The supercomputer will be based on Atos' BullSequana XH2000 technology. Once operational, Leonardo is anticipated to deliver more than 248 petaflops, or 248 million billion calculations per second. The project has a total budget of €120 million.
It is unclear what role Leonardo could play in genomics. Cineca did not respond to an email seeking comment. In a statement announcing the project, Roberto Viola, director general of the European Commission's directorate general for communications networks, content, and technology, predicted that Leonardo would "combine the best of artificial intelligence and HPC technologies."