NEW YORK (GenomeWeb) – Supercomputing vendor Cray is offering multiple high-performance computing options that it believes are a good fit for the life sciences, and it is working to get some of those systems installed in data centers, particularly within the pharmaceutical industry but also at large sequencing centers, a company official told GenomeWeb recently.
In addition to its supercomputing and storage platforms, the company has newer data analytics options that it believes will be particularly useful for life science customers.
According to a summary sheet provided by the company, the Cray computing solutions that are best suited for the life sciences domain include the Cray XC series supercomputer, the Cray XC40 and XC40-AC (air cooled) supercomputing systems, and the Cray CS-storm system.
More specifically, the Cray XC series supercomputer offers scalability and performance, making it suitable for message passing and data movement tasks, according to the company. Meanwhile, the XC40 system is designed to handle large-scale computations that typically require tens of thousands of cores in less time, while the XC40-AC couples high-performance computing with economical packaging, networking, cooling, and power options. Finally, the CS-storm system is a high-density accelerator platform that features up to eight NVIDIA Tesla GPU accelerators and provides a peak performance of more than 11 teraflops per node.
The company's list of storage and data management solutions includes the Tiered Adaptive Storage system, which provides tools for moving less frequently used data to lower-cost storage tiers as well as for sustaining long-term data archives; the Sonexion scale-out Lustre storage system, which provides large quantities of storage and scales I/O performance incrementally from 5 GB/s to 1 TB/s in a single file system; and Cluster Connect, a Lustre storage solution for x86 Linux clusters.
In terms of data analytics, the company has two offerings — the Urika-GD graph discovery appliance and Urika-XA extreme analytics platform — that it believes will be particularly appealing for the life sciences, Ted Slater, Cray's senior solutions architect, told GenomeWeb.
The Urika-GD graph discovery appliance offers tools to identify relationships and patterns in data as well as to perform real-time analytics on complex graphs. The appliance provides graph-optimized hardware, shared memory, multithreaded processors, and scalable I/O. It includes a semantic database that provides an RDF triplestore and a standard SPARQL query engine. The system supports pattern-based searches and inferencing, and it easily incorporates new data into existing graph structures.
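To give a sense of what a pattern-based search over an RDF triplestore looks like, the sketch below uses the open-source rdflib Python library and a handful of invented protein-interaction triples; the vocabulary and data are illustrative assumptions, not anything drawn from Cray's software.

```python
# Illustrative only: a tiny RDF graph queried with SPARQL via the open-source
# rdflib library. The vocabulary (ex:interactsWith, ex:Protein) is invented
# for this sketch and is not part of any Cray product.
from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.org/bio#")

g = Graph()
g.bind("ex", EX)

# A few hypothetical protein-protein interaction triples.
g.add((EX.ProteinA, RDF.type, EX.Protein))
g.add((EX.ProteinB, RDF.type, EX.Protein))
g.add((EX.ProteinC, RDF.type, EX.Protein))
g.add((EX.ProteinA, EX.interactsWith, EX.ProteinB))
g.add((EX.ProteinB, EX.interactsWith, EX.ProteinC))

# A pattern-based search: find two-hop interaction paths a -> b -> c.
query = """
PREFIX ex: <http://example.org/bio#>
SELECT ?a ?b ?c WHERE {
    ?a ex:interactsWith ?b .
    ?b ex:interactsWith ?c .
}
"""
for row in g.query(query):
    print(row.a, "->", row.b, "->", row.c)
```

The appeal of this style of query is that the pattern itself is the program: the engine matches the graph structure rather than requiring hand-written traversal code.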
Urika-GD is essentially the same system that YarcData, a Cray spinoff, sought to commercialize a few years ago. In 2013, YarcData — Yarc is Cray spelled backwards — was marketing the uRiKA graph analytics appliance, a much smaller, cheaper, and more user-friendly alternative to traditional supercomputers, which have historically been Cray's forte. YarcData also operated under a different business model than its parent, offering customers the option to purchase their systems outright or pay for annual or multi-year subscriptions that would provide limited on-premises access to the appliance. Cray has since reabsorbed the company.
Cray developed Urika in conjunction with the US government after the events of 9/11. By analyzing networks of interactions between individuals, institutions, and organizations and hunting for patterns, the government hoped to forestall future terrorist attacks. Network structures are part and parcel of biology and other domains such as finance and sports. In biology, analyzing interactions and connections within molecular networks can help researchers identify predictive biomarkers or create more efficacious treatments, for instance. Urika-GD, with its pattern matching capabilities, is well suited for that task, according to Slater.
It's designed to overcome some of the problems that show up when researchers try to run computations on large interconnected graphs, he said. Conventional architectures enlist many processors to run computations on graphs, but each processor has its own memory that holds only a piece of the larger graph, and when these processors do communicate with one another, they usually do so over comparatively slow networks, he explained. That's problematic because graphs are large, highly interconnected structures, and analyzing them piecemeal forces frequent hops between processors.
"The difficulty arises when you are trying to find a pattern in the graph and ... are therefore traversing that graph in the computer and you come to the edge of the graph for that particular processor and all of a sudden, you have to go to another processor," he said. "That is where your computation slows way down ... and that happens all the time in graphs [and] there's really no way to predict when it will happen." And it's a problem that will only continue to grow as graphs swell in size as new data and metadata are added to the system.
Urika-GD's solution to the problem is to store the entire graph in a single large memory, Slater said. The smallest Urika-GD machine comes equipped with a healthy two terabytes of RAM and the largest system comes with 512 terabytes. What that means is that no matter how big the graph gets, the entire structure can reside in a single global shared memory space during computation, with all parts available to all the processors at any time.
Furthermore, each processor in the system is equipped with 128 hardware threads, each of which can handle computations on the graph. If a single thread stops working for some reason, there are 127 other threads on the same processor still working on the graph, Slater said. For context, the lower-end Urika-GD machine comes with 64 such processors installed. The higher-end appliance can have more than 8,000 graph accelerator processors, each of which contains 128 independent threads.
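As a conceptual sketch of that shared-memory, many-threads model (and not a description of Cray's actual implementation), the snippet below lets several Python worker threads traverse the same hypothetical in-memory adjacency list used above, with no partitioning and no network hops.

```python
# Conceptual sketch (not Cray's implementation): with the whole graph held in
# one shared memory space, many worker threads can each start a traversal from
# a different node without ever leaving local memory.
from concurrent.futures import ThreadPoolExecutor

# The same hypothetical adjacency list as above, now visible to every thread.
graph = {
    "a": ["b", "e"],
    "b": ["c"],
    "c": ["f"],
    "e": ["f"],
    "f": ["b"],
}

def reachable_from(start):
    """Depth-first search over the shared in-memory graph."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return start, len(seen)

# Many threads, one graph: each thread reads the shared structure directly.
with ThreadPoolExecutor(max_workers=4) as pool:
    for start, count in pool.map(reachable_from, graph):
        print(f"{count} nodes reachable from {start}")
```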
A second and more recent addition to Cray's data analytics portfolio is the Urika-XA extreme analytics platform. Launched last December, it's a turnkey system that is optimized to handle both compute-heavy and memory-centric analytics operations, according to Cray.
Although it relies on the same technology, Urika-XA is a very different system from Urika-GD, according to Slater. It's a rack of computing power for running highly parallelized compute jobs quickly that includes plenty of RAM for in-memory computation and boasts an "interesting" memory hierarchy with lots of solid-state drive storage, he said. Because it's a single rack, it doesn't take up much space and does not require as much power to run. The solution comes pre-installed with Apache Hadoop and Apache Spark — making it a particularly good fit for researchers already using these tools to process their data — and it packs 48 compute nodes, more than 1,500 cores, and 6 terabytes of RAM into a 42U rack.
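For researchers already working with Spark, the kind of job such a pre-installed stack would run might look like the short PySpark sketch below, which counts aligned sequence reads per chromosome; the input path and column layout are invented for illustration and are not tied to any Cray configuration.

```python
# Hypothetical PySpark job of the sort a pre-installed Hadoop/Spark stack
# would run; the input path and column layout are assumptions for this sketch.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reads-per-chromosome").getOrCreate()

# Assume a tab-delimited file whose three columns are chromosome, position,
# and read sequence.
reads = (spark.read.csv("hdfs:///data/aligned_reads.tsv", sep="\t")
              .toDF("chrom", "pos", "sequence"))

# Group and count in parallel across the rack's cores.
reads.groupBy("chrom").count().orderBy("chrom").show()

spark.stop()
```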
Cray is targeting these systems primarily at pharma companies but it's also interested in placing them within large healthcare organizations and sequencing centers — "basically any place where an organization has lots of data to compute over and they need to do it quickly," Slater said. There's plenty of opportunity for high-performance computing in these areas, and "Cray has lots of experience with that."
The company does have a long history in the space and its systems rank well on Top500 lists. In the most recent ranking, released in November 2014, three of its systems placed in the top ten, with Titan, a Cray system housed at Oak Ridge National Laboratory, coming in as the second fastest system in the world.
"The thing that you are afforded by using hardware that's up to the task is [that] a lot of the computation that was not done in real time, you can do in real time with a big machine," Slater noted. So a job that would require three days to run on standard compute can be computed in an hour or less on some of these systems.
Moreover, Cray helps its customers "future-proof" their IT infrastructure and readily modifies its systems to suit clients' analytics needs as they evolve, Slater added. That's especially pertinent for bioinformatics, where new software applications rapidly replace older ones. Hardware refresh cycles, on the other hand, are much longer, on the order of two to three years.
It's hard for organizations to predict what their data and data processing needs might be in the near future. "It's quite risky to buy commodity clusters or whatever hardware you are using knowing full well that next quarter your data needs or processing needs might be very different," Slater said.
Simply buying cheap servers and plugging them into a commodity cluster to increase scalability results in "cluster sprawl" and "you have to worry about your networking and how those things are connected together," he added, noting that with a system like Urika-XA "you are really optimizing the amount of power you have in one rack."
Cray's solutions are also "pretty beefy as they are and can scale," Slater added. "So even if your refresh rate is longer, if you need to add something to your system, it's always easy to do with Cray" so IT departments can keep running with no gaps in production.
Budgetary constraints are a concern for some customers, and the cost of installing and running these systems, including electricity and cooling as well as space requirements for some of the larger machines, is not trivial. Cloud infrastructure, which has grown in popularity within the life sciences, provides cheaper access to virtually unlimited compute.
Cray's systems are often compared to cloud infrastructure, Slater said, but there are benefits to both, and ultimately the decision about which to use depends on the end user's specific needs. Cost concerns aside, it depends on "how much data you have and whether you are [comfortable with] packing data into a bunch of hard-drives and shipping it off or you want to keep it in house," he said. Other considerations are the overhead incurred each time cloud resources are provisioned and the time it takes to upload large quantities of data to the cloud. There's also resource utilization: as the number of cycles or percent utilization increases over time, an on-premises system might make more sense, Slater said.
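One way to make that utilization argument concrete is a simple break-even calculation. The figures in the sketch below are purely hypothetical placeholders, not Cray or cloud-provider pricing; the point is only that the crossover shifts toward on-premises hardware as sustained utilization rises.

```python
# Back-of-the-envelope break-even sketch for the utilization argument above.
# Every figure here is a hypothetical placeholder, not real Cray or cloud pricing.
CLOUD_COST_PER_CORE_HOUR = 0.05      # hypothetical on-demand rate, USD
ONPREM_SYSTEM_COST = 500_000.0       # hypothetical purchase price, USD
ONPREM_ANNUAL_OPEX = 50_000.0        # hypothetical power/cooling/space, USD per year
CORES = 1_500                        # roughly one rack's worth of cores
YEARS = 3                            # assumed hardware refresh cycle

for utilization in (0.10, 0.30, 0.60, 0.90):
    core_hours = CORES * 24 * 365 * YEARS * utilization
    cloud = core_hours * CLOUD_COST_PER_CORE_HOUR
    onprem = ONPREM_SYSTEM_COST + ONPREM_ANNUAL_OPEX * YEARS
    cheaper = "on-premises" if onprem < cloud else "cloud"
    print(f"{utilization:.0%} utilization: cloud ${cloud:,.0f} vs on-prem ${onprem:,.0f} -> {cheaper}")
```

With these invented numbers, low utilization favors renting cloud capacity while sustained heavy use favors owning the machine, which is the trade-off Slater describes.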