Mining Data Mountains

NIH Director Francis Collins says research projects like one recently completed by Stanford University's Atul Butte are revealing the power to be found in the mountains of data that genomics, imaging, and electronic medical record technologies are generating.

Butte has mined those mountains in search of links between genes, diseases, and traits, like cholesterol levels, that could provide the basis for new markers for predicting disease risk, Collins writes in his Director's Blog.

Butte and his team searched through the VARIMED database, a resource of GWAS and clinical trait data that he started building six years ago, to identify all of the genetic variants that influence risks for a number of diseases and to create a list of disease-gene pairs.

His team found 801 genes that were reliably linked to 69 diseases, and they also made a list of 796 genes that were reliably linked to 85 traits. They then compared the two lists, searching for overlaps, and found that 120 diseases and traits are linked by the activity of just a few genes.
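
The list comparison described above can be pictured as a join on shared genes. The Python sketch below is only an illustration of that idea, not the method Butte's team used: the function name `shared_gene_pairs`, the pair-based input format, and every disease, trait, and gene label are made-up placeholders rather than VARIMED records.

```python
from collections import defaultdict


def shared_gene_pairs(disease_genes, trait_genes):
    """Return {(disease, trait): set of shared genes} for every overlapping pair.

    Both arguments are iterables of (name, gene) tuples. This input format is
    an assumption made for the sketch, not the VARIMED schema.
    """
    # Index the trait list by gene so each disease-gene pair can be looked up directly.
    traits_by_gene = defaultdict(set)
    for trait, gene in trait_genes:
        traits_by_gene[gene].add(trait)

    # A disease and a trait are linked when at least one gene appears in both lists.
    pairs = defaultdict(set)
    for disease, gene in disease_genes:
        for trait in traits_by_gene.get(gene, ()):
            pairs[(disease, trait)].add(gene)
    return dict(pairs)


# Hypothetical placeholder records, not data from the study.
disease_genes = [("disease_A", "GENE1"), ("disease_A", "GENE2"), ("disease_B", "GENE3")]
trait_genes = [("trait_X", "GENE1"), ("trait_Y", "GENE2"), ("trait_Y", "GENE4")]

links = shared_gene_pairs(disease_genes, trait_genes)
for (disease, trait), genes in sorted(links.items()):
    print(disease, trait, sorted(genes))
```

In this toy input, disease_A pairs with trait_X through GENE1 and with trait_Y through GENE2, while disease_B shares no gene with any trait and so produces no pair.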

Although many of these disease-trait associations were known, around 20 percent of them were novel.

Butte's group then wanted to know whether these traits could be used to predict whether an individual would develop a disease, and started examining a decade's worth of patient data in the electronic medical records of three major research hospitals.

The team was able to validate links between elevated magnesium levels and gastric cancer, high PSA levels and lung cancer, the average volume of red blood cells and the risk of developing acute lymphoblastic leukemia, phosphatase levels and blood clots, and low platelet counts and alcohol dependency.

"What I find most noteworthy about this work is not the specific findings, but how the researchers demonstrate the feasibility of mining vast troves of existing data—genetic, phenotypic, and clinical—to test new hypotheses," Collins writes.