SANTA CRUZ -- Despite some successes, predicting cancer outcomes based on the molecular signatures in cancer cells remains a major challenge.
A new effort, funded by the National Cancer Institute and led by researchers at UC Santa Cruz, aims to clear several key roadblocks that have stymied progress in this field.
The $3.5 million project will use the latest in "big data" technology to bridge the gap between the petabytes of raw genomic data in centralized repositories like UCSC's Cancer Genomics Hub (CGHub) and the higher levels of interpretive information that can lead to clinically useful predictions, such as which drugs are most effective against tumors with certain mutations.
Project leader Joshua Stuart, an associate professor of biomolecular engineering at UCSC's Baskin School of Engineering, compares the raw genomic data to the binary code running on a computer.
"Your web browser doesn't understand zeros and ones. There are layers and layers of software programs between that and what you see on a web page. We need to do the same thing for DNA sequences to reach the higher levels of interpretation needed for scientific discovery," Stuart said.
Stuart's group will build a separate database, called the Biomedical Evidence Graph (BMEG), for storing and analyzing interpretive information derived from the raw sequence data stored in CGHub. Like Facebook's social graph, the BMEG will use a graph database structure designed for lightning-fast access to complex, interconnected datasets.
"Our analyses can reveal connections between different tumor samples based on their molecular profiles, and the natural way to represent that in a database is with the graph structures used for Facebook and other social networks," Stuart said.
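As a rough illustration of the idea Stuart describes, the Python sketch below links tumor samples into a graph whenever their molecular profiles overlap strongly. It is not the BMEG's actual schema or software: the sample names, gene sets, the use of Jaccard similarity, and the 0.5 threshold are all assumptions made for this example.

```python
from itertools import combinations

# Hypothetical sample IDs and mutated-gene sets; illustrative only, not real data.
profiles = {
    "sample_A": {"TP53", "KRAS", "PIK3CA"},
    "sample_B": {"TP53", "KRAS"},
    "sample_C": {"BRAF", "NRAS"},
    "sample_D": {"BRAF", "NRAS", "TP53"},
}


def jaccard(a, b):
    """Shared genes divided by all genes seen in either set (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_graph(profiles, threshold=0.5):
    """Return an adjacency map linking samples whose similarity meets the threshold.

    The threshold value is an assumption chosen for this example.
    """
    graph = {sample: {} for sample in profiles}
    for s1, s2 in combinations(sorted(profiles), 2):
        score = jaccard(profiles[s1], profiles[s2])
        if score >= threshold:
            # Store the edge in both directions so neighbors are one lookup away.
            graph[s1][s2] = score
            graph[s2][s1] = score
    return graph


graph = build_graph(profiles)
for sample, neighbors in graph.items():
    print(sample, "->", sorted(neighbors))
```

Storing each sample's neighbors directly is what makes questions like "which tumors resemble this one?" a single lookup rather than a scan of every record, which is the kind of access pattern a graph database is built for.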
A UCSC team led by bioinformatics expert and BMEG co-investigator David Haussler established CGHub in 2012 to manage data from The Cancer Genome Atlas (TCGA) consortium and other NIH cancer genomics research programs. Because CGHub holds genome sequences from thousands of individual patients, access is strictly controlled and limited to researchers approved by NIH. But the BMEG will hold higher-level data derived from analyses of the raw genome sequences and will not require the same level of security restrictions.
"The idea is to build a shared knowledge base and create a playground where lots of researchers can interact, test their algorithms, and compare results," Stuart said. "TCGA researchers have built a lot of great tools for data analysis, and we need to get those installed in the BMEG so the rest of the world can engage in that higher level analysis."
The BMEG complements a parallel project, called Medbook, which will link together patients, biopsy samples, doctors, and researchers into a social network framework.