Interactive Analytics for Very Large Scale Genomic Data
Somalee Datta
Specialization in petascale computing for health and biotech research applications; Broad experience in healthcare research, genomics, drug design, privacy and everything in between...
Stanford University, the Epidemiological Research and Information Center (ERIC) for Genomics at VA Palo Alto, and Google Genomics have collaborated to show how a low-cost database, BigQuery, can be used for very large scale variant analytics.
Our manuscript on the pre-print server shows the end-to-end workflow for variant mining. As a teaching example, we show how to run variant QC, but the data model also supports typical biological queries. Most notably, we show scaling and cost effectiveness. Most queries take a few seconds (as opposed to an hour or two on a server or cluster), which makes data exploration interactive rather than batch mode. Interactivity brings a flexibility to hypothesis development and testing that batch mode cannot offer.
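To give a feel for the kind of aggregate a variant QC step computes, here is a minimal Python sketch of one common QC metric, the transition/transversion (Ti/Tv) ratio. It is an illustration only: the choice of Ti/Tv, the function name `titv_ratio`, and the flat `(ref, alt)` record layout are assumptions made for this example, not the queries or the data model from the manuscript.

```python
from collections import Counter

# Purine<->purine (A/G) and pyrimidine<->pyrimidine (C/T) changes are transitions;
# every other single-base substitution is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def titv_ratio(variants):
    """Return the Ti/Tv ratio over single-nucleotide variants, or None if no transversions.

    `variants` is an iterable of (ref, alt) pairs. This layout is a
    hypothetical stand-in, not the schema used in the manuscript.
    """
    counts = Counter()
    for ref, alt in variants:
        if len(ref) != 1 or len(alt) != 1 or ref == alt:
            continue  # skip indels and non-variant rows
        kind = "ti" if (ref, alt) in TRANSITIONS else "tv"
        counts[kind] += 1
    if counts["tv"] == 0:
        return None  # ratio is undefined without transversions
    return counts["ti"] / counts["tv"]


# Toy input: four transitions, one transversion, one deletion (ignored).
sample = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("AT", "A"), ("T", "C")]
print(titv_ratio(sample))
```

In a columnar database the same count-and-divide logic would be expressed as a single aggregate query over the variant table, which is why this class of QC check fits interactive use.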
At Stanford, our mission is to bring researchers, both ours and those in the rest of the world, solutions that not only meet workflow requirements but are also easy to learn, easy to manage (no army of IT professionals needed), and cost-effective (supportable on a typical level of NIH funding).
Our solution is accessible to anyone on Google Cloud. The underlying data models and queries can also be replicated on a Dremel-style columnar query engine (e.g. Apache Drill).
Please leave your comments on our methods on the pre-print server.
Software Architect
9 years ago
This system really looks promising.
Bioinformatics & Data Science Professional
9 years ago
Nice system! How does the system perform on low allele frequency of somatic samples?