Interactive Analytics for Very Large Scale Genomic Data
Gregory McInnes, Stanford University

Stanford University, the Epidemiological Research and Information Center (ERIC) for Genomics at VA Palo Alto, and Google Genomics have collaborated to show how a low-cost database, BigQuery, can be used for very large scale variant analytics.

Our manuscript on the pre-print server shows the end-to-end workflow for variant mining. As a pedagogic tool, we show how to run variant QC, but the data model also supports typical biological queries. Most notably, we show scaling and cost effectiveness. Most queries take a few seconds, as opposed to an hour or two on a server or cluster, which makes data exploration interactive rather than batch mode. This interactivity allows a flexibility in hypothesis development and testing that batch mode cannot achieve.
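To give a flavor of what a variant QC query of this kind can look like, here is a minimal sketch that computes a per-sample transition/transversion (Ti/Tv) ratio with a single SQL aggregate. It is not the query from the manuscript: the `variants` table, its column names, and the toy rows are hypothetical, and an in-memory SQLite database stands in for BigQuery only so that the example runs anywhere.

```python
import sqlite3

# Hypothetical flattened variant table: one row per (sample, variant call).
# The schema and rows are illustrative, not those used in the manuscript.
ROWS = [
    ("s1", "chr1", 100, "A", "G"),  # transition
    ("s1", "chr1", 200, "C", "T"),  # transition
    ("s1", "chr1", 300, "A", "C"),  # transversion
    ("s2", "chr1", 100, "A", "G"),  # transition
    ("s2", "chr2", 500, "G", "T"),  # transversion
]

# Count transitions (A<->G, C<->T) and transversions per sample, SNVs only.
TITV_SQL = """
SELECT sample_id,
       SUM(CASE WHEN (ref || alt) IN ('AG', 'GA', 'CT', 'TC') THEN 1 ELSE 0 END) AS transitions,
       SUM(CASE WHEN (ref || alt) IN ('AG', 'GA', 'CT', 'TC') THEN 0 ELSE 1 END) AS transversions
FROM variants
WHERE LENGTH(ref) = 1 AND LENGTH(alt) = 1
GROUP BY sample_id
ORDER BY sample_id
"""


def titv_by_sample(rows):
    """Return {sample_id: Ti/Tv ratio}, or None where a sample has no transversions."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variants "
        "(sample_id TEXT, chrom TEXT, pos INTEGER, ref TEXT, alt TEXT)"
    )
    conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?)", rows)
    result = {}
    for sample_id, ti, tv in conn.execute(TITV_SQL):
        result[sample_id] = ti / tv if tv else None
    conn.close()
    return result


if __name__ == "__main__":
    print(titv_by_sample(ROWS))
```

The point of the sketch is the shape of the query: a QC metric expressed as one aggregate over a flat variant table, which is the style of question that becomes interactive when the engine answers in seconds.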

At Stanford, our mission is to bring solutions to researchers, ours and the rest of the world's, that not only meet workflow requirements but are also easy to learn, easy to manage (no army of IT professionals needed), and cost effective (supportable with typical levels of NIH funding).

Our solution is accessible to anyone on Google Cloud, but the underlying data models and queries can be replicated using a Dremel-like columnar database (e.g. Apache Drill).

Please leave your comments on our methods on the pre-print server.

Madhavi Tikhe

Software Architect

9 years ago

This system really looks promising.

Quoclinh Nguyen

Bioinformatics & Data Science Professional

9 years ago

Nice system! How does the system perform on low allele frequency of somatic samples?

