Garage Genomics helps fix a bug in a large population genomics database (gnomAD)

Mehis Pold

Sr. Scientist at QIAGEN, Clinical Informatics

Published: June 11, 2018

By Mehis Pold, MD, June 11, 2018

DNA tests that interrogate large numbers of genes are complex, and their interpretation relies heavily on large population genomics databases and custom-built software. Their accuracy therefore depends on wet-lab quality as well as on the databases and software employed. This article illustrates how software can affect the content of large population genomics databases, even when the databases are produced by the same group at the Broad Institute. Specifically, the production software of ExAC and gnomAD uses different code to handle the population-specific maximum allele frequency values (POPMAX). Consequently, the current versions of these two closely related databases contain a systematic difference: only gnomAD includes the Finnish population (FIN) in calculating POPMAX.

My communications with Dr. Daniel MacArthur helped explain the ExAC and gnomAD POPMAX discrepancy: the inclusion of the Finnish population in gnomAD POPMAX resulted from a software bug. Like ExAC, the gnomAD POPMAX calculation was supposed to exclude Finns as a population, because their bottlenecked history makes some pathogenic alleles unusually common. In brief, if you use both ExAC and gnomAD in research or clinical data interpretation, keep in mind that gnomAD has a POPMAX bug: its calculation does not account for the bottlenecked history of Finns.
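The difference between the two behaviors can be sketched in a few lines of Python. This is only an illustration, not the ExAC or gnomAD production code: the function name `popmax`, its `exclude` parameter, and the allele frequencies are all made up for the example.

```python
from typing import Dict, Iterable, Optional, Tuple


def popmax(pop_af: Dict[str, float],
           exclude: Iterable[str] = ()) -> Optional[Tuple[str, float]]:
    """Return (population, AF) for the highest AF, skipping excluded populations."""
    excluded = set(exclude)
    candidates = {pop: af for pop, af in pop_af.items() if pop not in excluded}
    if not candidates:
        return None
    best = max(candidates, key=candidates.get)
    return best, candidates[best]


# Hypothetical allele frequencies for a single variant (not real data).
example_af = {"AFR": 0.0010, "AMR": 0.0004, "EAS": 0.0,
              "FIN": 0.0150, "NFE": 0.0020, "SAS": 0.0003}

with_fin = popmax(example_af)                      # FIN competes for POPMAX
without_fin = popmax(example_af, exclude={"FIN"})  # FIN is left out, as intended
print(with_fin, without_fin)
```

With these made-up numbers, the first call picks FIN and the second picks NFE, which is the kind of shift the Finnish exclusion is meant to produce.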

What would be the impact of the above observation, and the wider story?

The immediate impact of my observation and subsequent communications with Dr. MacArthur is that the POPMAX-values of gnomAD are being recalculated (communication from Dr. MacArthur). I am looking forward to the new version of gnomAD.

The wider framework for my observation is that an independent pair of eyes is useful. Genomics is still a very young field, which operates without formal standards and software that would enable comprehensive quality control. Hence, anyone who is willing to put the existing data through a challenge is highly welcome in the field.

Genome-based tracing of one’s ancestry has become very popular in recent years. It is not uncommon, however, for ancestry results to be confusing or to differ from what was expected. Additionally, different consumer genomics companies can provide very different ancestry results. Usually, the confusing results are attributed either to our limited knowledge of biology or to the computer algorithms that quantify one’s genome as being X, Y and Z% from this or that geographic region of the world. As the current POPMAX example illustrates, your ancestry results can also be a function of errors that slip by the code writers; such errors can modify your family history.

Large population genomics databases are invaluable in genomics including clinical genomics. They are, however, only as good as the source data and the software that produces them. If you find something doubtful in your genomics data – ancestry, clinical report etc. – seek out an independent analysis because you can learn something novel that helps explain your concerns.

Currently, bug-free genomics software does not exist, because of the complexity of both biology and the code that interprets it. I design and build genome annotation algorithms and software. Periodically, I discover impactful bugs in popular genomics software and databases (1, 2, 3) that can lead to substantial errors in the interpretation of clinical genomics data. I urge everyone else to do the same, because a collective, shared effort is the fastest way to minimize genomics errors.

Discovery of the gnomAD POPMAX-bug, step-by-step:

While integrating ExAC into the back end of my genome annotation pipeline (Garage Genomics), I noticed that the populations with maximum allele frequencies computed by Garage Genomics (GG) do not always match the POPMAX value in the ExAC source file (ExAC.r1.sites.vep.vcf). Quantitatively, the POPMAX discrepancy is 3.4%. Curiously, the discrepancy is associated with the Finnish population (FIN) in 99.74% of the discrepant cases (Table 1). Clearly, the data in Table 1 are non-random.

Table 1. POPMAX discordance between the ExAC source file (ExAC.r1.sites.vep.vcf) and the results computed by GG. Each variant in ExAC produces either a single or multiple POPMAX values. Multiple POPMAX values are assigned to an ExAC variant if more than one ExAC population (see Table 2) produces the maximum population-specific AF. GG produced a total of 308,181 POPMAX values discordant with those in ExAC (3.4%). In 99.74% of cases, the observed POPMAX discordance is associated with Finnish allele frequencies.
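The comparison summarized in Table 1 can be approximated by the following Python sketch. It is not the GG code. It assumes that each record pairs the POPMAX label stored in the file with per-population allele counts (AC) and allele numbers (AN), that populations with zero AF are ignored, and that ties are reported as a set. The function names and the toy records are hypothetical.

```python
from fractions import Fraction
from typing import Dict, FrozenSet, Iterable, Tuple

# population code -> (allele count, allele number)
Counts = Dict[str, Tuple[int, int]]


def popmax_populations(counts: Counts) -> FrozenSet[str]:
    """All populations sharing the highest non-zero AF (a tie yields several)."""
    afs = {pop: Fraction(ac, an)
           for pop, (ac, an) in counts.items() if an > 0 and ac > 0}
    if not afs:
        return frozenset()
    top = max(afs.values())
    return frozenset(pop for pop, af in afs.items() if af == top)


def tally_discordance(records: Iterable[Tuple[str, Counts]]) -> Tuple[int, int, int]:
    """Return (total, discordant, discordant records where FIN holds the maximum)."""
    total = discordant = fin_related = 0
    for recorded, counts in records:
        computed = popmax_populations(counts)
        total += 1
        if recorded not in computed:
            discordant += 1
            if "FIN" in computed:
                fin_related += 1
    return total, discordant, fin_related


# Toy records (not real data): recorded POPMAX label plus per-population counts.
toy_records = [
    ("NFE", {"AFR": (1, 1000), "FIN": (5, 500), "NFE": (4, 2000)}),  # FIN is highest
    ("AFR", {"AFR": (3, 1000), "FIN": (0, 500), "NFE": (2, 2000)}),  # concordant
    ("EAS", {"EAS": (2, 1000), "SAS": (4, 2000), "FIN": (0, 500)}),  # tie, concordant
]
print(tally_discordance(toy_records))
```

Using `Fraction` keeps the tie test exact, which matters when two populations have the same AF from different AC and AN values.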

To determine the cause of the discrepancy between GG and ExAC, I analyzed a subset of the discrepant data in more detail. A very straightforward pattern emerged: the software that produced the ExAC source file (ExAC.r1.sites.vep.vcf) had assigned the POPMAX value to the population with the second-highest AF instead of the maximum AF (data not shown).
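One way to check for this pattern is to rank the populations by AF and see where the recorded POPMAX label falls. The Python sketch below is a hypothetical illustration of that check, not the analysis actually performed; the function names and frequencies are invented, and ties are not treated specially.

```python
from typing import Dict, List


def rank_populations(pop_af: Dict[str, float]) -> List[str]:
    """Population codes ordered from highest to lowest AF."""
    return sorted(pop_af, key=pop_af.get, reverse=True)


def classify(pop_af: Dict[str, float], recorded: str) -> str:
    """Describe where the recorded POPMAX label sits in the AF ranking."""
    ranked = rank_populations(pop_af)
    if not ranked:
        return "no data"
    if ranked[0] == recorded:
        return "matches maximum"
    if len(ranked) >= 2 and ranked[1] == recorded:
        return f"runner-up behind {ranked[0]}"
    return "other"


# Hypothetical frequencies: FIN has the highest AF, but NFE is the recorded POPMAX.
example_af = {"AFR": 0.0010, "FIN": 0.0150, "NFE": 0.0020}
print(classify(example_af, "NFE"))
```

If a file excludes one population from POPMAX, the discrepant records would mostly fall into the runner-up class with the excluded population in first place.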

The next step for me was to analyze the gnomAD POPMAX as well because ExAC is the predecessor of gnomAD. Curiously, the gnomAD POPMAX did not produce the same discrepancy as ExAC. So, my initial thought was that there is a systemic error in ExAC. With that in mind, I contacted Dr. MacArthur. It turned out, however, that ExAC is correct and gnomAD is erroneous. Currently, the gnomAD POPMAX-values are being recalculated (see above).

The Perl scripts that I used to compute the ExAC and gnomAD POPMAX are available on GitHub.

· ExAC: https://github.com/mpold/POPMAX/blob/master/ExAC

· gnomAD: https://github.com/mpold/POPMAX/blob/master/gnomAD
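For readers who prefer Python to Perl, the sketch below shows one way to pull per-population counts out of a VCF INFO column before computing POPMAX. It is not a translation of the scripts linked above. The key layout (`AC_<POP>` with one value per alternate allele, `AN_<POP>` with a single value) and the population codes are assumptions about the file, so check the VCF header before relying on them. The INFO string is invented.

```python
from typing import Dict, Sequence, Tuple

# Assumed population codes; verify against the header of the file in use.
POPULATIONS = ("AFR", "AMR", "EAS", "FIN", "NFE", "OTH", "SAS")


def parse_info(info: str) -> Dict[str, str]:
    """Split a VCF INFO column into a key -> value mapping (flags map to '')."""
    fields: Dict[str, str] = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else ""
    return fields


def population_counts(info: str, allele_index: int = 0,
                      populations: Sequence[str] = POPULATIONS
                      ) -> Dict[str, Tuple[int, int]]:
    """Return population -> (AC, AN) for one alternate allele of a record."""
    fields = parse_info(info)
    counts: Dict[str, Tuple[int, int]] = {}
    for pop in populations:
        ac_key, an_key = f"AC_{pop}", f"AN_{pop}"
        if ac_key in fields and an_key in fields:
            ac = int(fields[ac_key].split(",")[allele_index])
            an = int(fields[an_key])
            counts[pop] = (ac, an)
    return counts


# Invented INFO string for a site with two alternate alleles.
info = "AC_AFR=1,0;AN_AFR=1000;AC_FIN=5,2;AN_FIN=500;AC_NFE=4,1;AN_NFE=2000;DB"
print(population_counts(info))
```

Populations missing from the record are skipped rather than treated as zero, so the caller can decide how to handle incomplete data.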

Acknowledgement: My gratitude goes to Dr. Daniel MacArthur for reviewing the observation described in this article.

Christian Neckelmann

Clinical Science Liaison at Fulgent Genetics

6 years ago

Any idea when they will release an updated version of the dataset?

Mehis Pold

Sr. Scientist at QIAGEN, Clinical Informatics

6 years ago

No clue. I fixed the bug by myself. If the consistent version of gnomAD is of your interest then perhaps we can arrange a file transfer. I would not even be surprised if they are working on something larger than gnomAD and won't even bother to fix it.

