Garage Genomics helps fix a bug in a large population genomics database (gnomAD)

By Mehis Pold, MD, June 11, 2018

DNA tests that interrogate large numbers of genes are very complex, and their interpretation makes extensive use of large population genomics databases and custom-built software. Their accuracy therefore depends on wet-lab quality as well as on the databases and software employed. This article illustrates how software, written by the same group at the Broad Institute, affects the content of large population genomics databases. Specifically, the production software of ExAC and gnomAD uses different code to handle the population-specific maximum allele frequency values (POPMAX). Consequently, the current versions of these two very closely related databases contain a systematic difference: only gnomAD includes the Finnish population (FIN) in calculating POPMAX.

My communications with Dr. Daniel MacArthur helped explain the ExAC and gnomAD POPMAX discrepancy: the inclusion of the Finnish population in gnomAD POPMAX resulted from a software bug. Like ExAC, the POPMAX calculation of gnomAD was supposed to exclude Finns as a population, because their bottlenecked history results in some pathogenic alleles being unusually common. In brief, if you use both ExAC and gnomAD in your research or clinical data interpretation, keep in mind that gnomAD has a POPMAX bug that does not take into account the bottlenecked history of Finns.
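The effect of including or excluding one population in a POPMAX-style calculation can be illustrated with a minimal sketch. This is not the ExAC or gnomAD production code. The function name `popmax`, the population codes other than FIN, and the allele frequencies are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Set


def popmax(allele_freqs: Dict[str, float],
           excluded: Optional[Set[str]] = None) -> List[str]:
    """Return the population(s) with the highest non-zero allele frequency.

    Populations listed in `excluded` are ignored. A tie returns every
    population that shares the maximum.
    """
    excluded = excluded or set()
    eligible = {pop: af for pop, af in allele_freqs.items()
                if pop not in excluded and af > 0}
    if not eligible:
        return []
    top = max(eligible.values())
    return sorted(pop for pop, af in eligible.items() if af == top)


# Hypothetical allele frequencies for a single variant (illustration only).
afs = {"AFR": 0.001, "AMR": 0.0005, "EAS": 0.0,
       "FIN": 0.02, "NFE": 0.004, "SAS": 0.0002}

print(popmax(afs))                    # FIN is eligible
print(popmax(afs, excluded={"FIN"}))  # FIN is excluded
```

With these made-up numbers, the first call reports FIN and the second reports NFE, which is the kind of difference described above.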

What is the impact of this observation, and what is the wider story?

The immediate impact of my observation and subsequent communications with Dr. MacArthur is that the POPMAX-values of gnomAD are being recalculated (communication from Dr. MacArthur). I am looking forward to the new version of gnomAD.

The wider framework for my observation is that an independent pair of eyes is useful. Genomics is still a very young field, which operates without formal standards and software that would enable comprehensive quality control. Hence, anyone who is willing to put the existing data through a challenge is highly welcome in the field.

Genome-based tracing of one’s ancestry has become very popular in recent years. It is not uncommon, however, for ancestry results to be confusing or to differ from what was expected. Additionally, different consumer genomics companies can provide very different ancestry results. Usually, the confusing results are attributed either to our limited knowledge of biology or to the computer algorithms that quantify one’s genome as being X, Y and Z% from this or that geographic region of the world. As the current POPMAX example shows, your ancestry can also be a function of errors that slip by the code writers: they can modify your family history.

Large population genomics databases are invaluable in genomics, including clinical genomics. They are, however, only as good as the source data and the software that produces them. If you find something doubtful in your genomics data (ancestry, clinical report, etc.), seek out an independent analysis, because you may learn something new that helps explain your concerns.

Currently, bug-free genomics software does not exist because of the complexity of biology as well as of the code that interprets it. I design and build genome annotation algorithms and software. Periodically, I discover impactful bugs in popular genomics software and databases (1, 2, 3) that can lead to substantial interpretation errors in clinical genomics data. I urge everyone else to do the same, because a collective, shared effort is the fastest way to minimize genomics errors.

Discovery of the gnomAD POPMAX-bug, step-by-step:

While integrating ExAC into the back end of my genome annotation pipeline (Garage Genomics), I noticed that the populations with maximum allele frequencies that Garage Genomics (GG) computes do not always match the POPMAX value in the ExAC source file (ExAC.r1.sites.vep.vcf). Quantitatively, the POPMAX discrepancy is 3.4%. Curiously, the POPMAX discrepancy is associated with the Finnish population (FIN) in 99.74% of discrepant cases (Table 1). Clearly, the data in Table 1 are non-random.

Table 1. POPMAX discordance between the ExAC source file (ExAC.r1.sites.vep.vcf) and the results computed by GG. Each variant in ExAC produces either a single POPMAX value or multiple POPMAX values. Multiple POPMAX values are assigned to an ExAC variant if more than one ExAC population (see Table 2) produces the maximum population-specific AF. GG produced a total of 308,181 POPMAX values discordant with those in ExAC (3.4%). In 99.74% of cases, the observed POPMAX discordance is associated with Finnish allele frequencies.
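The comparison described above, recomputing the highest-AF population for a record and checking it against the recorded POPMAX, might look roughly like the sketch below. It is not the GG pipeline or the Perl scripts. It assumes that per-population counts appear in the INFO column as `AC_<POP>` and `AN_<POP>` keys and that each record has a single alternate allele. The helper names, the population list, and the example INFO string are hypothetical.

```python
from typing import Dict, Iterable, Union


def parse_info(info: str) -> Dict[str, Union[str, bool]]:
    """Split a VCF INFO string into a key -> value mapping (flags map to True)."""
    fields: Dict[str, Union[str, bool]] = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def population_afs(fields: Dict[str, Union[str, bool]],
                   populations: Iterable[str]) -> Dict[str, float]:
    """Compute AF = AC / AN per population; a missing or zero AN gives 0.0.

    Assumes one alternate allele per record, so AC values are plain integers.
    """
    afs = {}
    for pop in populations:
        ac = int(fields.get(f"AC_{pop}", 0))
        an = int(fields.get(f"AN_{pop}", 0))
        afs[pop] = ac / an if an else 0.0
    return afs


# Illustrative population codes and a made-up INFO string.
POPULATIONS = ("AFR", "AMR", "EAS", "FIN", "NFE", "SAS")
info = "AC_AFR=1;AN_AFR=1000;AC_FIN=20;AN_FIN=1000;AC_NFE=4;AN_NFE=1000;POPMAX=NFE"

fields = parse_info(info)
afs = population_afs(fields, POPULATIONS)
computed = max(afs, key=afs.get)   # population with the highest AF
recorded = fields["POPMAX"]
print(computed, recorded, "concordant" if computed == recorded else "discordant")
```

In this made-up record the highest AF belongs to FIN while the recorded value is NFE, so the record would be counted as discordant.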

To determine the cause of the discrepancy between GG and ExAC, I analyzed a subset of the discrepant data in more detail. A very straightforward pattern emerged: the software that produced the ExAC source file (ExAC.r1.sites.vep.vcf) had assigned the POPMAX value to the population with the second-highest AF instead of the maximum AF (data not shown).
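One way to look for the pattern described above is to rank the populations by AF for each discordant record and check whether the recorded POPMAX matches the runner-up. The following is a rough sketch under stated assumptions, not the analysis actually performed. The function names, labels, and records are hypothetical, and ties between populations are ignored for simplicity.

```python
from collections import Counter
from typing import Dict, List


def rank_populations(afs: Dict[str, float]) -> List[str]:
    """Populations with non-zero AF, ordered from highest to lowest AF."""
    observed = [pop for pop, af in afs.items() if af > 0]
    return sorted(observed, key=lambda pop: afs[pop], reverse=True)


def classify(afs: Dict[str, float], recorded: str) -> str:
    """Label a record by how its recorded POPMAX relates to the AF ranking."""
    ranked = rank_populations(afs)
    if not ranked:
        return "no_data"
    if recorded == ranked[0]:
        return "concordant"
    if len(ranked) > 1 and recorded == ranked[1]:
        # The recorded value is the second-highest population.
        return f"runner_up_recorded(top={ranked[0]})"
    return "other_discordance"


# Made-up (allele frequencies, recorded POPMAX) pairs for illustration.
records = [
    ({"AFR": 0.001, "FIN": 0.02, "NFE": 0.004}, "NFE"),
    ({"AFR": 0.01, "FIN": 0.0, "NFE": 0.002}, "AFR"),
    ({"AFR": 0.003, "FIN": 0.001, "NFE": 0.0005}, "AFR"),
]

tally = Counter(classify(afs, recorded) for afs, recorded in records)
for label, count in tally.items():
    print(label, count)
```

A tally dominated by a single `runner_up_recorded(top=...)` label would point to one population being skipped in the recorded POPMAX.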

My next step was to analyze the gnomAD POPMAX as well, because ExAC is the predecessor of gnomAD. Curiously, the gnomAD POPMAX did not produce the same discrepancy as ExAC, so my initial thought was that there was a systematic error in ExAC. With that in mind, I contacted Dr. MacArthur. It turned out, however, that ExAC is correct and gnomAD is erroneous. Currently, the gnomAD POPMAX values are being recalculated (see above).

The Perl scripts that I used to compute the ExAC and gnomAD POPMAX are available on GitHub.

· ExAC: https://github.com/mpold/POPMAX/blob/master/ExAC

· gnomAD: https://github.com/mpold/POPMAX/blob/master/gnomAD

Acknowledgement: My gratitude goes to Dr. Daniel MacArthur for reviewing the observation described in this article.

Mehis Pold

Sr. Scientist at QIAGEN, Clinical Informatics

6 years ago

No clue. I fixed the bug myself. If the consistent version of gnomAD is of interest to you, then perhaps we can arrange a file transfer. I would not even be surprised if they are working on something larger than gnomAD and won't even bother to fix it.

Christian Neckelmann

Clinical Science Liaison at Fulgent Genetics

6 years ago

Any idea when they will release an updated version of the dataset?
