How to Validate a Hypothesis Using Public Data

The Importance of Validating a Hypothesis Using Public Data

Every year, thousands of drug candidates enter clinical trials. Only about one in 5,000 drug candidates gets FDA approval, and it takes roughly 10 years and about three billion dollars to take a compound from bench to bedside. In the period from 2010 to 2019, the FDA approved, on average, 38 new drugs per year (with a peak of 59 in 2018), which is 60 percent more than the yearly average over the previous decade. This puts in perspective the number of drug candidates that go through clinical trials and fail. Even a one-day delay in getting a drug approved can cost a company $1.26 million, underscoring the need for accelerated processes.

What Does Target Identification Entail?

Identifying a novel therapeutic target is no trivial feat. It takes digging wide and deep into the literature, making the right links and connections, and, it goes without saying, a good understanding of disease pathology and molecular interactions, in other words, expert skills. Once a therapeutic target is identified, testing and validating it is another drawn-out, tedious, and resource-heavy process.

After a target is identified, there are two ways to go about testing and validating it: a) run your own experiments to establish the findings, or b) go through existing publications to see whether it has been studied before and whether the evidence supports your hypothesis. Meta-analysis of published studies is usually the go-to strategy for gathering the information needed to formulate a hypothesis. The figure below shows a typical process for identifying the studies that could be useful for target identification: the authors search for a broad term, such as a disease name, apply inclusion and exclusion criteria, and exclude the irrelevant papers and studies.

The number of publications retrieved by the initial search can be substantial, but the number that survives screening usually is not. In this meta-analysis, for example, the authors found more than 67,000 potentially relevant publications, yet only 75 met all of their criteria. The process is time- and resource-heavy, and it yields few usable results because the data reported in the papers is rarely clean or consistently structured.

Figure: Flow diagram of studies identified in the systematic review and meta-analysis.
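
To make the screening step concrete, here is a minimal sketch of a programmatic first pass, assuming the public NCBI E-utilities API and the Python requests package. The query string and the keyword-based inclusion/exclusion lists are illustrative placeholders, not the criteria used in the meta-analysis discussed above.

```python
import time
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def search_pubmed(term, retmax=20):
    """Return PubMed IDs matching a broad, disease-level query."""
    resp = requests.get(
        f"{EUTILS}/esearch.fcgi",
        params={"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["esearchresult"]["idlist"]

def fetch_record(pmid):
    """Fetch the plain-text record (title, authors, abstract) for one PubMed ID."""
    resp = requests.get(
        f"{EUTILS}/efetch.fcgi",
        params={"db": "pubmed", "id": pmid, "rettype": "abstract", "retmode": "text"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.text

# Broad search, then crude keyword-based screening -- a stand-in for the
# manual inclusion/exclusion step shown in the flow diagram.
pmids = search_pubmed("pancreatic cancer AND gene expression")
include = ("human", "tumor")          # placeholder inclusion keywords
exclude = ("review", "case report")   # placeholder exclusion keywords

kept = []
for pmid in pmids:
    text = fetch_record(pmid).lower()
    if all(k in text for k in include) and not any(k in text for k in exclude):
        kept.append(pmid)
    time.sleep(0.4)  # stay under NCBI's ~3 requests/second limit for keyless access

print(f"{len(pmids)} records retrieved, {len(kept)} passed keyword screening")
```

In a real systematic review the screening is done (or at least verified) by human reviewers; the point is only that the broad-search-then-filter funnel in the flow diagram maps naturally onto a script.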

Instead of reinventing the wheel, one of the better and faster ways to test a hypothesis is to utilize existing resources (GEO, TCGA, UK Biobank, etc., depending on the research question at hand): the vast treasure trove of experimental findings from the scientific community, public data. Here, researchers worldwide have already recorded the findings from their own experiments, from testing new compounds to molecular analyses and results from animal testing.
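
As a small illustration of how accessible this data already is, the following sketch pulls one public expression series from GEO using the third-party GEOparse package (pip install GEOparse). The accession GSE15471, a published pancreatic tumour versus matched normal expression series, is used purely as an example; any other GEO series accession that fits your research question works the same way.

```python
# A minimal sketch, assuming the third-party GEOparse package.
import GEOparse

# Download (and cache) the series SOFT file from GEO.
gse = GEOparse.get_GEO(geo="GSE15471", destdir="./geo_cache")

# Sample-level metadata (tissue, disease state, platform, ...) comes back
# as a pandas DataFrame, ready for filtering.
print(gse.phenotype_data.head())

# Expression values for the first sample in the series.
first_gsm = next(iter(gse.gsms.values()))
print(first_gsm.table.head())
```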

What’s the Catch?

These humongous volumes of data, though precious, are not being utilized to accelerate the drug discovery and target identification process to even a fraction of their potential.

There are several factors responsible for this: the data is heterogeneous, not annotated using standard guidelines, and recorded and stored in various syntaxes, schemas, and formats, making it challenging just to find what you're looking for.

The incompleteness of the available data makes it all the harder to search and to find what is relevant.

Additionally, suppose you want to establish a particular gene, say KRAS, as a potential therapeutic target based on the effect of its overexpression in pancreatic cancer. Confidence that you have scoured all the existing data can come only from going to each public repository, searching for every pancreatic cancer dataset that mentions KRAS upregulation, and finding them all. This sounds great as an idea, but the sheer volume of data out there, and its unstructured, unannotated nature, makes it an extremely daunting and challenging task.
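
A first pass at that cross-repository scan can at least be scripted. The sketch below, again assuming the public NCBI E-utilities API, only counts how many records in a few NCBI-hosted databases mention the query terms; it does not touch TCGA, UK Biobank, or other repositories with their own access routes, and a keyword hit is still a long way from a curated, analysis-ready dataset.

```python
# A minimal sketch of a cross-repository keyword scan over NCBI-hosted
# databases (GEO DataSets, SRA, BioProject) via the E-utilities API.
import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
QUERY = "pancreatic cancer AND KRAS"

for db in ("gds", "sra", "bioproject"):
    resp = requests.get(
        ESEARCH,
        params={"db": db, "term": QUERY, "retmax": 0, "retmode": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    count = resp.json()["esearchresult"]["count"]
    print(f"{db}: {count} records mention the query terms")
```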

However, imagine how much time, and how much of highly skilled researchers' effort, could be saved if this entire process of searching for existing data across repositories and finding harmonized, annotated, and linked data, structured and ready for downstream analysis, were consolidated on one platform. Not only that: the chunk of the drug discovery pipeline that is automated with this approach frees up effort that can be redirected toward analysis and insight derivation.

The Good News is..

..that in the past decade or so, with the emergence of TechBio companies (companies that solve biological research questions using cutting-edge technology), cloud platforms that can be used to automate drug discovery pipelines are no longer a dream. They're out there, they're good, and they offer a great deal of value in terms of data warehousing, management, collaboration, and analysis.

Validating the hypothesis that an identified target is a good candidate for a specific disease is an essential and critical step. What is at stake is a set of highly valuable resources that ought to be used judiciously in this post-COVID-19 era, in which we have learned the importance of accelerated target identification, of data sharing and collaboration, and of faster FDA approvals. We can no longer afford the usual 10-year route to get a novel therapeutic from bench to bedside. Automation and integration of cloud platforms based on AI/ML is the first step in this direction, and it has already shown a lot of promise in drug discovery.
