The Bioinformatics race to Cloud hosted HPC, AI and ML
This post covers a bit of what we at Stratogent have seen and assisted with regarding migrating Bioinformatics customers to the Cloud. This has allowed our customers to take advantage of High-Performance Computing (HPC), AI, and ML cloud-based solutions. We also cover a bit of where we see the industry headed, particularly with cloud-hosted Bioinformatics services and tools.
Bioinformatics data storage and processing models
For the past decade, Bioinformatics organizations have been moving data, processing, and analytics to the Cloud. Some newer organizations have only ever used the Cloud, and thus have no need to migrate.
As for where data should be hosted: given a robust enough Internet connection between on-prem Lab Instruments and the Cloud hosting provider that receives the output from the RNA/DNA sequencing instruments, on-prem storage is no longer needed. Backups may still be required, for example of source FASTA and FASTQ files, but there is no need for duplicate uncompressed source files scattered between on-prem and the Cloud.
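To illustrate the backup point above, here is a minimal sketch of keeping one compressed, checksummed backup copy of a source file instead of a second uncompressed copy. It assumes plain local paths and uses only the Python standard library; the function names (`sha256_of`, `compress_for_backup`, `verify_backup`) are hypothetical helpers for illustration, not part of any vendor tooling.

```python
import gzip
import hashlib
import shutil
from pathlib import Path
from typing import Tuple

CHUNK_SIZE = 1 << 20  # read files 1 MiB at a time so large FASTQ files fit in memory


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compress_for_backup(source: Path, backup_dir: Path) -> Tuple[Path, str]:
    """Gzip `source` into `backup_dir`; return the backup path and the source checksum."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / (source.name + ".gz")
    with source.open("rb") as raw, gzip.open(target, "wb") as packed:
        shutil.copyfileobj(raw, packed)
    return target, sha256_of(source)


def verify_backup(backup: Path, expected_sha256: str) -> bool:
    """Stream-decompress the backup and compare its checksum to the original's."""
    digest = hashlib.sha256()
    with gzip.open(backup, "rb") as packed:
        for chunk in iter(lambda: packed.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256
```

Recording the checksum of the uncompressed source alongside the backup makes it possible to confirm later that the compressed copy still restores to the original bytes.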
Until recently, the model for storing data has been for sequencer instruments to write their output files to on-prem storage. Only after that output was written to disk would it be copied or synced to Cloud-hosted storage as a separate process. Writing to on-prem storage and then copying to Cloud storage, which effectively leaves two copies of the data, is no longer necessary. Sequencer instruments can write data directly to Cloud-hosted storage, for example to Windows SMB shares hosted by Amazon's File Gateway product.
Primary, secondary, and tertiary data analysis can be handled right there in the cloud, for example using AWS HealthOmics service workflows to analyze DNA or RNA sequences.
Currently, the sequencer instruments themselves must be hosted on-prem. The world has not yet reached a point where the sequencer instruments can be hosted in AWS data centers, along with a chain of custody for shipping samples to AWS... but at this point, why not? The writer of this post has seen a case where firmware updates on cellphones were tested remotely against physical cellphones being hosted in a datacenter... so why not modified, hosted Illumina DNA Sequencers or Thermo Fisher CADs? That may be next, as Bioinformatics continues moving to the Cloud.
Although sequencer instruments cannot yet be hosted in the Cloud, Bioinformatics software such as Laboratory Information Management Systems (LIMS), which manages instruments, workflows, sample tracking, and data management, can be. An example of this would be Thermo Fisher's Cloud LIMS Deployment.
Chromatography Data System (CDS) software for collecting, managing, and reporting chromatography test results, such as Waters's Empower 3.8.0, can also be hosted in the Cloud and connected to on-prem Lab workstations running the Empower client.
Cloud native Bioinformatics services
All major Cloud vendors offer Binary Large Object (BLOB) storage of some kind for storing files - Azure has "Azure Blob Storage", Amazon has S3, and Google Cloud Platform (GCP) has "Cloud Storage".
Data that has been migrated to this storage can be operated on by a variety of Cloud native services. There is no single Cloud native service that will do everything. The magic lies in building a pipeline and chaining multiple Cloud native services together to accomplish a useful task. Success lies in understanding the task objective, having a deep understanding of the Genomics and Life Science services offered by the Cloud provider, knowing when to choose one service or Cloud provider over another, and having the creativity to chain these services together to accomplish the task. There is no one right way to do this - but there are plenty of wrong ways. Some examples of the right ways to do this can be found in the online AWS Whitepaper "Genomics Data Transfer, Analytics, and Machine Learning using AWS Services".
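The idea of chaining services into a pipeline can be sketched in a few lines. This is only an illustration of the pattern, not any provider's API: the stage functions (`ingest`, `align`, `call_variants`), the bucket name, and the storage layout are all hypothetical, and each stage simply stands in for a call to a managed Cloud service that reads the previous stage's output location and records its own.

```python
from typing import Callable, Dict, List

Context = Dict[str, str]
Stage = Callable[[Context], Context]


def run_pipeline(stages: List[Stage], context: Context) -> Context:
    """Run each stage in order, handing the previous stage's output to the next."""
    for stage in stages:
        # Pass a copy so a stage cannot modify the caller's dictionary in place.
        context = stage(dict(context))
    return context


# Hypothetical stages: each one stands in for a managed Cloud service call.
def ingest(ctx: Context) -> Context:
    ctx["raw_uri"] = f"{ctx['bucket']}/raw/{ctx['sample']}.fastq.gz"
    return ctx


def align(ctx: Context) -> Context:
    # Depends on the output of `ingest`; raises KeyError if run out of order.
    raw = ctx["raw_uri"]
    ctx["bam_uri"] = raw.replace("/raw/", "/aligned/").replace(".fastq.gz", ".bam")
    return ctx


def call_variants(ctx: Context) -> Context:
    # Depends on the output of `align`.
    bam = ctx["bam_uri"]
    ctx["vcf_uri"] = bam.replace("/aligned/", "/variants/").replace(".bam", ".vcf.gz")
    return ctx


if __name__ == "__main__":
    result = run_pipeline(
        [ingest, align, call_variants],
        {"bucket": "s3://example-bucket", "sample": "S1"},
    )
    print(result["vcf_uri"])
```

Keeping every stage behind the same small interface is one way to swap a service, or a Cloud provider, for a single step without rewriting the rest of the chain.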
High Performance Computing (HPC), which has traditionally meant clustered on-prem nodes handling high-speed processing and calculation, is also available as a service from Cloud providers. As examples, AWS offers both AWS ParallelCluster 3 and AWS Batch, while Azure offers "Azure Batch HPC for Genomics". These services can leverage vast amounts of computing power to analyze and process genome sequences.
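As a rough sketch of what handing work to one of these services can look like, the snippet below prepares one AWS Batch job request per sample and submits them with the boto3 `submit_job` call. The queue name, job definition, `INPUT_URI` environment variable, and `align-` job name prefix are assumptions made for illustration; a real setup would use whatever the organization's own job definitions expect, and submitting requires AWS credentials and an existing queue.

```python
import re
from typing import Any, Dict, List


def build_submit_job_requests(
    sample_uris: List[str], job_queue: str, job_definition: str
) -> List[Dict[str, Any]]:
    """Build one AWS Batch submit_job keyword-argument dict per input file."""
    requests = []
    for uri in sample_uris:
        # Derive a sample name from the file name, e.g. ".../S1.fastq.gz" -> "S1".
        sample = uri.rsplit("/", 1)[-1].split(".", 1)[0]
        # Keep only letters, digits, hyphens, and underscores in the job name.
        safe = re.sub(r"[^A-Za-z0-9_-]", "-", sample)[:100]
        requests.append(
            {
                "jobName": f"align-{safe}",
                "jobQueue": job_queue,
                "jobDefinition": job_definition,
                "containerOverrides": {
                    # INPUT_URI is a hypothetical variable read by the container.
                    "environment": [{"name": "INPUT_URI", "value": uri}],
                },
            }
        )
    return requests


def submit_all(requests: List[Dict[str, Any]]) -> List[str]:
    """Submit every request to AWS Batch and return the resulting job IDs."""
    import boto3  # imported here so the builder above works without AWS access

    batch = boto3.client("batch")
    return [batch.submit_job(**request)["jobId"] for request in requests]
```

Separating request building from submission keeps the sample-to-job mapping easy to inspect before any compute is started.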
Machine Learning (ML) and Deep Learning, which are subsets of Artificial Intelligence (AI), are rapidly becoming staples in Bioinformatics research. Some examples are predicting how cancers will progress in patients, and identifying which genomic variants mean an increased risk of developing cancer. Cloud providers offer unified dashboard interfaces for Data Scientists to work with ML/AI. AWS's "Amazon SageMaker Studio" is an example of a Cloud-hosted tool with a unified interface, which includes JupyterLab, RStudio, and tools for managing ML models, building generative AI applications, and using custom Python and R kernels for Jupyter Notebook development.
Data Scientist end-users will find that there is a learning curve to use the Cloud native ML/AI tools, but the learning curve is not steep, and Cloud providers make this easy, with well-documented example code online, as well as JumpStart tutorials.
In Conclusion…
The Bioinformatics industry is at an intersection point. Data and applications that are being moved to the Cloud or that were generated and built in the Cloud to start with are being integrated with cloud-native pipelines, services, Cloud HPC computing power, and new AI/ML services to quickly process and analyze genomic data. Stratogent is proud to be part of this evolution in genomic data processing speed.