Nvidia and Harvard find active areas in cell DNA with AI
Harvard and Nvidia will soon release joint research in which they apply deep learning to epigenomics, the study of modifications to a cell's genetic material, to understand how diseases and genomic variation affect specific kinds of cells in the human body.
The deep learning toolkit is called AtacWorks. It is built on a neural network originally designed for computer vision. Now, AtacWorks "allows us to study how diseases and genomic variation influence very specific types of cells of the human body," Nvidia researcher Avantika Lal, lead author on the paper, told reporters last week. "And this will enable previously impossible biological discovery, and we hope would also contribute to the discovery of new drug targets."
AtacWorks works with data from ATAC-seq, an established method for finding which parts of the genome are accessible in a cell. The genome is the genetic material of an organism. Every cell in the body descends from a single cell, so all cells share the same genome sequence, which is about 3 billion bases long. However, each type of cell can access only the parts of the genome that it needs for its function.
"That allows us to understand what makes every type of cell different from each other, or how every type of cell is affected in disease, or in other biological changes," Lal said.
While ATAC-seq has been successful at finding which parts of the DNA a cell can access, by measuring a signal at each base of the genome, it requires thousands of cells to produce a clear signal. This makes it hard to use ATAC-seq to study rare cells, such as the stem cells that produce blood cells and platelets. With AtacWorks applied to the ATAC-seq data, the signal can be recovered from just a handful of cells. In the paper, the researchers describe how they applied AtacWorks to an ATAC-seq dataset of only 50 stem cells. They were able to identify the genome sequences associated with producing white blood cells and, separately, the sequences that help produce red blood cells.
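To see why the number of cells matters, here is a minimal simulation sketch. It is not part of AtacWorks or ATAC-seq tooling: the accessibility profile, the number of reads per cell and all function names are assumptions made for illustration. Each simulated cell contributes a few reads that land more often on "open" positions, and the resulting coverage is compared against the true profile.

```python
import random

def true_profile(length=200, peaks=((40, 60), (120, 150)), background=0.02):
    """Hypothetical accessibility profile: low background with two open regions."""
    profile = [background] * length
    for start, end in peaks:
        for i in range(start, end):
            profile[i] = 1.0
    return profile

def simulate_coverage(profile, n_cells, reads_per_cell=5, rng=None):
    """Pool reads over all cells; each read lands on a position with
    probability proportional to the profile value at that position."""
    rng = rng or random.Random(0)
    positions = rng.choices(range(len(profile)), weights=profile,
                            k=n_cells * reads_per_cell)
    coverage = [0] * len(profile)
    for p in positions:
        coverage[p] += 1
    return coverage

def normalized_error(coverage, profile):
    """Mean absolute difference after scaling both tracks to sum to 1."""
    c_total = sum(coverage)
    p_total = sum(profile)
    diffs = (abs(c / c_total - p / p_total) for c, p in zip(coverage, profile))
    return sum(diffs) / len(profile)

profile = true_profile()
few = simulate_coverage(profile, n_cells=50, rng=random.Random(1))
many = simulate_coverage(profile, n_cells=5000, rng=random.Random(1))
err_few = normalized_error(few, profile)
err_many = normalized_error(many, profile)
print(f"50 cells:   error = {err_few:.5f}")
print(f"5000 cells: error = {err_many:.5f}")
```

In this toy setup the track built from 50 cells follows the true profile much less closely than the one built from 5,000 cells, which is the kind of gap a denoising model is meant to close.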
This approach could help researchers study a range of conditions, including cardiovascular disease, Alzheimer's disease, diabetes and neurological disorders.
AtacWorks is a PyTorch-based neural network that was trained on labelled pairs of matching ATAC-seq datasets, one high quality and the other noisy. Given the noisy copy, the model learned to predict an accurate, high-quality version of the dataset and to identify peaks in the signal.
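The training setup described above can be sketched in miniature. The code below is not AtacWorks and does not use PyTorch: as a stand-in for the neural network, it fits a single small convolution filter by gradient descent so that a noisy track maps to its clean counterpart, then calls peaks by thresholding the denoised output. The synthetic tracks, the filter width, the learning rate and the 0.5 threshold are all assumptions chosen for illustration.

```python
import random

def make_pair(rng, length=120):
    """Hypothetical (noisy, clean) pair: one open region plus Gaussian noise."""
    start = rng.randrange(10, length - 40)
    width = rng.randrange(10, 30)
    clean = [1.0 if start <= i < start + width else 0.0 for i in range(length)]
    noisy = [c + rng.gauss(0.0, 0.5) for c in clean]
    return noisy, clean

def convolve(signal, kernel):
    """Same-length 1D filter (cross-correlation) with zero padding at the edges."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        total = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):
                total += w * signal[j]
        out.append(total)
    return out

def mse(pairs, kernel):
    """Mean squared error between filtered noisy tracks and clean tracks."""
    total, count = 0.0, 0
    for noisy, clean in pairs:
        pred = convolve(noisy, kernel)
        total += sum((p - c) ** 2 for p, c in zip(pred, clean))
        count += len(clean)
    return total / count

def train(pairs, width=9, lr=0.05, epochs=50):
    """Fit the filter weights by full-batch gradient descent on the MSE."""
    kernel = [0.0] * width
    half = width // 2
    n = sum(len(clean) for _, clean in pairs)
    for _ in range(epochs):
        grad = [0.0] * width
        for noisy, clean in pairs:
            pred = convolve(noisy, kernel)
            for i, (p, c) in enumerate(zip(pred, clean)):
                err = p - c
                for k in range(width):
                    j = i + k - half
                    if 0 <= j < len(noisy):
                        grad[k] += 2.0 * err * noisy[j] / n
        kernel = [w - lr * g for w, g in zip(kernel, grad)]
    return kernel

rng = random.Random(0)
train_pairs = [make_pair(rng) for _ in range(20)]
test_noisy, test_clean = make_pair(rng)  # held-out pair, not used in training

kernel = train(train_pairs)
loss_before = mse([(test_noisy, test_clean)], [0.0] * 9)  # untrained, all-zero filter
loss_after = mse([(test_noisy, test_clean)], kernel)
denoised = convolve(test_noisy, kernel)
peaks = [1 if v > 0.5 else 0 for v in denoised]  # hypothetical peak threshold
print(f"test MSE before: {loss_before:.3f}, after: {loss_after:.3f}")
```

The held-out pair plays the role of a new sample: the filter is fitted on one set of noisy and clean tracks and then applied to a track it has not seen. A real model would replace the single filter with a deep network and would learn the peak calls rather than threshold them.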
The model was able to process an entire genome in 30 minutes on Nvidia Tensor Core GPUs. The same work would take 15 hours on a system with 32 CPU cores. "That's a really wonderful thing because it means that we can train models using whatever data we have available and then apply it to entirely new biological samples," Lal said.
"We are hoping that once our paper comes out, other scientists working with different diseases would also pick up this technique and be interested in using it," Lal said. "And we are excited to see what new research and new developments that can enable."
Credits: https://indiaai.gov.in