Scientists Find the "Black Hat" for Deep Learning in Genomics
The Basic Framework of DARTS
For the first time, researchers have combined deep learning with Bayesian hypothesis testing to improve the accuracy of RNA alternative splicing analysis.
Journalist Zhao Guangli
In the life sciences, it is often said that applying deep learning to genomics is like "a blind man looking in a dark house for a black hat that does not exist". In other words, deep learning has regrettably not yet brought many surprises to genomics. However, a recent study led by Xing Yi, a professor at the University of Pennsylvania and the Children's Hospital of Philadelphia, has found such a "black hat".
The paper, published in Nature Methods, proposes a new computational framework, DARTS (short for "Deep-learning Augmented RNA-seq analysis of Transcript Splicing"). For the first time, the framework combines deep learning with Bayesian hypothesis testing for the analysis of RNA alternative splicing. This combination allows it to improve the accuracy of differential splicing quantification from RNA-seq, even for samples with lower sequencing depth.
Zhang Qiangfeng, a professor at Tsinghua University's School of Life Sciences, commented: "DARTS combines the advantages of deep learning and a Bayesian hypothesis testing statistical model, provides a better means of alternative splicing analysis for data with low sequencing depth, and extends the sensitivity and accuracy of traditional RNA-seq alternative splicing analysis."
A widespread concern in computational genomics
In the paper, Xing Yi and colleagues point out that RNA-seq is the most commonly used experimental technique for studying RNA splicing. However, although RNA-seq quantifies gene expression well, differential splicing analysis requires higher sequencing depth. Even then, existing computational methods cannot accurately quantify splicing changes in low-expression genes. New computational methods are therefore urgently needed to improve the accuracy of splicing quantification.
"Since the discovery of alternative splicing in the 1970s, the basic scientific questions have centered on discovering alternative splicing sites, analyzing differences, and discovering and constructing regulatory elements and networks. The invention of RNA-seq made it possible to analyze alternative splicing differences systematically and quantitatively," Zhang Qiangfeng said. He added that differential alternative splicing analysis of large volumes of sequencing data requires excellent statistical models and computational tools, so it has always been a technically demanding topic in bioinformatics.
According to Zhang Qiangfeng, Xing Yi's research group has for many years contributed influential algorithms and computational tools for differential alternative splicing analysis of large-scale sequencing data. The team's rMATS software for differential splicing analysis of high-throughput RNA-seq data performs well on data sets with deeper sequencing and better quality, and has been widely downloaded and used worldwide.
However, because of cost and other constraints, many RNA-seq experiments are designed with relatively shallow sequencing depth. In these data sets, very few alternative splicing events can be used for differential analysis.
Ma Jian, a professor at the School of Computer Science at Carnegie Mellon University in the United States, said that genomics has many similar problems: how to train a machine learning model for a specific genomic annotation (such as chromatin structure or transcription factor binding) on existing data and have it predict effectively in a new cell line has become a widespread concern in computational genomics. "The new overall design concept of DARTS is worth borrowing for many other similar problems."
The DARTS computing framework gives an answer
According to the Nature Methods paper, DARTS consists of two parts: a deep neural network module (DNN) and a Bayesian hypothesis testing module (BHT). The DNN predicts differential splicing from cis-sequence features and sample-specific RNA-binding protein expression levels, while the BHT infers differential splicing by integrating the sequencing data of the experimental samples with the prior probability supplied by the deep neural network.
Unlike other methods, the DNN predicts alternative splicing not only from cis-sequence features but also from the expression levels of RNA-binding proteins in the samples, which adds a further dimension to the predictors.
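To make the idea of sample-specific inputs concrete, here is a minimal Python sketch of how one exon's cis-sequence features might be joined with RNA-binding protein (RBP) expression from the two samples being compared. The function name, the feature layout, and all numbers are invented for illustration; the article does not describe the actual DARTS input encoding.

```python
def build_features(cis_features, rbp_expr_a, rbp_expr_b):
    """Concatenate exon-level cis features with RBP expression in two samples.

    Hypothetical layout (an assumption, not the published encoding):
    [cis features..., RBP levels in sample A..., RBP levels in sample B...]
    """
    if rbp_expr_a.keys() != rbp_expr_b.keys():
        raise ValueError("both samples must report the same RBPs")
    rbps = sorted(rbp_expr_a)  # fixed order so every exon uses the same columns
    return (
        list(cis_features)
        + [rbp_expr_a[name] for name in rbps]
        + [rbp_expr_b[name] for name in rbps]
    )


# Toy values, made up for this example.
cis = [0.8, 0.1, 1.0]                      # e.g. scores derived from the exon's sequence
sample_a = {"RBP_X": 12.5, "RBP_Y": 3.0}   # RBP expression in sample A
sample_b = {"RBP_X": 2.1, "RBP_Y": 3.2}    # RBP expression in sample B

features = build_features(cis, sample_a, sample_b)
print(features)
```

Because the RBP columns change from one comparison to the next, the same exon can receive a different prediction in a different pair of samples, which is the point the paragraph above makes.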
The logic of DARTS is that the DNN, trained by deep learning on the large body of RNA-seq results in the ENCODE and Roadmap databases, produces high-accuracy predictions that serve as the Bayesian prior probability in the BHT; combining this prior with the RNA-seq results of a specific experiment yields more accurate inference of differential splicing.
In practice, Xing Yi's team found that for low-throughput RNA-seq libraries, analysis augmented with DNN predictions achieves higher accuracy than traditional methods, with the improvement most evident at low throughput. Even in high-throughput RNA-seq libraries, using the DNN predictions makes it possible to detect alternative splicing changes in low-expression genes, which traditional analysis methods have often overlooked.
In other words, the study shows that DARTS not only improves the accuracy of RNA-seq-based alternative splicing analysis but also provides a means of studying alternative splicing in low-expression genes.
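The way a prior and read counts combine can be illustrated with a deliberately simplified Python sketch. It is not the published DARTS model: it assumes a plain binomial likelihood for exon inclusion reads, a uniform grid over inclusion levels, and a hypothetical cutoff of 0.05 for calling a difference. The value `prior_diff` stands in for a DNN prediction, and all names here are illustrative.

```python
from math import comb


def binom_lik(inc, skip, psi):
    """Binomial likelihood of `inc` inclusion reads out of inc + skip at inclusion level psi."""
    return comb(inc + skip, inc) * psi**inc * (1.0 - psi) ** skip


def posterior_differential(reads1, reads2, prior_diff, cutoff=0.05, grid=101):
    """Posterior probability that the two inclusion levels differ by more than `cutoff`.

    reads1, reads2: (inclusion, skipping) read counts in the two conditions.
    prior_diff: prior probability of differential splicing (stand-in for a DNN score).
    """
    psis = [i / (grid - 1) for i in range(grid)]
    lik1 = [binom_lik(reads1[0], reads1[1], p) for p in psis]
    lik2 = [binom_lik(reads2[0], reads2[1], p) for p in psis]
    steps = round(cutoff * (grid - 1))  # cutoff expressed in grid steps

    sum_diff = sum_same = 0.0
    n_diff = n_same = 0
    for i, l1 in enumerate(lik1):
        for j, l2 in enumerate(lik2):
            if abs(i - j) > steps:
                sum_diff += l1 * l2
                n_diff += 1
            else:
                sum_same += l1 * l2
                n_same += 1

    # Average likelihood under each hypothesis (uniform over its grid region).
    m_diff = sum_diff / n_diff
    m_same = sum_same / n_same
    numerator = prior_diff * m_diff
    return numerator / (numerator + (1.0 - prior_diff) * m_same)


# Shallow data: the prior largely decides the outcome.
shallow_high = posterior_differential((2, 1), (1, 2), prior_diff=0.9)
shallow_low = posterior_differential((2, 1), (1, 2), prior_diff=0.1)
# Deep data: the reads dominate, even with a sceptical prior.
deep = posterior_differential((90, 10), (10, 90), prior_diff=0.1)
print(shallow_high, shallow_low, deep)
```

With only a handful of reads, the two calls that differ only in `prior_diff` give clearly different posteriors, whereas with many reads the data largely override the prior. This mirrors the article's claim that prior information matters most when sequencing is shallow or the gene is lowly expressed.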
Analyzing DARTS:
Massive data training and synthesis of new sample features
"Strategies for designing from computational methods
Please read the Chinese version for details.