Current Bioinformatics - Volume 17, Issue 3, 2022
Machine Learning and Deep Learning Strategies in Drug Repositioning
Authors: Fei Wang, Yulian Ding, Xiujuan Lei, Bo Liao and Fang-Xiang Wu
Drug repositioning involves exploring novel uses for existing drugs. It plays an important role in drug discovery, especially in the pre-clinical stages. Compared with traditional drug discovery approaches, computational approaches can save time and reduce cost significantly. Since drug repositioning relies on existing drug-, disease-, and target-centric data, many machine learning (ML) approaches have been proposed to extract useful information from multiple data resources. Deep learning (DL) is a subset of ML and appeared in drug repositioning much later than basic ML. Nevertheless, DL methods have shown strong performance in predicting potential drugs in many studies. In this article, we review the commonly used basic ML and DL approaches in drug repositioning. First, the related databases are introduced, all of which are publicly available to researchers. Two types of preprocessing steps, calculating similarities and constructing networks based on those data, are discussed. Second, the basic ML and DL strategies are illustrated separately. Third, we review the latest studies on applications of basic ML and DL in identifying potential drugs through three paths: drug-disease associations, drug-drug interactions, and drug-target interactions. Finally, we discuss the limitations of current studies and suggest several directions for future work to address them.
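The two preprocessing steps named in this abstract, calculating similarities and constructing networks, can be illustrated with a minimal Python sketch. The abstract does not say which similarity measure the reviewed studies use, so the Jaccard index, the threshold, and the toy drug and target names below are all assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_network(features: dict, threshold: float) -> dict:
    """Build an undirected weighted graph {node: {neighbor: similarity}}.

    An edge is kept only when the similarity reaches the threshold.
    """
    graph = {name: {} for name in features}
    for u, v in combinations(features, 2):
        s = jaccard(features[u], features[v])
        if s >= threshold:
            graph[u][v] = s
            graph[v][u] = s
    return graph


# Hypothetical toy data: drug name -> set of target identifiers.
drug_targets = {
    "drug_A": {"T1", "T2", "T3"},
    "drug_B": {"T2", "T3", "T4"},
    "drug_C": {"T5"},
}
network = similarity_network(drug_targets, threshold=0.4)
```

Here `drug_A` and `drug_B` share two of four distinct targets, so they are linked with weight 0.5, while `drug_C` stays isolated. A graph of this shape is the kind of input that network-based ML methods typically consume.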
BDselect: A Package for k-mer Selection Based on the Binomial Distribution
Authors: Fu-Ying Dao, Hao Lv, Zhao-Yue Zhang and Hao Lin
Background: Dimension disaster (the curse of dimensionality) is often associated with feature extraction. The extracted features may contain redundant information, which strains computing capacity and leads to overfitting. Objective: Feature selection is an important strategy for overcoming the problems caused by dimension disaster. In most machine learning tasks, features determine the upper limit of model performance. Therefore, more feature selection methods should be developed to remove redundant features. Methods: In this paper, we introduce a new technique to optimize sequence features based on the Binomial Distribution (BD). First, the principle of the binomial distribution algorithm is introduced in detail. Then, the proposed algorithm is compared with other commonly used feature selection methods on three different types of datasets, using a Random Forest classifier with the same parameters. Results: The results confirm that BD offers a promising improvement in feature selection and classification accuracy. Conclusion: Finally, we provide the source code and an executable program package (http://lingroup.cn/server/BDselect/), with which users can easily apply our algorithm in their research.
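The abstract does not spell out the scoring rule, so the following is only a sketch of one common way to rank k-mers with a binomial test, not the BDselect implementation. It assumes that each k-mer is scored by how unlikely its count in a class would be if occurrences were spread across classes in proportion to class size; the function names and toy counts are hypothetical.

```python
from math import comb


def binomial_tail(n_total: int, n_obs: int, q: float) -> float:
    """P(X >= n_obs) for X ~ Binomial(n_total, q)."""
    return sum(
        comb(n_total, m) * q**m * (1 - q) ** (n_total - m)
        for m in range(n_obs, n_total + 1)
    )


def rank_kmers(counts: dict) -> list:
    """Rank k-mers by class specificity.

    counts maps each k-mer to a list of per-class occurrence counts.
    Returns (k-mer, confidence) pairs, most class-specific first.
    """
    n_classes = len(next(iter(counts.values())))
    class_totals = [sum(c[j] for c in counts.values()) for j in range(n_classes)]
    grand_total = sum(class_totals)
    # Assumed null model: a k-mer falls in class j with probability q_j.
    priors = [t / grand_total for t in class_totals]
    scored = []
    for kmer, per_class in counts.items():
        n_i = sum(per_class)
        # Confidence = 1 - tail probability, taking the best class.
        confidence = max(
            1.0 - binomial_tail(n_i, per_class[j], priors[j])
            for j in range(n_classes)
        )
        scored.append((kmer, confidence))
    return sorted(scored, key=lambda kv: kv[1], reverse=True)


# Hypothetical counts of three 2-mers in two classes.
toy_counts = {"AA": [30, 10], "AC": [20, 20], "GT": [5, 25]}
ranking = rank_kmers(toy_counts)
```

In this toy input, `AC` is evenly split between the classes and ends up last, while `AA` and `GT` are strongly skewed toward one class and rank at the top. A selection step would then keep the top-ranked k-mers and pass them to a classifier.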
Deep Learning Model for Protein Disease Classification
Background: Protein sequence analysis helps in the prediction of protein functions. As the number of proteins increases, analyzing and studying the similarity between them becomes a challenge for bioinformaticians. Most existing protein analysis methods use the Support Vector Machine. Deep learning has received little attention in protein analysis, and little work has focused on protein disease classification. Objective: The contribution of this paper is a deep learning approach that classifies protein diseases based on protein descriptors. Methods: Different protein descriptors are used and decomposed into modified feature descriptors. We introduce the use of a Convolutional Neural Network (CNN) model to learn and classify protein diseases. The modified feature descriptors are fed to the CNN model on a dataset of 1563 protein sequences classified into 3 disease classes: AIDS, Tumor suppressor, and Proto-oncogene. Results: Using the modified feature descriptors significantly increases the performance of the CNN model over the Support Vector Machine with different kernel functions. One modified feature descriptor improved by 19.8%, 27.9%, 17.6%, 21.5%, 17.3%, and 22% on the evaluation metrics Area Under the Curve, Matthews Correlation Coefficient, Accuracy, F1-score, Recall, and Precision, respectively. Conclusion: The results show that the predictions of the proposed CNN model trained on modified feature descriptors significantly surpass those of the Support Vector Machine model.
Diabetes Induced Factors Prediction Based on Various Improved Machine Learning Methods
Authors: Jun Wu, Lulu Qu, Guoping Yang and Nan Han
Background: As quality of life improves, people have more time and energy to pay attention to their own health. Diabetes, one of the most common and fastest-growing diseases, has attracted widespread attention from experts in bioinformatics. People of all ages around the world suffer from diabetes, which can shorten patients' life spans. Because diabetes has a significant impact on human health, the accuracy of the initial diagnosis is essential. Diabetes can bring serious complications, especially in the elderly, such as cardiovascular and cerebrovascular diseases, stroke, and multiple organ damage. The initial diagnosis of diabetes can reduce the possibility of deterioration. Identifying and analyzing potential risk factors for different physical attributes can help diagnose the prevalence of diabetes. The more accurately the prevalence is determined, the more likely it is that the incidence of complications can be reduced. Methods: In this paper, we use the open-source NHANES data set to analyze and determine potential risk factors relevant to diabetes using improved versions of Logistic Regression, SVM, and other machine learning algorithms. Results: Experimental results show that the improved version of Random Forest performs best, with a classification accuracy of 92%, and that age, blood-related diabetes, high blood pressure, cholesterol, and BMI are the most important risk factors related to diabetes. Conclusion: The proposed machine learning method can cope with class imbalance and outlier detection problems.
Association Analysis Between Introns and mRNAs in Caenorhabditis elegans Genes with Different Expression Levels
Authors: Yanjuan Cao, Qiang Zhang, Zuwei Yan and Xiaoqing Zhao
Background: Introns are ubiquitous in pre-mRNA but are often overlooked. They also play an important role in the regulation of gene expression. Objective and Methods: We mainly use the improved Smith-Waterman local alignment approach to compare the optimal matching regions between introns and mRNA sequences in Caenorhabditis elegans (C. elegans) genes with high and low expression. Results: We found that the relative matching frequency distributions of all genes lie exactly between those of highly and lowly expressed genes, indicating that introns in highly and lowly expressed genes have different biological functions. Highly expressed genes have higher matching strengths on mRNA sequences than genes expressed at lower levels; the remarkably matched regions appear in UTRs, particularly in the 3'UTR. The optimal matching frequency distributions differ clearly between highly and lowly expressed genes in the functional regions around the translation initiation and termination sites. The mRNA sequences with CpG islands tend to have stronger relative matching frequency distributions, especially in highly expressed genes. Additionally, the sequence characteristics of the optimal matched segments are consistent with those of miRNAs, and they are considered a type of functional RNA segment. Conclusion: Introns in highly and lowly expressed genes contribute to the recognition of translation initiation sites and translation termination sites. Moreover, our results suggest that the potential matching relationships between introns and mRNA sequences in highly and lowly expressed genes are significantly different and indicate that the matching strength correlates with the ability of introns to enhance gene expression.
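For readers unfamiliar with local alignment, the sketch below shows the textbook Smith-Waterman algorithm with a linear gap penalty. It is not the authors' improved variant, which the abstract does not describe, and the scoring parameters and example sequences are assumptions chosen for illustration.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (best score, aligned segment of a, aligned segment of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] is the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical sequences sharing the segment "ACGT".
score, seg_a, seg_b = smith_waterman("TTACGTTT", "GGACGTCC")
```

With these parameters the shared segment `ACGT` is recovered with a score of 8 (four matches at 2 each). In an intron-to-mRNA comparison, the position of the best-scoring segment on the mRNA would indicate the optimal matching region.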
ConvChrome: Predicting Gene Expression Based on Histone Modifications Using Deep Learning Techniques
Authors: Rania Hamdy, Fahima A. Maghraby and Yasser M.K. Omar
Background: Gene regulation is a complex and dynamic process that depends not only on the DNA sequence of genes but also on epigenetic mechanisms. These mechanisms, along with other factors, contribute to changing the behavior of DNA. While they cannot affect the structure of DNA, they can control its behavior by turning genes "on" or "off," which determines which proteins are produced. Objectives: This paper focuses on the histone modification mechanism. Histones are the group of proteins that bundle DNA into structural units called nucleosomes (coils). The way these histone proteins wrap DNA determines whether or not a gene can be accessed for expression. When histones are tightly bound to DNA, the gene cannot be expressed, and vice versa. It is important to know the combinatorial patterns of histone modifications and how these patterns work together to control gene expression. Methods: In this paper, ConvChrome, a deep learning method, is proposed for predicting gene expression from histone modification data. It uses more than one convolutional network model to recognize patterns in histone signals and interpret their spatial relationships on chromatin structure, giving insight into the regulatory signatures of histone modifications. Results and Conclusion: The results show that ConvChrome achieved an Area Under the Curve (AUC) score of 88.741%, which is an outstanding improvement over the baseline for the task of classifying gene expression from combinatorial interactions among five histone modifications on 56 different cell types.
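The AUC reported here can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; it is a generic illustration with hypothetical labels and scores, not the evaluation code used for ConvChrome.

```python
def roc_auc(labels, scores) -> float:
    """Area under the ROC curve via pairwise comparison.

    labels are 0/1; scores are higher for more confident positives.
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical predictions for four genes (1 = highly expressed).
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

In this toy case three of the four positive-negative pairs are ordered correctly, giving an AUC of 0.75. The pairwise form is quadratic in the number of examples, so a rank-based computation is preferable for large datasets.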
Integrated Bioinformatics and Machine Learning Algorithms Analyses Highlight Related Pathways and Genes Associated with Alzheimer's Disease
Authors: Hui Zhang, Qidong Liu, Xiaoru Sun, Yaru Xu, Yiling Fang, Silu Cao, Bing Niu and Cheng Li
Background: The pathophysiology of Alzheimer's Disease (AD) is still not fully understood. Objective: This study aimed to explore the differentially expressed key genes in AD and build a predictive model for diagnosis and treatment. Methods: Gene expression data of the entorhinal cortex of AD, asymptomatic AD, and control samples from the GEO database were analyzed to explore the relevant pathways and key genes in the progression of AD. Differentially expressed genes between AD and the other two groups in the module were selected to identify biological mechanisms in AD through KEGG and PPI network analysis in Metascape. Furthermore, genes with a high connectivity degree in the PPI network analysis were selected to build a predictive model using different machine learning algorithms. Model performance was tested with five-fold cross-validation to select the best-fitting model. Results: A total of 20 co-expression gene clusters were identified after the network was constructed. Module 1 (in black) and module 2 (in royal blue) were most positively and negatively correlated with AD, respectively. A total of 565 genes in module 1 and 215 genes in module 2 overlapped with the two lists of differentially expressed genes. They were enriched in the G protein-coupled receptor signaling pathway, immune-related processes, and so on. Eleven genes were selected using lasso logistic regression, and they were considered to play an important role in predicting AD samples. The model built by the support vector machine algorithm with these 11 genes showed the best performance. Conclusion: These results shed light on the diagnosis and treatment of AD.
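The five-fold cross-validation used for model selection can be sketched as follows. This is a generic illustration, not the authors' pipeline: the helper names, the seed, and the `fit_and_score` callback (which is assumed to train on the training indices and return a score on the test indices) are all hypothetical.

```python
import random


def k_fold_indices(n_samples: int, k: int = 5, seed: int = 0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    # Striding assigns every index to exactly one fold.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_samples: int, fit_and_score, k: int = 5, seed: int = 0) -> float:
    """Average the score returned by fit_and_score(train, test) over k folds."""
    scores = [fit_and_score(train, test) for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)


# Hypothetical stand-in for model training: scores each fold by its test share.
mean_share = cross_validate(20, lambda train, test: len(test) / (len(train) + len(test)))
```

Each sample is used for testing exactly once, so comparing the averaged score across candidate algorithms gives a fairer basis for choosing the best-fitting model than a single train/test split.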
DILI-Stk: An Ensemble Model for the Prediction of Drug-induced Liver Injury of Drug Candidates
Authors: Jingyu Lee, Myeong-Sang Yu and Dokyun Na
Background: Drug-induced Liver Injury (DILI) is a leading cause of drug failure, accounting for nearly 20% of drug withdrawals. Thus, there has been great demand for in silico DILI prediction models for successful drug discovery. To date, various models have been developed for DILI prediction; however, building an accurate model for practical use in drug discovery remains challenging. Methods: We constructed an ensemble model composed of three high-performance DILI prediction models to utilize the unique advantage of each machine learning algorithm. Results: The ensemble model exhibited high predictive performance, with an area under the curve of 0.88, sensitivity of 0.83, specificity of 0.77, F1-score of 0.82, and accuracy of 0.80. When a test dataset collected from the literature was used to compare the performance of our model with publicly available DILI prediction models, our model achieved an accuracy of 0.77, sensitivity of 0.82, specificity of 0.72, and F1-score of 0.79, which were higher than those of the other DILI prediction models. As many published DILI prediction models are not available for public access, which hinders in silico drug discovery, we made our DILI prediction model publicly accessible (http://ssbio.cau.ac.kr/software/dili/). Conclusion: We expect that our ensemble model may facilitate advancements in drug discovery by providing a highly predictive model and reducing the drug withdrawal rate.
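The threshold-based metrics quoted in this abstract (sensitivity, specificity, F1-score, and accuracy) all derive from the four cells of a binary confusion matrix. The sketch below shows the standard definitions on hypothetical labels; it is not the authors' evaluation code, and the convention of returning 0.0 for an undefined ratio is an assumption.

```python
def binary_metrics(y_true, y_pred) -> dict:
    """Compute confusion-matrix metrics for 0/1 labels (1 = positive, e.g. DILI)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)  # recall on positives
    specificity = ratio(tn, tn + fp)  # recall on negatives
    precision = ratio(tp, tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
        "accuracy": ratio(tp + tn, len(pairs)),
    }


# Hypothetical labels for eight compounds.
metrics = binary_metrics([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 1, 0, 0, 0])
```

In this toy example there are three true positives, three true negatives, one false positive, and one false negative, so all four metrics equal 0.75. Reporting sensitivity and specificity side by side, as the abstract does, shows whether a model trades missed toxic compounds against false alarms.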