Mathematics and Statistics
Application of Deep Learning Neural Networks in Computer-Aided Drug Discovery: A Review
Computer-aided drug design plays an important role in drug development and design and has become a thriving area of research in the pharmaceutical industry, accelerating the drug discovery process. Deep learning, a subdivision of artificial intelligence, is widely applied to open new opportunities in drug development and design. This article reviews recent work that uses deep learning techniques to improve the understanding of drug-target interactions in computer-aided drug discovery, based on prior knowledge acquired from the literature. In general, deep learning models can be trained to predict the binding affinity between protein-ligand complexes and protein structures, or to generate protein-ligand complexes in structure-based drug discovery. In other words, artificial neural networks and deep learning algorithms, especially graph convolutional neural networks and generative adversarial networks, can be applied to drug discovery. Graph convolutional neural networks effectively capture the interactions and structural information between atoms and molecules, which can be exploited to predict the binding affinity between protein and ligand. In addition, ligand molecules with desired properties can be generated using generative adversarial networks.
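To make the binding-affinity idea concrete, the following is a minimal, illustrative sketch (not any of the reviewed models): a hand-rolled graph convolution that aggregates atom features of a protein-ligand complex over an adjacency matrix and pools them into a single affinity estimate. The atom features and adjacency matrix are assumed to come from an upstream featurizer.

```python
# Minimal GCN-style affinity regressor sketch; atom featurization is assumed upstream.
import torch
import torch.nn as nn

class SimpleGCNAffinity(nn.Module):
    def __init__(self, n_feat: int, n_hidden: int = 64):
        super().__init__()
        self.w1 = nn.Linear(n_feat, n_hidden)
        self.w2 = nn.Linear(n_hidden, n_hidden)
        self.readout = nn.Linear(n_hidden, 1)      # predicted binding affinity

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (n_atoms, n_feat); adj: (n_atoms, n_atoms) with self-loops added
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        h = torch.relu(self.w1(adj @ x / deg))      # neighborhood aggregation
        h = torch.relu(self.w2(adj @ h / deg))
        return self.readout(h.mean(dim=0))          # graph-level mean pooling

# toy usage: 5 atoms, 16 features each, identity adjacency
x = torch.randn(5, 16)
adj = torch.eye(5)
print(SimpleGCNAffinity(16)(x, adj))
```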
Integration of Artificial Intelligence, Machine Learning and Deep Learning Techniques in Genomics: Review on Computational Perspectives for NGS Analysis of DNA and RNA Seq Data
In the current state of genomics and biomedical research, the utilization of Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL) has emerged as a paradigm shifter. While traditional NGS DNA and RNA sequencing analysis pipelines have been sound in decoding genetic information, the volume and complexity of sequencing data have surged, creating a demand for more efficient and accurate methods of analysis and a growing reliance on AI/ML and DL approaches. This paper highlights how these approaches help overcome such limitations and generate better results: through pipeline automation and the integration of these tools into the NGS DNA and RNA-seq pipeline, we can improve the quality of research, as large datasets can be processed using deep learning tools. Automation reduces labor-intensive tasks and allows researchers to focus on other frontiers of research. In the traditional pipeline, all tasks from quality check to variant identification in the case of SNP detection take a huge amount of computational time, and the researcher has to enter commands manually, which invites human error; with automation, the whole process runs in comparatively less time and more smoothly, since the automated pipeline can process multiple files instead of the single file handled in the traditional pipeline. In conclusion, this review paper sheds light on the transformative impact of integrating DL into traditional pipelines and its role in optimizing computational time. Additionally, it highlights the growing importance of AI-driven solutions in advancing genomics research and enabling data-intensive biomedical applications.
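As a rough illustration of the multi-sample automation point, the sketch below loops a per-sample command list over many FASTQ files instead of running each file by hand. It is not the pipeline described in the paper; the quality-check call is a plain FastQC invocation, and `align_and_call.sh` is a hypothetical wrapper standing in for whatever alignment and variant-calling commands a given lab uses.

```python
# Illustrative per-sample automation loop; substitute your own pipeline commands.
import subprocess
from pathlib import Path

def run_step(cmd: list[str]) -> None:
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)   # stop the pipeline if any step fails

samples = sorted(Path("fastq").glob("*.fastq.gz"))
for fq in samples:
    sample = fq.name.replace(".fastq.gz", "")
    run_step(["fastqc", str(fq), "-o", "qc"])                 # quality check
    run_step(["bash", "align_and_call.sh", str(fq), sample])  # hypothetical wrapper script
```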
Prospects of Identifying Alternative Splicing Events from Single-Cell RNA Sequencing Data
Background: The advent of single-cell RNA sequencing (scRNA-seq) technology has offered unprecedented opportunities to unravel cellular heterogeneity and functions. Yet, despite its success in revealing gene expression heterogeneity, accurately identifying and interpreting alternative splicing events from scRNA-seq data remains a formidable challenge. With advancing technology and algorithmic innovations, the prospect of accurately identifying alternative splicing events from scRNA-seq data is becoming increasingly promising. Objective: This perspective aims to uncover the intricacies of splicing at the single-cell level and their potential implications for health and disease. It seeks to harness scRNA-seq's transformative power in revealing cell-specific alternative splicing dynamics and to propel our understanding of gene regulation within individual cells to new heights. Methods: The perspective grounds its discussion in recent literature, the experimental protocols of single-cell RNA-seq, and methods to identify and quantify alternative splicing events from scRNA-seq data. Results: This perspective outlines the promising potential, challenges, and methodologies for leveraging different scRNA-seq technologies to identify and study alternative splicing events, with a focus on advancing our understanding of gene regulation at the single-cell level. Conclusion: This perspective explores the prospects of utilizing scRNA-seq data to identify and study alternative splicing events, highlighting the potential challenges, methodologies, biological insights, and future directions.
MSSD: An Efficient Method for Constructing Accurate and Stable Phylogenetic Networks by Merging Subtrees of Equal Depth
Background: In systematics, phylogenetic networks are essential for studying the evolutionary relationships and diversity among species. These networks are particularly important for capturing non-tree-like processes resulting from reticulate evolutionary events. However, existing methods for constructing phylogenetic networks are influenced by the order of inputs, and different orders can lead to inconsistent experimental results. Moreover, constructing a network for large datasets is time-consuming, and the resulting network often does not include all of the input tree nodes. Aims: This paper proposes a novel method, called MSSD, which constructs a phylogenetic network from gene trees by Merging Subtrees with the Same Depth in a bottom-up way. Methods: MSSD first decomposes trees into subtrees based on depth. It then merges subtrees with the same depth, from 0 up to the maximum depth. For all subtrees of a given depth, it inserts each subtree into the current networks by means of identical subtrees. Results: We tested MSSD on simulated and real data. The experimental results show that the networks constructed by MSSD can represent all input trees and that MSSD is more stable than other methods. MSSD constructs networks faster, and the constructed networks share more information with the input trees than those of other methods. Conclusion: MSSD is a powerful tool for studying the evolutionary relationships among species in biology and is freely available at https://github.com/xingjiajie2023/MSSD.
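The decomposition step can be pictured with a small sketch. This is not the MSSD implementation, only an illustration of grouping the subtrees of a rooted gene tree by their depth, the grouping that MSSD merges bottom-up from depth 0 to the maximum depth; the tree here is a simple dictionary of child lists.

```python
# Group subtrees of a rooted tree by subtree depth (leaves have depth 0).
from collections import defaultdict

def subtree_depth(tree, node):
    children = tree.get(node, [])
    return 0 if not children else 1 + max(subtree_depth(tree, c) for c in children)

def group_subtrees_by_depth(tree, root):
    groups = defaultdict(list)
    stack = [root]
    while stack:
        node = stack.pop()
        groups[subtree_depth(tree, node)].append(node)
        stack.extend(tree.get(node, []))
    return dict(sorted(groups.items()))   # depth 0 (leaves) first

gene_tree = {"r": ["a", "b"], "a": ["x", "y"], "b": []}
print(group_subtrees_by_depth(gene_tree, "r"))   # {0: ['b', 'y', 'x'], 1: ['a'], 2: ['r']}
```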
P4PC: A Portal for Bioinformatics Resources of piRNAs and circRNAs
Background: PIWI-interacting RNAs (piRNAs) and circular RNAs (circRNAs) are two kinds of non-coding RNAs (ncRNAs) that play important roles in the epigenetic, transcriptional, and post-transcriptional regulation of many biological processes. Although various resources exist, it is still challenging to select suitable resources for specific research projects on ncRNAs. Methods: In order to help researchers find the appropriate bioinformatics sources for studying ncRNAs, we created a novel portal named P4PC that provides computational tools and data sources for piRNAs and circRNAs. Results: 249 computational tools, 126 databases, and 420 papers are manually curated in P4PC. All entries in P4PC are classified into 5 groups and 26 subgroups, and the list of resources is summarized on the first page of each group. Conclusion: According to their research purposes, users can quickly select proper resources for their research projects by viewing the detailed information and comments in P4PC. The database is available at http://www.ibiomedical.net/Portal4PC/ and https://43.138.46.5/Portal4PC/.
Prediction of Drug Pathway-based Disease Classes using Multiple Properties of Drugs
Background: Drug repositioning is now an important research area in drug discovery, as it can accelerate the discovery of novel effects of existing drugs. However, it is challenging to screen out possible effects for given drugs, and designing computational methods is a quick and cheap way to complete this task. Most existing computational methods infer the relationships between drugs and diseases. The pathway-based disease classification reported in KEGG provides a new way to investigate drug repositioning, as such classification can be applied to drugs: a predicted class for a given drug suggests latent diseases it can treat. Objective: The purpose of this study is to set up efficient multi-label classifiers to predict the classes of drugs. Methods: We adopt three types of drug information to generate drug features, including drug pathway information, label information, and a drug network. For the first two types, drugs are first encoded into binary vectors, which are further processed by singular value decomposition. For the third type, the network embedding algorithm Mashup is employed to yield drug features. The above features are combined and fed into RAndom k-labELsets (RAKEL) to construct multi-label classifiers, where the support vector machine is selected as the base classification algorithm. Results: The ten-fold cross-validation results show that the classifiers provide high performance, with accuracy higher than 0.95 and absolute true higher than 0.92. A case study indicates novel effects of three drugs, i.e., diseases they may newly treat. Conclusion: The proposed classifiers have high performance and are superior to classifiers built with other classic algorithms and drug information. Furthermore, they have the ability to discover new effects of drugs.
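A simplified sketch of this feature-then-classify pipeline is given below with hypothetical toy data: binary drug vectors are reduced by truncated SVD and fed to a multi-label SVM. The paper itself uses RAKEL with SVM base classifiers; scikit-learn's OneVsRestClassifier is used here only as a readily available stand-in for the multi-label wrapper.

```python
# Toy multi-label pipeline: SVD-compressed binary drug features + multi-label SVM.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_binary = rng.integers(0, 2, size=(200, 500))   # drugs x pathway/label indicators
Y = rng.integers(0, 2, size=(200, 6))            # multi-label disease classes

X = TruncatedSVD(n_components=50, random_state=0).fit_transform(X_binary)
clf = OneVsRestClassifier(SVC(kernel="rbf", probability=True)).fit(X, Y)
print(clf.predict(X[:3]))                        # predicted class sets for 3 drugs
```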
DeepPTM: Protein Post-translational Modification Prediction from Protein Sequences by Combining Deep Protein Language Model with Vision Transformers
Introduction: Recent self-supervised deep language models, such as Bidirectional Encoder Representations from Transformers (BERT), have performed best on some language tasks by contextualizing word embeddings for a better dynamic representation. Their protein-specific versions, such as ProtBERT, generate dynamic protein sequence embeddings, which have resulted in better performance on several bioinformatics tasks. In addition, a number of different protein post-translational modifications are prominent in cellular tasks such as development and differentiation. Current biological experiments can detect these modifications, but only over a long duration and at significant cost. Methods: In this paper, to comprehend the accompanying biological processes concisely and more rapidly, we propose DEEPPTM to predict protein post-translational modification (PTM) sites from protein sequences more efficiently. Unlike current methods, DEEPPTM enhances modification prediction performance by integrating specialized ProtBERT-based protein embeddings with attention-based vision transformers (ViT), and it reveals the associations between different modification types and protein sequence content. Additionally, it can infer several different modifications across different species. Results: Human and mouse ROC AUCs for predicting succinylation modifications were 0.793 and 0.661, respectively, under 10-fold cross-validation. Similarly, we obtained ROC AUC scores of 0.776, 0.764, and 0.734 for inferring ubiquitination, crotonylation, and glycation sites, respectively. According to detailed computational experiments, DEEPPTM reduces the time spent in laboratory experiments while outperforming competing methods as well as baselines on inferring all four modification sites. In our experiments, attention-based deep learning methods such as vision transformers appear better suited to learning from ProtBERT features than more traditional deep learning and machine learning techniques. Conclusion: The protein-specific ProtBERT model is also more effective than the original BERT embeddings for PTM prediction tasks. Our code and datasets can be found at https://github.com/seferlab/deepptm.
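As a minimal sketch of the embedding step only (not DEEPPTM itself), the snippet below extracts per-residue ProtBERT embeddings, assuming the public Rostlab/prot_bert checkpoint is available; such embeddings would then feed a downstream ViT-style PTM-site classifier.

```python
# Extract per-residue ProtBERT embeddings for one protein sequence.
import re
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert").eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
spaced = " ".join(re.sub(r"[UZOB]", "X", sequence))   # ProtBERT expects space-separated residues
inputs = tokenizer(spaced, return_tensors="pt")
with torch.no_grad():
    residue_embeddings = model(**inputs).last_hidden_state  # (1, seq_len + 2, 1024)
print(residue_embeddings.shape)
```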
FMDVSerPred: A Novel Computational Solution for Foot-and-mouth Disease Virus Classification and Serotype Prediction Prevalent in Asia Using VP1 Nucleotide Sequence Data
Background: Three serotypes of the Foot-and-mouth disease (FMD) virus have been circulating in Asia, and they are commonly identified by serological assays. Such tests are time-consuming and require a bio-containment facility for execution. To the best of our knowledge, no computational solution is available in the literature to predict FMD virus serotypes, which makes user-friendly tools for FMD virus serotyping an urgent need. Methods: We present a computational solution based on a machine-learning model for FMD virus classification and serotype prediction. In addition, various data pre-processing techniques are implemented in the approach for better model prediction. We used sequence data of 2,509 FMD virus isolates reported from India and seven other Asian FMD-endemic countries for model training, testing, and validation. We also studied the utility of the developed computational solution in a wet-lab setup by collecting and sequencing 12 virus isolates reported in India. The computational solution is implemented in two user-friendly tools, i.e., an online web-prediction server (https://nifmd-bbf.icar.gov.in/FMDVSerPred) and an R statistical software package (https://github.com/sam-dfmd/FMDVSerPred). Results: The random forest machine learning model is implemented in the computational solution, as it outperformed seven other machine learning models when evaluated on ten test and independent datasets. Furthermore, the developed computational solution provided validation accuracies of up to 99.87% on test data, and up to 98.64% and 90.24% on independent data reported from India and its seven neighboring Asian countries, respectively. In addition, our approach was successfully used to predict serotypes of field FMD virus isolates reported from various parts of India. Conclusion: High-throughput sequencing combined with machine learning offers a promising solution to FMD virus serotyping.
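For illustration only (this is not FMDVSerPred), a common way to turn nucleotide sequences into machine-learning features is k-mer frequency counting followed by a random forest, as sketched below with toy placeholder sequences and serotype labels.

```python
# Toy VP1-style sketch: 3-mer frequency features + random forest serotype classifier.
from itertools import product
import numpy as np
from sklearn.ensemble import RandomForestClassifier

KMERS = ["".join(p) for p in product("ACGT", repeat=3)]

def kmer_features(seq: str, k: int = 3) -> np.ndarray:
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(seq) - k + 1):          # overlapping k-mer counts
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
    vec = np.array([counts[km] for km in KMERS], dtype=float)
    return vec / max(vec.sum(), 1.0)           # normalized frequencies

sequences = ["ATGGCGTACGTTAGC" * 5, "ATGGGGTTTCCCAAA" * 5, "ATGCCCGGGTTTAAA" * 5]
labels = ["O", "A", "Asia1"]                   # serotypes as class labels
X = np.vstack([kmer_features(s) for s in sequences])
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, labels)
print(clf.predict(X))
```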
Optimized Hybrid Deep Learning for Real-Time Pandemic Data Forecasting: Long and Short-Term Perspectives
Background: With new variants of COVID-19 causing challenges, we need to focus on integrating multiple deep-learning frameworks to develop intelligent healthcare systems for early detection and diagnosis. Objective: This article proposes three hybrid deep learning models, namely CNN-LSTM, CNN-Bi-LSTM, and CNN-GRU, to address the pressing need for an intelligent healthcare system. These models are designed to capture spatial and temporal patterns in COVID-19 data, thereby improving the accuracy and timeliness of predictions. An output forecasting framework integrates these models, and an optimization algorithm automatically selects the hyperparameters for the 13 baselines and the three proposed hybrid models. Methods: Real-time time series data from the five most affected countries were used to test the effectiveness of the proposed models. Baseline models were compared, and optimization algorithms were employed to improve forecasting capabilities. Results: CNN-GRU and CNN-LSTM are the top short- and long-term forecasting models. CNN-GRU had the best performance, with the lowest SMAPE and MAPE values for long-term forecasting in India at 3.07% and 3.17%, respectively, and impressive results for short-term forecasting, with SMAPE and MAPE values of 1.46% and 1.47%. Conclusion: Hybrid deep learning models like CNN-GRU can aid in early COVID-19 assessment and diagnosis. They detect patterns in data that support effective governmental strategies and forecasting, helping to manage and mitigate the pandemic faster and more accurately.
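A minimal CNN-GRU sketch in the spirit of these hybrid models is shown below; the layer sizes, window length, and synthetic series are illustrative stand-ins, not the tuned hyperparameters or real case-count data from the study.

```python
# Illustrative CNN-GRU forecaster: local patterns via Conv1D, temporal memory via GRU.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

WINDOW = 14                                    # past days used to forecast the next day
model = keras.Sequential([
    layers.Input(shape=(WINDOW, 1)),
    layers.Conv1D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(2),
    layers.GRU(32),
    layers.Dense(1),                           # next-day value
])
model.compile(optimizer="adam", loss="mae")

# toy training data: sliding windows over a synthetic cumulative series
series = np.cumsum(np.random.rand(200)).astype("float32")
X = np.stack([series[i:i + WINDOW] for i in range(len(series) - WINDOW)])[..., None]
y = series[WINDOW:]
model.fit(X, y, epochs=2, verbose=0)
print(model.predict(X[:1], verbose=0))
```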
Inferring Gene Regulatory Networks from Single-Cell Time-Course Data Based on Temporal Convolutional Networks
Background: Time-course single-cell RNA sequencing (scRNA-seq) data capture dynamic gene expression values that change over time, which can be used to infer causal relationships between genes and construct dynamic gene regulatory networks (GRNs). However, most existing methods are designed for bulk RNA sequencing (bulk RNA-seq) data and static scRNA-seq data, and only a few methods, such as CNNC and DeepDRIM, can be directly applied to time-course scRNA-seq data. Objective: This work aims to infer causal relationships between genes and construct dynamic gene regulatory networks using time-course scRNA-seq data. Methods: We propose scTGRN, an analytical method for inferring GRNs from single-cell time-course data based on temporal convolutional networks, which provides a supervised learning approach to infer causal relationships among genes. scTGRN constructs a 4D tensor representing gene expression features for each gene pair and then inputs this tensor into a temporal convolutional network to train and infer the causal relationship between genes. Results: We validate the performance of scTGRN on five real datasets and four simulated datasets, and the experimental results show that scTGRN outperforms existing models in constructing GRNs. In addition, we test scTGRN on gene function assignment, where it also outperforms other models. Conclusion: The analysis shows that scTGRN not only accurately identifies causal relationships between genes but can also be used for gene function assignment.
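The following is a deliberately simplified sketch, not scTGRN: a small temporal convolutional classifier that takes a gene pair's expression time courses as two input channels and outputs a logit for whether a regulatory edge exists. scTGRN itself builds a richer 4D feature tensor per gene pair, which is omitted here.

```python
# Simplified gene-pair TCN classifier with dilated convolutions.
import torch
import torch.nn as nn

class PairTCN(nn.Module):
    def __init__(self, channels: int = 16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(2, channels, kernel_size=3, padding=2, dilation=1),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=4, dilation=2),
            nn.ReLU(),
        )
        self.head = nn.Linear(channels, 1)      # logit for "gene A regulates gene B"

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.conv(x)[..., : x.shape[-1]]    # crop the padded tail to keep outputs causal
        return self.head(h.mean(dim=-1))        # average over time, then classify

pair = torch.randn(8, 2, 30)                    # batch of 8 gene pairs, 30 time points
print(PairTCN()(pair).shape)                    # torch.Size([8, 1])
```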
A Novel Natural Graph for Efficient Clustering of Virus Genome Sequences
Background: This study addresses the need to analyze viral genome sequences and understand their genetic relationships, focusing on a novel natural graph approach as a solution. Objective: The objective of this study is to demonstrate the effectiveness and advantages of the proposed natural graph approach in clustering viral genome sequences into distinct clades, subtypes, or districts. Additionally, the aim is to explore its interpretability, potential applications, and implications for pandemic control and public health interventions. Methods: The study applies the proposed natural graph algorithm to cluster viral genome sequences. The results are compared with existing methods and multidimensional scaling to evaluate the performance and effectiveness of the approach. Results: The natural graph approach successfully clusters viral genome sequences, providing valuable insights into viral evolution and transmission dynamics. The ability to generate directed connections between nodes enhances the interpretability of the results, facilitating the investigation of transmission pathways and viral fitness. Conclusion: The findings highlight the potential applications of the natural graph algorithm in pandemic control, transmission tracing, and vaccine design. Future research directions may involve scaling the analysis up to larger datasets and incorporating additional genetic features for improved resolution. The natural graph approach presents a promising tool for viral genomics research, with implications for public health interventions.
Identification of Spatial Domains, Spatially Variable Genes, and Genetic Association Studies of Alzheimer Disease with an Autoencoder-based Fuzzy Clustering Algorithm
Introduction: Transcriptional gene expression and its corresponding spatial information are critical for understanding biological function, mutual regulation, and the identification of various cell types. Materials and Methods: Recently, several computational methods have been proposed for clustering spatial transcriptional expression. Although these algorithms have certain practicability, they cannot utilize spatial information effectively and are highly sensitive to noise and outliers. In this study, we propose ACSpot, an autoencoder-based fuzzy clustering algorithm, as a solution to these problems. Specifically, we employ a self-supervised autoencoder to reduce feature dimensionality, mitigate nonlinear noise, and learn high-quality representations. Additionally, a commonly used clustering method, fuzzy c-means, is used to achieve improved clustering results. In particular, we utilize spatial neighbor information to optimize the clustering process and to fine-tune each spot to its associated cluster category using probabilistic and statistical methods. Results and Discussion: A comparative analysis on the 10x Visium human dorsolateral prefrontal cortex (DLPFC) dataset demonstrates that ACSpot outperforms other clustering algorithms. Subsequently, spatially variable genes were identified based on the clustering outcomes, revealing a striking similarity between their spatial distribution and the sub-cluster spatial distribution from the clustering results. Notably, these spatially variable genes include APP, PSEN1, APOE, SORL1, BIN1, and PICALM, all of which are well-known Alzheimer's disease-associated genes. Conclusion: In addition, we applied our model to explore potential Alzheimer's disease-correlated genes within the dataset and performed Gene Ontology (GO) enrichment and gene-pathway analyses for validation, illustrating the capability of our model to pinpoint genes linked to Alzheimer's disease.
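The fuzzy clustering step can be illustrated with a compact sketch (not ACSpot): soft membership updates on spot embeddings that an upstream autoencoder would produce. The spatial-neighbor fine-tuning described in the paper is omitted, and the embeddings below are random stand-ins.

```python
# Hand-rolled fuzzy c-means on autoencoder-style spot embeddings.
import numpy as np

def fuzzy_cmeans(X, n_clusters=3, m=2.0, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(n_clusters), size=len(X))       # soft memberships per spot
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]          # membership-weighted centroids
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-9
        U = 1.0 / (d ** (2.0 / (m - 1.0)))                    # closer centers get higher membership
        U /= U.sum(axis=1, keepdims=True)                     # renormalize each row
    return U, centers

embeddings = np.random.rand(100, 8)    # stand-in for autoencoder latent features
U, centers = fuzzy_cmeans(embeddings)
print(U.argmax(axis=1)[:10])           # hard cluster labels from fuzzy memberships
```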
Identification of Mitophagy-Related Genes in Sepsis
Background: Numerous studies have shown that mitochondrial damage induces inflammation and activates inflammatory cells, leading to sepsis, while sepsis, a systemic inflammatory response syndrome, in turn exacerbates mitochondrial damage and hyperactivation. Mitochondrial autophagy (mitophagy) eliminates aged, abnormal, or damaged mitochondria to reduce intracellular mitochondrial stress and the release of mitochondria-associated molecules, thereby reducing the inflammatory response and cellular damage caused by sepsis. Mitochondrial autophagy may also influence the onset and progression of sepsis, but the exact mechanisms are unclear. Methods: In this study, we mined publicly available microarray data in the GEO database (https://www.ncbi.nlm.nih.gov/geo/) with the aim of identifying key genes associated with mitochondrial autophagy in sepsis. Results: We identified four mitophagy-related genes in sepsis: TOMM20, TOMM22, TOMM40, and MFN1. Conclusion: This study provides preliminary evidence for the treatment of sepsis and may provide a solid foundation for subsequent biological studies.
Network Subgraph-based Method: Alignment-free Technique for Molecular Network Analysis
Background: Comparing directed networks using an alignment-free technique offers the advantage of detecting topologically similar regions independently of network size or node identity. Objective: We propose a novel method to compare directed networks by decomposing the network into small modules, the so-called network subgraph approach, which is distinct from the network motif approach because it does not depend on null model assumptions. Methods: We developed an alignment-free algorithm called the Subgraph Identification Algorithm (SIA), which generates all subgraphs that have five connected nodes (5-node subgraphs); there are 9,364 such modules. We then applied SIA to 17 cancer networks and measured the similarity between pairs of networks using Jensen-Shannon entropy (HJS). Results: We identified and examined the biological meaning of 5-node regulatory modules and pairs of cancer networks with the smallest HJS values. The two pairs of networks that show similar patterns are (i) endometrial cancer and hepatocellular carcinoma and (ii) breast cancer and pathways in cancer. Some studies have provided experimental data supporting these 5-node regulatory modules. Conclusion: Our method is an alignment-free approach that measures the topological similarity of 5-node regulatory modules and compares two directed networks based on their topology. These modules capture complex interactions among multiple genes that cannot be detected using existing methods that only consider single-gene relations. We analyzed the biological relevance of the regulatory modules, used the subgraph method to identify modules sharing the same topology across pairs of the 17 cancer networks, and validated our findings using evidence from the literature.
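An illustrative sketch of the similarity step only: two networks are compared via the Jensen-Shannon divergence between their subgraph-frequency distributions. The 9,364-module counts produced by SIA are replaced here by small toy vectors.

```python
# Compare two networks by the JS divergence of their subgraph-type frequency profiles.
import numpy as np
from scipy.spatial.distance import jensenshannon

counts_net_a = np.array([40, 10, 5, 0, 25], dtype=float)   # toy subgraph counts, network A
counts_net_b = np.array([35, 12, 8, 3, 20], dtype=float)   # toy subgraph counts, network B

p = counts_net_a / counts_net_a.sum()
q = counts_net_b / counts_net_b.sum()
js_divergence = jensenshannon(p, q, base=2) ** 2            # SciPy returns the JS distance (sqrt)
print(round(float(js_divergence), 4))                       # smaller value => more similar topology
```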
Transformer-based Named Entity Recognition for Clinical Cancer Drug Toxicity by Positive-unlabeled Learning and KL Regularizers
Background: With increasing rates of polypharmacy, the vigilant surveillance of clinical drug toxicity has emerged as an important concern. Named Entity Recognition (NER) is an indispensable task for extracting valuable insights regarding drug safety from the biomedical literature. In recent years, significant advancements have been achieved with deep learning models on NER tasks. Nonetheless, the effectiveness of these NER techniques relies on the availability of substantial volumes of annotated data, which is labor-intensive and inefficient to produce. Methods: This study introduces a novel approach that diverges from the conventional reliance on manually annotated data. It employs a transformer-based technique known as Positive-Unlabeled Learning (PULearning), which incorporates adaptive learning and is applied to a clinical cancer drug toxicity corpus. To improve the precision of prediction, we employ relative position embeddings within the transformer encoder. Additionally, we formulate a composite loss function that integrates two Kullback-Leibler (KL) regularizers to align with the PULearning assumptions. The outcomes demonstrate that our approach attains the targeted performance on NER tasks relying solely on unlabeled data and named entity dictionaries. Conclusion: Our model achieves an overall NER performance with an F1 of 0.819. Specifically, it attains F1 scores of 0.841, 0.801, and 0.815 for DRUG, CANCER, and TOXI entities, respectively. A comprehensive analysis of the results validates the effectiveness of our approach in comparison to existing PULearning methods on biomedical NER tasks. Additionally, a visualization of the associations among the three identified entity types is provided, offering a valuable reference for querying their interrelationships.
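To make the composite-loss idea concrete, here is a schematic sketch, not the paper's exact formulation: a token-level non-negative positive-unlabeled risk term plus a single KL regularizer that keeps the average predicted label distribution near an expected prior. The class prior `pi_pos`, the tag prior, and the masks are all hypothetical illustrative values.

```python
# Schematic PU risk + KL-prior regularizer for token tagging (illustration only).
import torch
import torch.nn.functional as F

def pu_kl_loss(logits, pos_mask, unlabeled_mask, pi_pos=0.1, prior=None, kl_weight=0.1):
    probs = logits.softmax(dim=-1)                       # (n_tokens, n_tags)
    p_entity = 1.0 - probs[:, 0]                         # tag 0 = "O" (non-entity)

    # non-negative PU risk: unlabeled tokens are a mix of positives and negatives
    risk_pos = -(p_entity[pos_mask] + 1e-9).log().mean()
    risk_neg = -(1 - p_entity[unlabeled_mask] + 1e-9).log().mean() \
               - pi_pos * (-(1 - p_entity[pos_mask] + 1e-9).log().mean())
    pu_risk = pi_pos * risk_pos + torch.clamp(risk_neg, min=0.0)

    # KL regularizer pulling the average prediction toward a label prior
    mean_pred = probs.mean(dim=0)
    kl = F.kl_div(mean_pred.log(), prior, reduction="sum")
    return pu_risk + kl_weight * kl

logits = torch.randn(50, 4)                              # 50 tokens, 4 tag classes
pos = torch.zeros(50, dtype=torch.bool); pos[:5] = True  # dictionary-matched entity tokens
prior = torch.tensor([0.85, 0.05, 0.05, 0.05])
print(pu_kl_loss(logits, pos, ~pos, prior=prior))
```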
Metabolomics: Recent Advances and Future Prospects Unveiled
In the era of genomics, fueled by advanced technologies and analytical tools, metabolomics has become a vital component of biomedical research. Its significance spans various domains, encompassing biomarker identification, the uncovering of underlying mechanisms and pathways, and the exploration of new drug targets and precision medicine. This article presents a comprehensive overview of the latest developments in metabolomics techniques, emphasizing their wide-ranging applications across diverse research fields and underscoring their immense potential for future advancements.
Stacking-Kcr: A Stacking Model for Predicting the Crotonylation Sites of Lysine by Fusing Serial and Automatic Encoder
Background: Protein lysine crotonylation (Kcr), a newly discovered and important post-translational modification (PTM), is typically localized at the transcription start site and regulates gene expression, and it is associated with a variety of pathological conditions such as developmental defects and malignant transformation. Objective: Identifying Kcr sites is advantageous for discovering its biological mechanism and developing new drugs for related diseases. However, traditional experimental methods for identifying Kcr sites are expensive and inefficient, necessitating the development of new computational techniques. Methods: In this work, to accurately identify Kcr sites, we propose an ensemble learning model called Stacking-Kcr. First, features are extracted from sequence information, physicochemical properties, and sequence fragment similarity. The sequence information and physicochemical property features are then fused using an autoencoder and serial concatenation, respectively. Finally, the two fused feature sets and the sequence fragment similarity features are input into four base classifiers, a meta-classifier is constructed from the first-level prediction results, and the final predictions are obtained. Results: The five-fold cross-validation of this model achieved an accuracy of 0.828 and an AUC of 0.910, showing that Stacking-Kcr has clear advantages over traditional machine learning methods. On independent test sets, Stacking-Kcr achieved an accuracy of 84.89% and an AUC of 92.21%, which are 1.7% and 0.8% higher, respectively, than those of other state-of-the-art tools. Additionally, we trained Stacking-Kcr on phosphorylation sites, and the results are superior to current models. Conclusion: These outcomes are additional evidence that Stacking-Kcr has strong application potential and generalization performance.
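A generic stacking sketch in the spirit of this architecture is shown below with toy features and labels; the actual model first fuses sequence, physicochemical, and similarity features, and its base classifiers may differ from those chosen here.

```python
# Generic stacking ensemble: four base classifiers + a logistic-regression meta-classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 40))            # toy fused peptide-window features
y = rng.integers(0, 2, size=300)          # 1 = crotonylation (Kcr) site

base = [
    ("svm", SVC(probability=True)),
    ("rf", RandomForestClassifier(n_estimators=100)),
    ("knn", KNeighborsClassifier()),
    ("lr", LogisticRegression(max_iter=1000)),
]
stack = StackingClassifier(estimators=base, final_estimator=LogisticRegression(),
                           stack_method="predict_proba", cv=5)
stack.fit(X, y)
print(stack.predict_proba(X[:3])[:, 1])   # predicted Kcr-site probabilities
```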
Discovering Microbe-disease Associations with Weighted Graph Convolution Networks and Taxonomy Common Tree
Background: Microbe-disease associations are integral to understanding complex diseases and their screening procedures. Objective: While numerous computational methods have been developed to detect these associations, their performance remains limited due to inadequate utilization of weighted inherent similarities and the microbial taxonomy hierarchy. To address this limitation, we introduce WTHMDA (weighted taxonomic heterogeneous network-based microbe-disease association), a novel deep learning framework. Methods: WTHMDA combines a weighted graph convolution network and the microbial taxonomy common tree to predict microbe-disease associations effectively. The framework extracts multiple microbe similarities from the taxonomy common tree, facilitating the construction of a microbe-disease heterogeneous interaction network. Using a weighted DeepWalk algorithm, node embeddings in the network incorporate weight information from these similarities. Subsequently, a deep neural network (DNN) model predicts microbe-disease associations based on this interaction network. Results: Extensive experiments on multiple datasets and case studies demonstrate WTHMDA's superiority over existing approaches, particularly in predicting unknown associations. Conclusion: Our proposed method offers a new strategy for discovering microbe-disease linkages, showcasing remarkable performance and enhancing the feasibility of identifying disease risk.
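The weighted-walk idea behind weighted DeepWalk can be sketched as follows (this is not the WTHMDA code): transition probabilities follow edge weights derived from the similarities, and the resulting walks would then be fed to a skip-gram model to obtain node embeddings. Node names and weights are toy placeholders.

```python
# Weighted random walk on a similarity-weighted heterogeneous graph.
import random
import networkx as nx

def weighted_walk(G: nx.Graph, start, length: int = 10, seed: int = 0):
    rng = random.Random(seed)
    walk, node = [start], start
    for _ in range(length - 1):
        nbrs = list(G.neighbors(node))
        if not nbrs:
            break
        weights = [G[node][n].get("weight", 1.0) for n in nbrs]   # heavier edges favored
        node = rng.choices(nbrs, weights=weights, k=1)[0]
        walk.append(node)
    return walk

G = nx.Graph()
G.add_weighted_edges_from([("microbe1", "disease1", 0.9),
                           ("microbe1", "microbe2", 0.4),
                           ("microbe2", "disease1", 0.7)])
print(weighted_walk(G, "microbe1"))
```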
Prediction of Super-enhancers Based on Mean-shift Undersampling
Background: Super-enhancers are clusters of enhancers defined by the binding occupancy of master transcription factors, chromatin regulators, or chromatin marks. It has been reported that super-enhancers are transcriptionally more active and more cell-type-specific than regular enhancers; it is therefore necessary to distinguish super-enhancers from regular enhancers. A variety of computational methods have been proposed as auxiliary tools to identify super-enhancers. However, most methods rely on ChIP-seq data, and when such data are unavailable, the predictors either cannot run or fail to achieve satisfactory performance. Objective: The aim of this study is to propose a stacking computational model based on the fusion of multiple features to identify super-enhancers in both human and mouse. Methods: This work adopts mean-shift to cluster majority-class samples and selects four balanced datasets for mouse and three balanced datasets for human to train the stacking model. Five types of sequence information are used as input to the XGBoost classifiers, and the average of the probability outputs from each classifier is taken as the final classification result. Results: The results of 10-fold cross-validation and cross-cell-line validation show that our method has superior performance compared to other existing methods. The source code and datasets are available at https://github.com/Cheng-Han-max/SE_voting. Conclusion: The analysis of feature importance indicates that the Mismatch feature accounts for the highest proportion among the top 20 important features.
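A schematic of the undersampling-plus-averaging idea is sketched below (this is not the released code): the majority class is clustered with mean-shift, several balanced sets are drawn across clusters, one XGBoost model is trained per set, and the predicted probabilities are averaged. Features here are toy placeholders, and the per-cluster sampling rule is a simplification.

```python
# Mean-shift-guided undersampling with an averaged XGBoost ensemble (illustration only).
import numpy as np
from sklearn.cluster import MeanShift
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X_pos = rng.normal(1.0, 1.0, size=(60, 10))       # super-enhancers (minority class)
X_neg = rng.normal(0.0, 1.0, size=(600, 10))      # regular enhancers (majority class)

labels = MeanShift().fit_predict(X_neg)           # cluster the majority class
clusters = [np.where(labels == l)[0] for l in np.unique(labels)]

models = []
for seed in range(3):                             # three balanced training sets
    rng_set = np.random.default_rng(seed)
    quota = max(1, len(X_pos) // len(clusters))
    idx = np.concatenate([rng_set.choice(c, size=min(quota, len(c)), replace=False)
                          for c in clusters])[: len(X_pos)]
    X_bal = np.vstack([X_pos, X_neg[idx]])
    y_bal = np.array([1] * len(X_pos) + [0] * len(idx))
    models.append(XGBClassifier(n_estimators=100).fit(X_bal, y_bal))

X_test = rng.normal(0.5, 1.0, size=(5, 10))
avg_prob = np.mean([m.predict_proba(X_test)[:, 1] for m in models], axis=0)
print(avg_prob)                                   # averaged super-enhancer probabilities
```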