Current Bioinformatics - Volume 15, Issue 10, 2020
Salient Features, Data and Algorithms for MicroRNA Screening from Plants: A Review on the Gains and Pitfalls of Machine Learning Techniques
Authors: Garima Ayachit, Inayatullah Shaikh, Himanshu Pandya and Jayashankar Das
The era of big data and high-throughput genomic technology has given scientists a clear view of plant genomic profiles. However, it has also created a massive need for computational tools and strategies to interpret these data. Amid this data inflow, machine learning (ML) approaches are emerging as the most promising way to analyse heterogeneous and unstructured biological datasets. Beyond their applications in healthcare and agriculture, ML approaches are also proving useful for microRNA (miRNA) screening. Identification of miRNAs is a crucial step towards understanding post-transcriptional gene regulation and miRNA-related pathology. ML tools are becoming indispensable for analysing such data and identifying species-specific, non-conserved miRNAs. However, these techniques have their own benefits and shortcomings. In this review, we discuss the current state and pitfalls of ML-based tools for plant miRNA identification and offer insights into the important features, the need for deep learning models, and the directions in which further studies are needed.
Comparisons of MicroRNA Set Enrichment Analysis Tools on Cancer De-regulated miRNAs from TCGA Expression Datasets
Authors: Jianwei Li, Leibo Liu, Qinghua Cui and Yuan Zhou
Background: De-regulation of microRNAs (miRNAs) is closely related to many complex diseases, including cancers. In The Cancer Genome Atlas (TCGA), hundreds of differentially expressed miRNAs are stored for each type of cancer, and these are hard to interpret intuitively. To date, several miRNA set enrichment tools have been tailored to predict the potential disease associations and functions of de-regulated miRNAs, including the miRNA Enrichment Analysis and Annotation tool (miEAA) and the Tool for Annotations of human MiRNAs (TAM 1.0 and TAM 2.0). However, independent benchmarking of these tools is warranted to assess their effectiveness and robustness, as well as the relationship between enrichment analysis results and the prognostic significance of cancers. Methods: Based on differentially expressed miRNAs from expression profiles in TCGA, we performed a series of tests and a comprehensive comparison of the enrichment analysis results of miEAA, TAM 1.0 and TAM 2.0. The work focused on the performance of the three tools, disease similarity based on miRNA-disease associations from the enrichment analysis results, and the relationship between the overrepresented miRNAs from enrichment analysis results and the prognostic significance of cancers. Results: The main results show that TAM 2.0 is more likely to identify the disease-related regulatory functions of de-regulated miRNAs; that it is feasible to calculate disease similarity based on the enrichment analysis results of TAM 2.0; and that there is a weak positive correlation between the occurrence frequency of miRNAs in the TAM 2.0 enrichment analysis results and the prognostic significance of the cancer miRNAs.
Conclusion: Our comparison results not only provide a reference for biomedical researchers choosing an appropriate miRNA set enrichment analysis tool but also demonstrate that the degree of overrepresentation of miRNAs could be a supplementary indicator of disease similarity and of the prognostic effect of cancer miRNAs.
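As a rough illustration of what set enrichment (over-representation) analysis computes, the sketch below uses a one-sided hypergeometric test, a standard choice for this kind of analysis. It is not taken from miEAA or TAM; the function name and parameters are hypothetical, and the tools themselves may use different statistics or corrections.

```python
from math import comb


def hypergeom_enrichment_p(population: int, category: int, query: int, overlap: int) -> float:
    """One-sided over-representation p-value, P(X >= overlap).

    population: number of miRNAs in the background set (assumed parameter)
    category:   number of background miRNAs annotated to the disease/function
    query:      number of de-regulated miRNAs submitted for analysis
    overlap:    number of submitted miRNAs that carry the annotation
    """
    upper = min(category, query)
    total = comb(population, query)
    # Sum the hypergeometric probabilities for every overlap at least as large
    # as the observed one.
    tail = sum(
        comb(category, k) * comb(population - category, query - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total
```

A small p-value suggests that the annotation occurs among the submitted miRNAs more often than chance would predict; in practice a multiple-testing correction would normally follow.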
A Simple Protein Evolutionary Classification Method Based on the Mutual Relations Between Protein Sequences
Authors: Xiaogeng Wan and Xinying Tan
Background: Proteins are important organic molecules in living systems and vary in sequence, structure and function. Protein evolutionary classification is a popular research topic in computational bioinformatics, and many studies have used protein sequence information to classify the evolutionary relationships of proteins. As the amount of protein sequence data increases, efficient computational tools are needed to produce accurate protein evolutionary classifications in the big data paradigm. Methods: In this study, we propose a new, simple and efficient computational approach based on normalized mutual information rates to compute the relationship between protein sequences. We then use the "distances" defined on these relationships to perform the evolutionary classification of proteins. The new method is computationally efficient, model-free and unsupervised, so it does not require training data when performing classifications. Results: Simulation studies on various examples demonstrate the efficiency of the new method. We use precision-recall curves to compare the new method with traditional methods; the results show that the new method outperforms the traditional methods in most cases when performing evolutionary classifications. Conclusion: The new method is simple and proved to be efficient in protein evolutionary classification, which makes it useful for future evolutionary analysis, particularly in the big data paradigm.
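To make the idea of a mutual-information-based "distance" concrete, here is a minimal sketch. The abstract does not define the normalized mutual information rate, so this is an assumption rather than the authors' formula: it computes plain normalized mutual information between two equal-length (for example, pre-aligned) sequences, position by position, and turns it into a distance as 1 minus that value. All function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(counts: Counter, n: int) -> float:
    """Shannon entropy (in bits) of an empirical distribution of n observations."""
    return -sum((c / n) * log2(c / n) for c in counts.values())


def normalized_mutual_information(seq_a: str, seq_b: str) -> float:
    """NMI = 2 * I(A;B) / (H(A) + H(B)), estimated from paired positions.

    Assumes the two sequences have equal length (e.g. already aligned).
    """
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(seq_a)
    h_a = entropy(Counter(seq_a), n)
    h_b = entropy(Counter(seq_b), n)
    h_ab = entropy(Counter(zip(seq_a, seq_b)), n)
    mutual_information = h_a + h_b - h_ab
    denominator = h_a + h_b
    # Convention chosen here: two constant sequences carry no information,
    # so their NMI is reported as 0.
    return 2 * mutual_information / denominator if denominator > 0 else 0.0


def nmi_distance(seq_a: str, seq_b: str) -> float:
    """Distance in [0, 1]: 0 for perfectly dependent sequences, 1 for independent ones."""
    return 1.0 - normalized_mutual_information(seq_a, seq_b)
```

A matrix of such pairwise distances could then be passed to any unsupervised clustering routine, which matches the training-free character described in the abstract.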
Classification of Chromosomal DNA Sequences Using Hybrid Deep Learning Architectures
Authors: Zhihua Du, Xiangdong Xiao and Vladimir N. Uversky
Background: Chromosomal DNA contains most of the genetic information of eukaryotes and plays an important role in the growth, development and reproduction of living organisms. Most chromosomal DNA sequences are known to wrap around histones, and distinguishing these DNA sequences from ordinary DNA sequences is important for understanding the genetic code of life. The main difficulty behind this problem is the feature selection process. DNA sequences have no explicit features, and common representation methods, such as one-hot coding, have the major drawback of high dimensionality. Recently, deep learning models have been shown to extract useful features from input patterns automatically. Objective: We aim to investigate which deep learning networks can achieve notable improvements in DNA sequence classification using only sequence information. Methods: In this paper, we present four different deep learning architectures using convolutional neural networks and long short-term memory networks for chromosomal DNA sequence classification. A natural language model (Word2vec) was used to generate word embeddings of the sequences, from which features were learned by deep learning. Results: The four architectures were compared on 10 chromosomal DNA datasets. The results show that the architecture combining convolutional neural networks with long short-term memory networks is superior to the other methods with regard to the accuracy of chromosomal DNA prediction. Conclusion: In this study, four deep learning models were compared for the automatic classification of chromosomal DNA sequences without any sequence preprocessing steps. In particular, we treated DNA sequences as natural language and extracted word embeddings with Word2vec to represent them. The results show the superiority of the CNN+LSTM model in the ten classification tasks.
The reason for this success is that the CNN module captures the regulatory motifs, while the following LSTM layer captures the long-term dependencies between them.
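Treating DNA as natural language requires splitting each sequence into "words" before an embedding model such as Word2vec can be trained. The abstract does not say how this was done; a common approach, assumed here purely for illustration, is to use overlapping k-mers. The function names, the value of `k` and the stride are hypothetical.

```python
def kmer_tokens(sequence: str, k: int = 3, stride: int = 1) -> list[str]:
    """Split a DNA sequence into k-mers ("words") taken every `stride` bases."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def build_vocabulary(token_lists: list[list[str]]) -> dict[str, int]:
    """Map every distinct k-mer to an integer index (sorted for reproducibility)."""
    vocabulary = sorted({token for tokens in token_lists for token in tokens})
    return {token: index for index, token in enumerate(vocabulary)}


# Example: kmer_tokens("ACGTAC", k=3) gives ["ACG", "CGT", "GTA", "TAC"].
```

The resulting token lists play the role of sentences for an embedding model, and the embedded sequence can then be fed to convolutional and recurrent layers.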
Robust Transcription Factor Binding Site Prediction Using Deep Neural Networks
Authors: Kanu Geete and Manish Pandey
Aim: To develop a robust and more accurate method for identifying transcription factor binding sites (TFBS) for gene expression. Background: Deep neural networks (DNNs) have shown promising growth in solving complex machine learning problems, and they have readily replaced conventional techniques in computer vision, signal processing, healthcare and genomics. Understanding DNA sequences is a crucial task in healthcare and regulatory genomics. For DNA motif prediction, choosing the right dataset with a sufficient number of input sequences is crucial for designing an effective model. Objective: To design a new algorithm that works on different datasets while improving performance for TFBS prediction. Methods: With the help of Layerwise Relevance Propagation, the proposed algorithm identifies invariant features with adaptive noise patterns. Results: Performance was compared with standard and recent methods on various metrics, and a significant improvement was noted. Conclusion: By identifying the invariant and robust features in DNA sequences, classification performance can be increased.
Detecting TYMS Tandem Repeat Polymorphism by the PSSD Method Based on Next-generation Sequencing
Authors: Binsheng He, Jialiang Yang, Geng Tian, Pingping Bing and Jidong Lang
Background: Thymidylate Synthase (TS) is an important target for folic acid inhibitors such as pemetrexed, which has considerable effects in first-line treatment, second-line treatment and maintenance therapy for patients with late-stage Non-Small Cell Lung Cancer (NSCLC). Therefore, detecting mutations in the TYMS gene, which encodes TS, is critical in clinical applications. With the development of Next-Generation Sequencing (NGS) technology, the accuracy of TYMS mutation detection keeps improving. However, traditional methods suffer from false positives and false negatives caused by factors such as limited sequencing read length and sequencing errors. Objective: A method was needed to overcome the short read length and sequencing errors of NGS and make the detection of TYMS mutations more accurate. Methods: In this study, we developed a novel method based on "Paired Seed Sequence Distance" (PSSD) to detect the Variable Number of Tandem Repeat (VNTR) mutation in TYMS. Results: In the 121 samples validated by Sanger sequencing, the consistency rate of the PSSD method was 85.95% (104/121), higher than that of the strict matching method (78.51%, 95/121). The consistency rate between the two methods was 89.26% (108/121). We also found that the PSSD method was significantly better than the strict matching method, especially in 4R typing. Conclusion: Our method not only improves the detection rate and accuracy of TYMS VNTR mutations but also avoids problems caused by sequencing errors and limited sequencing length. This method provides a new solution for similar polymorphism analyses and other sequencing analyses.
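The reported consistency rates follow directly from the sample counts given in the Results. The short sketch below only reproduces that arithmetic; the function name is hypothetical and nothing here reflects the PSSD algorithm itself.

```python
def consistency_rate(concordant: int, total: int) -> float:
    """Percentage of samples on which two calls agree, rounded to two decimals."""
    return round(100 * concordant / total, 2)


pssd_vs_sanger = consistency_rate(104, 121)    # 85.95
strict_vs_sanger = consistency_rate(95, 121)   # 78.51
pssd_vs_strict = consistency_rate(108, 121)    # 89.26
```

The difference between the two methods against Sanger sequencing corresponds to 9 additional concordant samples out of 121.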
Whole-exome Sequencing of Tumor-only Samples Reveals the Association between Somatic Alterations and Clinical Features in Pancreatic Cancer
Background: Identification of genomic markers using next-generation sequencing (NGS) technology would be valuable for guiding precision medicine treatments for pancreatic cancers. Traditional somatic mutation methods require both tumor and matched non-tumor samples. However, in most cases only tumor samples are available, especially in retrospective studies. In this study, we analyzed the associations between clinical features and oncogenic somatic mutations in genome-wide tumor-only samples. Methods: Fifty-four tumor-only samples derived from pancreatic cancer patients were used for whole-exome sequencing. An approach involving SNP filtering of variants included in the Catalogue of Somatic Mutations in Cancer (COSMIC) database was used to identify oncogenic somatic mutations. The relationships between oncogenic mutations and clinical features were analyzed and compared with those from the TCGA database. Results: By analyzing the mutations from tumor-only samples, divergent mutation profiles were observed in different locations (head vs. body/tail) of pancreatic tumors. The divergences between pancreatic head and body/tail cancers were also confirmed by the TCGA data. Furthermore, mutations of several genes were found to be significantly associated with clinical features, such as pathological stage and the degree of tumor differentiation. Conclusion: The results confirmed the efficiency of our approach in identifying oncogenic somatic mutations from tumor-only samples and revealed the associations between somatic mutations and clinical features in pancreatic cancer.
IsoDetect: Detection of Splice Isoforms from Third Generation Long Reads Based on Short Feature Sequences
Authors: Hong-Dong Li, Wenjing Zhang, Yuwen Luo and Jianxin Wang
Background: Transcriptome annotation is the basis for understanding gene structures and analysing gene expression. The transcriptome annotation of many organisms, including humans, is far from complete, due partly to the challenge of identifying isoforms that are produced from the same gene through alternative splicing. Third generation sequencing (TGS) reads provide an unprecedented opportunity for detecting isoforms because their length exceeds that of most isoforms. One limitation of current TGS read-based isoform detection methods is that they rely exclusively on sequence reads, without incorporating the sequence information of annotated isoforms. Objective: We aim to develop a method that detects isoforms by incorporating annotated isoforms. Methods: Based on annotated isoforms, we propose a splice isoform detection method called IsoDetect. First, the sequences at exon-exon junctions are extracted from annotated isoforms as "short feature sequences", which are used to distinguish splice isoforms. Second, we align these feature sequences to long reads and partition the long reads into groups that contain the same set of feature sequences, thereby avoiding pair-wise comparison among the large number of long reads. Third, clustering and consensus generation are carried out based on sequence similarity. For long reads that do not contain any short feature sequence, clustering analysis based on sequence similarity is performed to identify isoforms. Therefore, our method can detect not only known but also novel isoforms. Results: Tested on two datasets from Calypte anna and zebra finch, IsoDetect shows higher speed and good accuracy compared with four existing methods. Conclusion: IsoDetect may become a promising method for isoform detection.
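The second step of the Methods, grouping long reads by the set of feature sequences they contain, can be sketched as follows. This is a simplified illustration and not the IsoDetect implementation: it assumes exact substring matching in place of alignment (real long reads contain errors), and the function name and input layout are hypothetical.

```python
from collections import defaultdict


def partition_reads(
    reads: dict[str, str],
    feature_sequences: dict[str, str],
) -> dict[frozenset, list[str]]:
    """Group read IDs by the set of junction feature names found in each read.

    reads:             read ID -> read sequence
    feature_sequences: feature name (e.g. an exon-exon junction) -> short sequence
    """
    groups: dict[frozenset, list[str]] = defaultdict(list)
    for read_id, read_sequence in reads.items():
        # Exact matching stands in for the alignment step described in the abstract.
        contained = frozenset(
            name for name, feature in feature_sequences.items() if feature in read_sequence
        )
        groups[contained].append(read_id)
    return dict(groups)
```

Reads that share a key are candidates for the same isoform, so similarity-based clustering only needs to run within each group; reads keyed by the empty set correspond to the case handled separately as potential novel isoforms.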
Using Bioinformatics to Quantify the Variability and Diversity of the Microbial Community Structure in Pond Ecosystems of a Subtropical Catchment
Authors: Jiaogen Zhou, Yang Wang and Qiuliang Lei
Background: In rural China, many natural water bodies and farmlands have been converted into fish farming ponds as an economic development strategy. There is still a limited understanding of how the diversity and structure of microbial communities change when natural ponds become managed fish pond ecosystems. Objective: We aimed to identify the changes in the diversity and structure of the microbial community, and their driving mechanism, in pond ecosystems. Methods: 16S rRNA amplicon sequencing datasets and the concentrations of N and P fractions were obtained from water samples of pond ecosystems. Bioinformatics analysis was used to analyze the diversity and structure of the microbial communities. Results: Our results indicated that the diversity and structure of the microbial communities in the natural ponds were significantly different from those in managed fish ponds. The N and P nutrients and water environmental factors were responsible for 46.3% and 19.5% of the changes in the structure and diversity of the microbial community, respectively. Conclusion: The N and P fractions and water environmental factors influenced the microbial community structure and diversity in pond ecosystems. When a natural pond was converted into a managed fish pond, fish farming indirectly affected the microbial community by altering the contents of N and P fractions in the pond water.
Integrative Analysis of miRNA-mediated Competing Endogenous RNA Network Reveals the lncRNAs-mRNAs Interaction in Glioblastoma Stem Cell Differentiation
Authors: Zhenyu Zhao, Cheng Zhang, Mi Li, Xinguang Yu, Hailong Liu, Qi Chen, Jian Wang, Shaopin Shen and Jingjing Jiang
Background: Competing endogenous RNA (ceRNA) networks play a pivotal role in tumor diagnosis and progression. Numerous studies have explored the functional landscape and prognostic significance of ceRNA interactions within differentiated tumor cells. Objective: We propose a new perspective by exploring ceRNA networks in the process of glioblastoma stem cell (GSC) differentiation. Methods: In this study, expression profiles of lncRNAs and mRNAs were compared between GSCs and differentiated glioblastoma cells. Using a comprehensive computational method, miRNA-mediated and GSC differentiation-associated ceRNA crosstalk between lncRNAs and mRNAs was identified. A ceRNA network was then established to select potential candidates that regulate GSC differentiation. Results: Based on the specific ceRNA network related to GSC differentiation, we identified lnc MYOSLID: 11 as a ceRNA that regulated the expression of the downstream gene PXN by competitively binding with hsa-miR-149-3p. In Kaplan-Meier (KM) survival analysis, the expression of the PXN gene (P = 0.0015) and of lnc MYOSLID: 11 (P = 0.041) showed significant correlation with glioblastoma in 160 patients from TCGA. Conclusion: This result sheds light on a potential way of studying the ceRNA network, which can provide clues for developing new diagnostic methods and finding therapeutic targets for the clinical treatment of glioblastoma.
Identification of Most Relevant Features for Classification of Francisella tularensis using Machine Learning
Background: Francisella tularensis is a stealth pathogen that is fatal for animals and humans. Its ease of propagation, coupled with its high capacity to cause illness and death, makes it a potential candidate for a biological weapon. Objective: Work on the pathogen's classification and on the factors affecting its prolonged existence in soil has been limited to statistical measures. Machine learning, rather than conventional analysis methods, may be applied to improve epidemiological modeling for this soil-borne pathogen. Methods: The feature-ranking algorithms Relief, correlation and OneR are used for soil attribute ranking. The classification algorithms SVM, random forest, naive Bayes, logistic regression and multilayer perceptron (MLP) are then used to classify the soil attribute dataset into Francisella tularensis positive and negative soils. Results: The feature-ranking methods indicated that clay, nitrogen, organic matter, soluble salts, zinc, silt and nickel are the most significant attributes, while potassium, phosphorous, iron, calcium, copper, chromium and sand are the least contributing risk factors for the persistence of the pathogen. Overall, clay is the most significant attribute and potassium the least contributing one. Data analysis suggests that feature ranking using Relief produced classification accuracies of 84.35% for MLP, 82.99% for logistic regression, 80.27% for SVM and random forest, and 78.23% for naive Bayes, which is better than the other ranking methods. MLP outperforms the other classifiers, with accuracies of 84.35%, 82.99% and 81.63% for feature ranking using the Relief, correlation and OneR algorithms, respectively. Conclusion: These models can significantly improve accuracy and minimize the risk of incorrect classification. They can further help in controlling epidemics, thereby minimizing the socio-economic impact on society.
MRMD2.0: A Python Tool for Machine Learning with Feature Ranking and Reduction
Aims: The study aims to find a way to reduce the dimensionality of a dataset. Background: Dimensionality reduction is a key issue in the machine learning process. It not only improves prediction performance but can also recommend the intrinsic features and help to explore the biological meaning behind the machine learning "black box". Objective: A variety of feature selection algorithms are used to select data features and achieve dimensionality reduction. Methods: First, MRMD2.0 integrates 7 popular feature ranking algorithms using a PageRank strategy. Second, the optimized dimensionality is detected with a forward adding strategy. Result: We achieved good results in our experiments. Conclusion: Several works have been tested with MRMD2.0, and it performed well. It can also draw performance curves as a function of feature dimensionality. If users want to sacrifice accuracy for fewer features, they can select the dimensionality from the performance curves. Other: We developed user-friendly Python tools together with a web server. Users can upload their files in csv, arff or libsvm format, and the web server then helps to rank the features and find the optimized dimensionality.
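The forward adding strategy mentioned in the Methods can be sketched generically as follows. This is an assumption about the general technique, not MRMD2.0's code: features are taken in ranked order, added one at a time, and the prefix with the best score is kept. The function name and the `score_fn` callback (for example, a cross-validated accuracy) are hypothetical.

```python
from typing import Callable, Sequence


def select_dimensionality(
    ranked_features: Sequence[str],
    score_fn: Callable[[Sequence[str]], float],
) -> tuple[list[str], list[tuple[int, float]]]:
    """Return the best-scoring prefix of the ranking and the full performance curve.

    ranked_features: feature names ordered from most to least important
    score_fn:        user-supplied evaluation of a feature subset (higher is better)
    """
    best_k, best_score = 0, float("-inf")
    curve: list[tuple[int, float]] = []
    for k in range(1, len(ranked_features) + 1):
        score = score_fn(ranked_features[:k])
        curve.append((k, score))
        # Strict comparison keeps the smallest dimensionality among ties.
        if score > best_score:
            best_k, best_score = k, score
    return list(ranked_features[:best_k]), curve
```

The returned curve holds one (dimensionality, score) pair per step, which is the kind of information a user would inspect when trading some accuracy for fewer features.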
Genome-wide Identification of Differently Expressed lncRNAs, mRNAs, and circRNAs in Patients with Osteoarthritis
Authors: Yeqing Sun, Lei Chen, Yingqi Zhang, Jincheng Zhang and Shashi R. Tiwari
Background: Osteoarthritis (OA), one of the most important causes of joint disability, has been considered an untreatable disease. A range of RNAs, including microRNAs, long non-coding RNAs (lncRNAs) and circular RNAs (circRNAs), have been reported to regulate the pathogenesis of OA. So far, the expression profiles and functions of lncRNAs, mRNAs and circRNAs in OA are not fully understood. Objective: The present study aimed to identify differentially expressed genes in OA. Methods: RNA-seq was conducted to identify differentially expressed genes in OA. Gene Ontology (GO) analysis was used to analyze Molecular Function and Biological Process terms. KEGG pathway analysis was used to map the differentially expressed lncRNAs to biological pathways. Results: Hierarchical clustering revealed a total of 943 mRNAs, 518 lncRNAs and 300 circRNAs that were dysregulated in OA compared to normal samples. Furthermore, we constructed a protein-protein interaction network mediated by differentially expressed mRNAs, trans-regulatory networks mediated by differentially expressed lncRNAs, and a competitive endogenous RNA (ceRNA) network to reveal the interactions among these genes in OA. Bioinformatics analysis revealed that these dysregulated genes were involved in regulating multiple biological processes, such as wound healing, negative regulation of ossification, sister chromatid cohesion, positive regulation of interleukin-1 alpha production, sodium ion transmembrane transport, positive regulation of cell migration, and negative regulation of inflammatory response. To the best of our knowledge, this study is the first to reveal the expression pattern of mRNAs, lncRNAs and circRNAs in OA. Conclusion: This study provides novel information suggesting that these differentially expressed RNAs may serve as possible biomarkers and targets in OA.