Volume 15, Issue 10

Current Bioinformatics - Volume 15, Issue 10, 2020

- Salient Features, Data and Algorithms for MicroRNA Screening from Plants: A Review on the Gains and Pitfalls of Machine Learning Techniques
  
  Authors: Garima Ayachit, Inayatullah Shaikh, Himanshu Pandya and Jayashankar Das
  
  https://doi.org/10.2174/1574893615999200601121756
  
  The era of big data and high-throughput genomic technology has given scientists a clear view of plant genomic profiles. However, it has also created a massive need for computational tools and strategies to interpret these data. In this scenario of huge data inflow, machine learning (ML) approaches are emerging as the most promising means of analysing heterogeneous and unstructured biological datasets. With applications extending to healthcare and agriculture, ML approaches are proving useful for microRNA (miRNA) screening as well. Identification of miRNAs is a crucial step towards understanding post-transcriptional gene regulation and miRNA-related pathology. The use of ML tools is becoming indispensable in analysing such data and identifying species-specific, non-conserved miRNAs. However, these techniques have their own benefits and shortcomings. In this review, we discuss the current scenario and pitfalls of ML-based tools for plant miRNA identification and provide some insights into the important features, the need for deep learning models and the directions in which further studies are needed.
  

- Comparisons of MicroRNA Set Enrichment Analysis Tools on Cancer De-regulated miRNAs from TCGA Expression Datasets
  
  Authors: Jianwei Li, Leibo Liu, Qinghua Cui and Yuan Zhou
  
  https://doi.org/10.2174/1574893615666200224095041
  
  Background: De-regulation of microRNAs (miRNAs) is closely related to many complex diseases, including cancers. In The Cancer Genome Atlas (TCGA), hundreds of differentially expressed miRNAs are stored for each type of cancer, and these are hard to interpret intuitively. To date, several miRNA set enrichment tools have been tailored to predict the potential disease associations and functions of de-regulated miRNAs, including the miRNA Enrichment Analysis and Annotation tool (miEAA) and the Tool for Annotations of human MiRNAs (TAM 1.0 & TAM 2.0). However, independent benchmarking of these tools is warranted to assess their effectiveness and robustness, as well as the relationship between enrichment analysis results and the prognostic significance of cancer miRNAs. Methods: Based on differentially expressed miRNAs from expression profiles in TCGA, we performed a series of tests and a comprehensive comparison of the enrichment analysis results of miEAA, TAM 1.0 and TAM 2.0. The work focused on the performance of the three tools, disease similarity based on miRNA-disease associations from the enrichment analysis results, and the relationship between the overrepresented miRNAs from enrichment analysis results and the prognostic significance of cancer miRNAs. Results: The main results show that TAM 2.0 is more likely to identify the disease-related regulatory functions of de-regulated miRNAs; it is feasible to calculate disease similarity based on the enrichment analysis results of TAM 2.0; and there is a weak positive correlation between the occurrence frequency of miRNAs in the TAM 2.0 enrichment analysis results and the prognostic significance of the cancer miRNAs. Conclusion: Our comparison results not only provide a reference for biomedical researchers choosing an appropriate miRNA set enrichment analysis tool for their purpose, but also demonstrate that the degree of overrepresentation of miRNAs could be a supplementary indicator of disease similarity and of the prognostic effect of cancer miRNAs.
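
For readers unfamiliar with set enrichment analysis, the sketch below shows an over-representation test based on the hypergeometric distribution, a common choice for this kind of analysis. The abstract does not state which statistics the compared tools use, so the test itself, the function name `hypergeom_enrichment_p` and all numbers are illustrative assumptions.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, query, overlap):
    """Upper-tail probability P(X >= overlap) for a hypergeometric X.

    population: number of background miRNAs
    annotated:  background miRNAs carrying the annotation (e.g. a disease)
    query:      size of the de-regulated miRNA list
    overlap:    de-regulated miRNAs that carry the annotation
    """
    upper = min(annotated, query)
    total = comb(population, query)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, query - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 500 background miRNAs, 40 annotated to a disease,
# 25 de-regulated miRNAs in the query, 8 of which carry the annotation.
p_value = hypergeom_enrichment_p(500, 40, 25, 8)
print(f"enrichment p-value: {p_value:.3g}")
```

With an expected overlap of only 2 (25 × 40 / 500), an observed overlap of 8 yields a small p-value, which is what an enrichment tool would report as over-representation.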
  

- A Simple Protein Evolutionary Classification Method Based on the Mutual Relations Between Protein Sequences
  
  Authors: Xiaogeng Wan and Xinying Tan
  
  https://doi.org/10.2174/1574893615666200305090055
  
  Background: Proteins are important organic molecules of life, and they vary in sequence, structure and function. Protein evolutionary classification is a popular research topic in computational bioinformatics. Many studies have used protein sequence information to classify the evolutionary relationships of proteins. As the amount of protein sequence data increases, efficient computational tools are needed to classify proteins accurately in the big data paradigm. Methods: In this study, we propose a new, simple and efficient computational approach based on normalized mutual information rates to compute the relationship between protein sequences. We then use the “distances” defined on these relationships to perform the evolutionary classification of proteins. The new method is computationally efficient, model-free and unsupervised, so it does not require training data when performing classifications. Results: Simulation studies on various examples demonstrate the efficiency of the new method. We use precision-recall curves to compare our new method with traditional methods; the results show that the new method outperforms the traditional methods in most cases when performing evolutionary classifications. Conclusion: The new method is simple and proved to be efficient in protein evolutionary classification, which makes it useful for future evolutionary analysis, particularly in the big data paradigm.
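
To make the idea of a mutual-information-based "distance" concrete, here is a toy sketch. It pairs residues position by position over the common length of two sequences and uses 1 − I(X;Y)/H(X,Y) as the distance. Both choices are assumptions made for illustration; the abstract does not describe how the authors estimate information rates, and `nmi_distance` is a hypothetical name.

```python
from collections import Counter
from math import log2


def entropy(counts, n):
    """Shannon entropy (bits) of an empirical distribution given as counts."""
    return -sum(c / n * log2(c / n) for c in counts.values())


def nmi_distance(seq_a, seq_b):
    """Toy distance 1 - I(A;B) / H(A,B), pairing residues by position."""
    n = min(len(seq_a), len(seq_b))
    a, b = seq_a[:n], seq_b[:n]
    h_a = entropy(Counter(a), n)
    h_b = entropy(Counter(b), n)
    h_ab = entropy(Counter(zip(a, b)), n)
    if h_ab == 0:
        return 0.0  # no variation in either sequence, nothing to separate
    mutual_info = h_a + h_b - h_ab
    return 1.0 - mutual_info / h_ab


print(nmi_distance("ACDEFG", "ACDEFG"))      # identical sequences
print(nmi_distance("AAAACCCC", "ACACACAC"))  # statistically independent pairing
```

Pairwise distances of this kind can be fed to any standard clustering or tree-building routine, which is what makes such an approach unsupervised.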
  

- Classification of Chromosomal DNA Sequences Using Hybrid Deep Learning Architectures
  
  Authors: Zhihua Du, Xiangdong Xiao and Vladimir N. Uversky
  
  https://doi.org/10.2174/1574893615666200224095531
  
  Background: Chromosomal DNA contains most of the genetic information of eukaryotes and plays an important role in the growth, development and reproduction of living organisms. Most chromosomal DNA sequences are known to wrap around histones, and distinguishing these DNA sequences from ordinary DNA sequences is important for understanding the genetic code of life. The main difficulty behind this problem is the feature selection process. DNA sequences have no explicit features, and common representation methods, such as one-hot encoding, have the major drawback of high dimensionality. Recently, deep learning models have been shown to extract useful features from input patterns automatically. Objective: We aim to investigate which deep learning networks could achieve notable improvements in the field of DNA sequence classification using only sequence information. Methods: In this paper, we present four different deep learning architectures using convolutional neural networks (CNNs) and long short-term memory (LSTM) networks for the purpose of chromosomal DNA sequence classification. A natural language model (Word2Vec) was used to generate word embeddings of the sequences, from which the deep learning models learn features. Results: The four architectures are compared on 10 chromosomal DNA datasets. The results show that the architecture combining convolutional neural networks with long short-term memory networks is superior to the other methods with regard to the accuracy of chromosomal DNA prediction. Conclusion: In this study, four deep learning models were compared for automatic classification of chromosomal DNA sequences with no sequence preprocessing steps. In particular, we regarded DNA sequences as natural language and extracted word embeddings with Word2Vec to represent them. The results show the superiority of the CNN+LSTM model in the ten classification tasks. The reason for this success is that the CNN module captures the regulatory motifs, while the following LSTM layer captures the long-term dependencies between them.
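
Treating DNA as natural language requires splitting a sequence into "words" before an embedding model can be trained. A common way to do this is overlapping k-mers, sketched below. The value k = 3, the stride and the function name `kmer_tokens` are assumptions; the abstract does not say how the authors tokenized their sequences.

```python
def kmer_tokens(sequence, k=3, stride=1):
    """Split a DNA sequence into overlapping k-mer 'words'."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


tokens = kmer_tokens("ATGCGTAC", k=3)
print(tokens)

# With a 4-letter alphabet there are at most 4**k distinct words, so the
# vocabulary stays small while each word gets a dense embedding vector.
vocabulary_size = 4 ** 3
print(vocabulary_size)
```

Lists of tokens like these are the usual input format for Word2Vec implementations, and the resulting embedding vectors replace the high-dimensional one-hot representation as the input to the CNN and LSTM layers.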
  

- Robust Transcription Factor Binding Site Prediction Using Deep Neural Networks
  
  Authors: Kanu Geete and Manish Pandey
  
  https://doi.org/10.2174/1574893615999200429121156
  
  Aim: To develop a robust and more accurate method for identifying transcription factor binding sites (TFBS) for gene expression. Background: Deep neural networks (DNNs) have shown promising growth in solving complex machine learning problems. Conventional techniques are readily being replaced by DNNs in computer vision, signal processing, healthcare, and genomics. Understanding DNA sequences is a crucial task in healthcare and regulatory genomics. For DNA motif prediction, choosing the right dataset with a sufficient number of input sequences is crucial in order to design an effective model. Objective: To design a new algorithm that works on different datasets while improving performance for TFBS prediction. Methods: With the help of Layerwise Relevance Propagation, the proposed algorithm identifies the invariant features with adaptive noise patterns. Results: Performance is compared with standard as well as recent methods using various metrics, and a significant improvement is noted. Conclusion: By identifying the invariant and robust features in the DNA sequences, the classification performance can be increased.
  

- Detecting TYMS Tandem Repeat Polymorphism by the PSSD Method Based on Next-generation Sequencing
  
  Authors: Binsheng He, Jialiang Yang, Geng Tian, Pingping Bing and Jidong Lang
  
  https://doi.org/10.2174/1574893615999200505074805
  
  Background: Thymidylate Synthase (TS) is an important target for folic acid inhibitors such as pemetrexed, which has considerable effects on the first-line treatment, second-line treatment and maintenance therapy for patients with late-stage Non-Small Cell Lung Cancer (NSCLC). Therefore, detecting mutations in the TYMS gene encoding TS is critical in clinical applications. With the development of Next-Generation Sequencing (NGS) technology, the accuracy of TYMS mutation detection keeps improving. However, traditional methods suffer from false positives and false negatives caused by factors such as limited sequencing read length and sequencing errors. Objective: A method was needed to overcome the short sequencing read length and sequencing errors of NGS to make the detection of TYMS mutations more accurate. Methods: In this study, we developed a novel method based on “Paired Seed Sequence Distance” (PSSD) to detect the Variable Number of Tandem Repeat (VNTR) mutation for TYMS. Results: On the 121 samples validated by Sanger sequencing, the consistency rate of the PSSD method was 85.95% (104/121), higher than that of the strict matching method (78.51%, 95/121). The consistency rate between the two methods was 89.26% (108/121). We also found that the PSSD method was significantly better than the strict matching method, especially in 4R typing. Conclusion: Our method not only improves the detection rate and accuracy of TYMS VNTR mutations but also avoids problems caused by sequencing errors and limited sequencing length. This method provides a new solution for similar polymorphism analyses and other sequencing analyses.
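
The abstract does not spell out the PSSD algorithm. One plausible reading of the name is that the repeat number is inferred from the distance between two seed sequences flanking the repeat region, rather than from exact matches of each repeat unit. The sketch below illustrates that reading only; the seeds, the 5-bp repeat unit and the function name `repeat_count_from_seeds` are hypothetical toy values, not the real TYMS sequences or the authors' implementation.

```python
def repeat_count_from_seeds(read, left_seed, right_seed, unit_length):
    """Estimate the tandem repeat number from the gap between two flanking seeds.

    Returns None when either seed is missing from the read.
    """
    start = read.find(left_seed)
    if start == -1:
        return None
    gap_start = start + len(left_seed)
    end = read.find(right_seed, gap_start)
    if end == -1:
        return None
    gap = end - gap_start
    return round(gap / unit_length)


# Toy read: three copies of a 5-bp unit, the middle copy carrying one
# substitution (CAGCT -> CAGAT) to mimic a sequencing error.
unit = "CAGCT"
read = "AA" + "TTGACA" + "CAGCT" + "CAGAT" + "CAGCT" + "GGATCC" + "TT"

by_distance = repeat_count_from_seeds(read, "TTGACA", "GGATCC", len(unit))
by_strict_match = read.count(unit)
print(by_distance, by_strict_match)
```

In this toy case the substitution makes exact matching undercount the repeats, while the seed distance is unaffected because a substitution does not change the length of the region.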
  

- Whole-exome Sequencing of Tumor-only Samples Reveals the Association between Somatic Alterations and Clinical Features in Pancreatic Cancer
  
  Authors: Wenwen Ran, Xiangbin Chen, Bo Wang, Ping Yang, Yongxing Li, Yujing Xiao, Xiaonan Wang, Guangqi Li, Lili Wang, Yingmin Han, Yonggang Peng, Jidong Lang, Yuebin Liang, Yupei Xiao, Qingqing Lu, Huixin Lin, Geng Tian, Dawei Yuan, Chaoyang Deng, Jialiang Yang and Xiaoming Xing
  
  https://doi.org/10.2174/1574893615999200626190346
  
  Background: Identification of genomic markers using next-generation sequencing (NGS) technology would be valuable for guiding precision medicine treatments for pancreatic cancers. Traditional somatic mutation methods require both tumor and matched non-tumor samples. However, in most cases only tumor samples are available, especially in retrospective studies. In this study, we analyzed the associations between clinical features and oncogenic somatic mutations in genome-wide tumor-only samples. Methods: Fifty-four tumor-only samples derived from pancreatic cancer patients were used for whole-exome sequencing. An approach involving SNP filtering of variants included in the Catalogue of Somatic Mutations in Cancer (COSMIC) database was used to identify oncogenic somatic mutations. The relationships between oncogenic mutations and clinical features were analyzed and compared with those from the TCGA database. Results: By analyzing the mutations from tumor-only samples, divergent mutation profiles were observed in different locations (head vs. body/tail) of pancreatic tumors. The divergences between pancreatic head and body/tail cancers were also confirmed by the TCGA data. Furthermore, mutations of several genes were found to be significantly associated with clinical features, such as pathological stage and the degree of tumor differentiation. Conclusion: The results confirmed the efficiency of our approach in identifying oncogenic somatic mutations from tumor-only samples and revealed the associations between somatic mutations and clinical features in pancreatic cancer.
  

- IsoDetect: Detection of Splice Isoforms from Third Generation Long Reads Based on Short Feature Sequences
  
  Authors: Hong-Dong Li, Wenjing Zhang, Yuwen Luo and Jianxin Wang
  
  https://doi.org/10.2174/1574893615666200316101205
  
  Background: Transcriptome annotation is the basis for understanding gene structures and analysing gene expression. The transcriptome annotation of many organisms, including humans, is far from complete, due partly to the challenge of identifying isoforms that are produced from the same gene through alternative splicing. Third generation sequencing (TGS) reads provide an unprecedented opportunity for detecting isoforms because their length exceeds that of most isoforms. One limitation of current TGS read-based isoform detection methods is that they are based exclusively on sequence reads, without incorporating the sequence information of annotated isoforms. Objective: We aim to develop a method to detect isoforms by incorporating annotated isoforms. Methods: Based on annotated isoforms, we propose a splice isoform detection method called IsoDetect. First, the sequences at exon-exon junctions are extracted from annotated isoforms as “short feature sequences”, which are used to distinguish splice isoforms. Second, we align these feature sequences to long reads and partition the long reads into groups that contain the same set of feature sequences, thereby avoiding pair-wise comparison among the large number of long reads. Third, clustering and consensus generation are carried out based on sequence similarity. For the long reads that do not contain any short feature sequence, clustering analysis based on sequence similarity is performed to identify isoforms. Therefore, our method can detect not only known but also novel isoforms. Results: Tested on two datasets from Calypte anna and zebra finch, IsoDetect shows higher speed and good accuracy compared with four existing methods. Conclusion: IsoDetect may become a promising method for isoform detection.
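
The second step of the method, partitioning long reads by the set of feature sequences they contain, can be sketched in a few lines. This is a simplified illustration, not the IsoDetect implementation: it uses exact substring matching where the abstract describes alignment, and the junction sequences, read sequences and the name `group_reads_by_features` are made up.

```python
from collections import defaultdict


def group_reads_by_features(reads, feature_seqs):
    """Group read IDs by the set of junction feature sequences each read contains."""
    groups = defaultdict(list)
    for read_id, read in reads.items():
        present = frozenset(
            name for name, seq in feature_seqs.items() if seq in read
        )
        groups[present].append(read_id)
    return dict(groups)


# Hypothetical exon-exon junction sequences and long reads.
features = {"j1": "AAGGTT", "j2": "CCGATA", "j3": "TGCATG"}
reads = {
    "r1": "TTAAGGTTACACCCGATA",
    "r2": "AAGGTTGGCCGATACC",
    "r3": "AAGGTTTGCATGAA",
    "r4": "ACACACACAC",
}

groups = group_reads_by_features(reads, features)
for feature_set, read_ids in groups.items():
    print(sorted(feature_set), read_ids)
```

Reads in the same group can then be clustered and collapsed into a consensus without comparing every read against every other read. The group with an empty feature set corresponds to the reads that the abstract says are clustered by sequence similarity alone to find novel isoforms.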
  

- Using Bioinformatics to Quantify the Variability and Diversity of the Microbial Community Structure in Pond Ecosystems of a Subtropical Catchment
  
  Authors: Jiaogen Zhou, Yang Wang and Qiuliang Lei
  
  https://doi.org/10.2174/1574893615999200422120819
  
  Background: In rural China, many natural water bodies and farmlands have been converted into fish farming ponds as an economic development strategy. There is still a limited understanding of how the diversity and structure of microbial communities change when natural ponds become managed fish pond ecosystems. Objective: We aimed to identify the changes in the diversity and structure of the microbial community in pond ecosystems and their driving mechanisms. Methods: 16S rRNA amplicon sequencing datasets and the concentrations of N and P fractions were obtained from water samples of pond ecosystems. Bioinformatics analysis was used to analyze the diversity and structure of the microbial communities. Results: Our results indicated that the diversity and structure of the microbial communities in the natural ponds were significantly different from those in managed fish ponds. The N and P nutrients and water environmental factors were responsible for 46.3% and 19.5% of the changes in the structure and diversity of the microbial community, respectively. Conclusion: The N and P fractions and water environmental factors influenced the microbial community structure and diversity in pond ecosystems. When a natural pond was converted into a managed fish pond, fish farming indirectly affected the microbial community by altering the contents of N and P fractions in the water.
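
As an illustration of how community diversity is typically quantified from amplicon count data, the sketch below computes the Shannon index, a widely used alpha-diversity measure. The abstract does not say which diversity indices the authors used, so the choice of index and the taxon counts are assumptions.

```python
from math import log


def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * log(c / total) for c in counts if c > 0)


# Hypothetical taxon counts for two water samples.
even_community = [10, 10, 10, 10]
dominated_community = [37, 1, 1, 1]

print(shannon_index(even_community))
print(shannon_index(dominated_community))
```

An evenly distributed community scores higher than one dominated by a single taxon, even when both contain the same number of taxa and the same number of reads.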
  

- Integrative Analysis of miRNA-mediated Competing Endogenous RNA Network Reveals the lncRNAs-mRNAs Interaction in Glioblastoma Stem Cell Differentiation
  
  Authors: Zhenyu Zhao, Cheng Zhang, Mi Li, Xinguang Yu, Hailong Liu, Qi Chen, Jian Wang, Shaopin Shen and Jingjing Jiang
  
  https://doi.org/10.2174/1574893615999200511074226
  
  Background: Competing endogenous RNA (ceRNA) networks play a pivotal role in tumor diagnosis and progression. Numerous studies have explored the functional landscape and prognostic significance of ceRNA interactions within differentiated tumor cells. Objective: We propose a new perspective by exploring ceRNA networks in the process of glioblastoma stem cell (GSC) differentiation. Methods: In this study, expression profiles of lncRNAs and mRNAs were compared between GSCs and differentiated glioblastoma cells. Using a comprehensive computational method, miRNA-mediated and GSC differentiation-associated ceRNA crosstalk between lncRNAs and mRNAs was identified. A ceRNA network was then established to select potential candidates that regulate GSC differentiation. Results: Based on the specific ceRNA network related to GSC differentiation, we identified lnc MYOSLID:11 as a ceRNA that regulated the expression of the downstream gene PXN by competitively binding with hsa-miR-149-3p. In Kaplan-Meier (KM) survival analysis, the expression of the PXN gene (P = 0.0015) and of lnc MYOSLID:11 (P = 0.041) showed significant correlation with glioblastoma in 160 patients from TCGA. Conclusion: This result sheds light on a potential way of studying the ceRNA network, which can provide clues for developing new diagnostic methods and finding therapeutic targets for the clinical treatment of glioblastoma.
  

- Identification of Most Relevant Features for Classification of Francisella tularensis using Machine Learning
  
  Authors: Fareed Ahmad, Amjad Farooq, Muhammad U. Ghani Khan, Muhammad Zubair Shabbir, Masood Rabbani and Irshad Hussain
  
  https://doi.org/10.2174/1574893615666200219113900
  
  Background: Francisella tularensis is a stealth pathogen that is fatal for animals and humans. Its ease of propagation, coupled with a high capacity to cause illness and death, makes it a potential candidate for a biological weapon. Objective: Work related to the pathogen’s classification and the factors affecting its prolonged existence in soil has been limited to statistical measures. Machine learning, rather than conventional analysis methods, may be applied to improve epidemiological modeling for this soil-borne pathogen. Methods: The feature-ranking algorithms relief, correlation and oneR are used for soil attribute ranking. In addition, the classification algorithms SVM, random forest, naive Bayes, logistic regression and multilayer perceptron (MLP) are used to classify the soil attribute dataset into Francisella tularensis positive and negative soils. Results: The feature-ranking methods indicated that clay, nitrogen, organic matter, soluble salts, zinc, silt and nickel are the most significant attributes, while potassium, phosphorus, iron, calcium, copper, chromium and sand are the least contributing risk factors for the persistence of the pathogen. Overall, clay is the most significant and potassium the least contributing attribute. Data analysis suggests that feature ranking using relief produced a classification accuracy of 84.35% for MLP, 82.99% for logistic regression, 80.27% for SVM and random forest, and 78.23% for naive Bayes, which is better than the other ranking methods. MLP outperforms the other classifiers, with accuracies of 84.35%, 82.99% and 81.63% for feature ranking using the relief, correlation and oneR algorithms, respectively. Conclusion: These models can significantly improve accuracy and can minimize the risk of incorrect classification. They further help in controlling epidemics and thereby minimize the socio-economic impact on society.
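
Of the three ranking algorithms named in the Methods, oneR is the simplest to illustrate: each attribute is scored by the accuracy of a one-rule classifier that predicts the majority class for each attribute value. The sketch below assumes the attributes have already been binned into categories, and the six-sample table is invented for illustration; it is not the study's dataset or code.

```python
from collections import Counter, defaultdict


def one_r_accuracy(values, labels):
    """Training accuracy of a rule that predicts the majority class per value."""
    by_value = defaultdict(Counter)
    for value, label in zip(values, labels):
        by_value[value][label] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_value.values())
    return correct / len(labels)


def rank_features(table, labels):
    """Sort attributes by one-rule accuracy, best first."""
    scores = {name: one_r_accuracy(col, labels) for name, col in table.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical, pre-binned soil attributes for six samples.
table = {
    "clay": ["high", "high", "low", "low", "high", "low"],
    "potassium": ["high", "low", "high", "low", "high", "low"],
}
labels = ["pos", "pos", "neg", "neg", "pos", "neg"]

ranking = rank_features(table, labels)
print(ranking)
```

The top-ranked attributes from a step like this would then be passed to the classifiers, which is how the abstract arrives at a separate accuracy figure for each combination of ranking method and classifier.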
  

- MRMD2.0: A Python Tool for Machine Learning with Feature Ranking and Reduction
  
  Authors: Shida He, Fei Guo, Quan Zou and Hui Ding
  
  https://doi.org/10.2174/1574893615999200503030350
  
  Aims: The study aims to find a way to reduce the dimensionality of a dataset. Background: Dimensionality reduction is a key issue in the machine learning process. It not only improves prediction performance but can also recommend the intrinsic features and help to explore the biological meaning of the machine learning “black box”. Objective: A variety of feature selection algorithms are used to select data features to achieve dimensionality reduction. Methods: First, MRMD2.0 integrates 7 popular feature ranking algorithms using a PageRank strategy. Second, the optimized dimensionality is detected with a forward adding strategy. Result: We achieved good results in our experiments. Conclusion: Several works have been tested with MRMD2.0, and it performed well. In addition, it can draw performance curves as a function of feature dimensionality. If users want to sacrifice some accuracy for fewer features, they can select the dimensionality from the performance curves. Other: We developed user-friendly Python tools together with a web server. Users can upload their files in csv, arff or libsvm format, and the web server will rank the features and find the optimized dimensionality.
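
The first Methods step, merging several feature rankings with a PageRank strategy, can be illustrated as follows. The abstract does not describe how MRMD2.0 builds its graph, so the construction here is an assumption: in every input ranking, each feature "votes" for all features ranked above it, and PageRank is computed by plain power iteration. The function name `pagerank_aggregate` and the three toy rankings are hypothetical.

```python
def pagerank_aggregate(rankings, damping=0.85, iterations=100):
    """Merge several best-first feature rankings into one via PageRank."""
    features = sorted({f for ranking in rankings for f in ranking})
    out_links = {f: [] for f in features}
    for ranking in rankings:
        for i, better in enumerate(ranking):
            for worse in ranking[i + 1:]:
                out_links[worse].append(better)  # worse feature votes for better

    n = len(features)
    score = {f: 1.0 / n for f in features}
    for _ in range(iterations):
        new_score = {f: (1.0 - damping) / n for f in features}
        for f in features:
            # A feature with no out-links spreads its score evenly.
            targets = out_links[f] or features
            share = damping * score[f] / len(targets)
            for target in targets:
                new_score[target] += share
        score = new_score
    return sorted(features, key=score.get, reverse=True)


# Three hypothetical rankings produced by different ranking algorithms.
rankings = [
    ["f1", "f2", "f3"],
    ["f1", "f3", "f2"],
    ["f1", "f2", "f3"],
]
merged = pagerank_aggregate(rankings)
print(merged)
```

A forward adding step would then evaluate a classifier on the top 1, 2, 3, ... features of the merged ranking and keep the dimensionality with the best score; that part is not sketched here.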
  

- Genome-wide Identification of Differently Expressed lncRNAs, mRNAs, and circRNAs in Patients with Osteoarthritis
  
  Authors: Yeqing Sun, Lei Chen, Yingqi Zhang, Jincheng Zhang and Shashi R. Tiwari
  
  https://doi.org/10.2174/1574893615999200706002907
  
  Background: Osteoarthritis (OA), one of the most important causes of joint disability, has been considered an untreatable disease. A series of RNAs have been reported to regulate the pathogenesis of OA, including microRNAs, long non-coding RNAs (lncRNAs) and circular RNAs (circRNAs). So far, the expression profiles and functions of lncRNAs, mRNAs, and circRNAs in OA are not fully understood. Objective: The present study aimed to identify differentially expressed genes in OA. Methods: RNA-seq was conducted to identify differentially expressed genes in OA. Gene Ontology (GO) analysis was used to analyze Molecular Function and Biological Process terms. KEGG pathway analysis was used to map the differentially expressed lncRNAs to biological pathways. Results: Hierarchical clustering revealed a total of 943 mRNAs, 518 lncRNAs, and 300 circRNAs that were dysregulated in OA compared to normal samples. Furthermore, we constructed a protein-protein interaction network mediated by differentially expressed mRNAs, trans-regulatory networks mediated by differentially expressed lncRNAs, and a competitive endogenous RNA (ceRNA) network to reveal the interactions among these genes in OA. Bioinformatics analysis revealed that these dysregulated genes were involved in regulating multiple biological processes, such as wound healing, negative regulation of ossification, sister chromatid cohesion, positive regulation of interleukin-1 alpha production, sodium ion transmembrane transport, positive regulation of cell migration, and negative regulation of inflammatory response. To the best of our knowledge, this study is the first to reveal the expression pattern of mRNAs, lncRNAs and circRNAs in OA. Conclusion: This study provides novel information suggesting that these differentially expressed RNAs may serve as biomarkers and targets in OA.
  

- Acknowledgement to Reviewers
  
  https://doi.org/10.2174/157489361510201224092240

Most Cited

- A Review of Ensemble Methods in Bioinformatics
  
  Authors: Pengyi Yang, Yee Hwa Yang, Bing B. Zhou and Albert Y. Zomaya
- Bioinformatics Tools for Mass Spectroscopy-Based Metabolomic Data Processing and Analysis
  
  Authors: Masahiro Sugimoto, Masato Kawakami, Martin Robert, Tomoyoshi Soga and Masaru Tomita
- Distance-based Support Vector Machine to Predict DNA N6-methyladenine Modification
  
  Authors: Haoyu Zhang, Quan Zou, Ying Ju, Chenggang Song and Dong Chen
- A Review on the Recent Developments of Sequence-based Protein Feature Extraction Methods
  
  Authors: Jun Zhang and Bin Liu
- Molecular Genetic Markers: Discovery, Applications, Data Storage and Visualisation
  
  Authors: Chris Duran, Nikki Appleby, David Edwards and Jacqueline Batley
- A Brief Survey of Machine Learning Methods in Protein Sub-Golgi Localization
  
  Authors: Wuritu Yang, Xiao-Juan Zhu, Jian Huang, Hui Ding and Hao Lin
- Cancer Diagnosis Through IsomiR Expression with Machine Learning Method
  
  Authors: Zhijun Liao, Dapeng Li, Xinrui Wang, Lisheng Li and Quan Zou
- Relevance of Molecular Docking Studies in Drug Designing
  
  Authors: Ritu Jakhar, Mehak Dangi, Alka Khichi and Anil K. Chhillar
- The Advances and Challenges of Deep Learning Application in Biological Big Data Processing
  
  Authors: Li Peng, Manman Peng, Bo Liao, Guohua Huang, Weibiao Li and Dingfeng Xie
- Gene Expression Profile Classification: A Review
  
  Authors: Musa H. Asyali, Dilek Colak, Omer Demirkaya and Mehmet S. Inan
