Current Bioinformatics - Volume 17, Issue 5, 2022
Differential Gene Expression in Cancer: An Overrated Analysis
Authors: Jessica Carballido and Rocío Cecchini
The search for marker genes associated with different pathologies traditionally begins with some form of differential expression analysis. This step is essential in most functional genomics studies that analyze gene expression data. In this article, we present a different analysis, starting from the known biological significance of different groups of genes and then assessing the proportion of differentially expressed genes within them. The analysis is performed on cancer expression data to assess the true importance of differential expression from the perspective of different research objectives. First, the percentage of differentially expressed genes was found to be generally low in gene sets annotated in KEGG. On the other hand, when statistical and machine learning models were trained and used for prediction, using differentially expressed genes substantially improved their results.
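The proportion described above can be illustrated with a minimal sketch. It assumes gene sets are given as a mapping from set name to member genes and that a list of differentially expressed genes is already available; the function name and the toy identifiers are hypothetical and are not taken from the article or from KEGG.

```python
def deg_fraction(gene_sets, degs):
    """Return {set_name: fraction of member genes that are differentially expressed}."""
    deg_lookup = set(degs)
    fractions = {}
    for name, genes in gene_sets.items():
        members = set(genes)
        if not members:
            continue  # skip empty sets to avoid dividing by zero
        fractions[name] = len(members & deg_lookup) / len(members)
    return fractions


# Hypothetical toy data: placeholder names, not real KEGG annotations.
toy_sets = {
    "pathway_A": ["g1", "g2", "g3", "g4"],
    "pathway_B": ["g3", "g5"],
}
toy_degs = ["g3", "g9"]

fractions = deg_fraction(toy_sets, toy_degs)
```

A low value for most sets would correspond to the first observation reported in the abstract.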
Extended XOR Algorithm with Biotechnology Constraints for Data Security in DNA Storage
Background: DNA storage has become a global research hotspot in recent years, and most current research focuses on storage density and big data. The security of DNA storage also needs attention. Some DNA-based security methods have been introduced for traditional information security problems. However, few encryption algorithms have considered the limitations of biotechnology and applied them to DNA storage. DNA cryptography differs from traditional cryptography in that it is based on the limitations of biotechnology rather than on numerical computation. Objective: An extended XOR algorithm (EXA) is introduced for encryption under biotechnology constraints; it can partly solve synthesis and sequencing problems in DNA storage, such as GC content and homopolymers. Methods: The target file was converted with a quaternary DNA storage model to maximize storage efficiency. The key file could be ‘anything’ and was converted into a DNA sequence with a binary DNA storage model to make the best use of the key file's length. Results: The input files were encrypted into DNA storage and decrypted to error-free output files. Conclusion: Error-free encrypted DNA storage is feasible, and EXA paves the way for encryption in large-scale DNA storage.
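The abstract does not give the details of EXA, so the following is only a rough sketch of the general idea: a plain repeating-key XOR followed by a quaternary (2 bits per base) mapping, plus helpers for the two constraints the abstract names, GC content and homopolymer length. The base mapping, the function names, and the use of the quaternary model alone are assumptions made for illustration; this is not the authors' algorithm.

```python
from itertools import cycle, groupby

BASES = "ACGT"  # assumed 2-bit mapping: A=00, C=01, G=10, T=11


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four bases, most significant bit pair first."""
    return "".join(BASES[(byte >> shift) & 0b11] for byte in data for shift in (6, 4, 2, 0))


def dna_to_bytes(seq: str) -> bytes:
    """Inverse of bytes_to_dna; expects a length that is a multiple of four."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Plain repeating-key XOR; applying it twice with the same key restores the data."""
    if not key:
        raise ValueError("key must not be empty")
    return bytes(d ^ k for d, k in zip(data, cycle(key)))


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def max_homopolymer(seq: str) -> int:
    """Length of the longest run of one repeated base."""
    return max(len(list(run)) for _, run in groupby(seq))


message = b"DNA storage"
key = b"any key file"
strand = bytes_to_dna(xor_bytes(message, key))
recovered = xor_bytes(dna_to_bytes(strand), key)
```

In a constraint-aware scheme, `gc_content(strand)` and `max_homopolymer(strand)` would be checked against limits imposed by synthesis and sequencing, and the encoding adjusted when a strand falls outside them.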
SEMCM: A Self-Expressive Matrix Completion Model for Anti-cancer Drug Sensitivity Prediction
Authors: Lin Zhang, Yuwei Yuan, Jian Yu and Hui Liu
Background: Genomic data sets generated by several recent large-scale high-throughput screening efforts pose a complex computational challenge for anticancer drug sensitivity prediction. Objective: We aimed to design an algorithm that predicts missing elements in incomplete matrices and is applicable to drug response prediction. Methods: We developed a novel self-expressive matrix completion model (SEMCM) to improve predictive performance in drug response prediction problems. The model is based on the idea of subspace clustering and, as a convex problem, can be solved by the alternating direction method of multipliers. The original incomplete matrix is filled in through model training, with parameters updated iteratively. Results: We applied SEMCM to the Genomics of Drug Sensitivity in Cancer (GDSC) and Cancer Cell Line Encyclopedia (CCLE) datasets to predict unknown response values. Extensive experiments show that the algorithm achieves good prediction results and stability, outperforming several existing advanced drug sensitivity prediction and matrix completion algorithms. Without modeling mutation information, SEMCM could correctly predict cell line-drug associations for mutated and wild-type cell lines. SEMCM can also be used for drug repositioning. The newly predicted drug responses for the GDSC dataset suggest that TI-73 was sensitive to Erlotinib. Moreover, the sensitivity of A172 and NCIH1437 to Paclitaxel was roughly the same. Conclusion: We report an efficient, open-source anticancer drug sensitivity prediction algorithm that can predict the unknown responses of cancer cell lines to drugs. Experimental results show that our method not only improves prediction accuracy but can also be applied to drug repositioning.
COVID-19 Biomarkers Recognition & Classification Using Intelligent Systems
Background: SARS-CoV-2 has paralyzed mankind due to its high transmissibility and associated mortality, causing millions of infections and deaths worldwide. The search for gene expression biomarkers in the host transcriptional response to infection may help explain the mechanisms by which the virus causes COVID-19. This research proposes a smart methodology integrating different RNA-Seq datasets from SARS-CoV-2, other respiratory diseases, and healthy individuals. Methods: The proposed pipeline exploits the functionality of the ‘KnowSeq’ R/Bioc package, integrating different data sources and attaining a significantly larger gene expression dataset, thus endowing the results with higher statistical significance and robustness than previous studies in the literature. A detailed preprocessing step was carried out to homogenize the samples and build a clinical decision system for SARS-CoV-2. The system uses machine learning techniques, namely a feature selection algorithm and a supervised classifier, and relies on the most differentially expressed genes among the different diseases (including SARS-CoV-2) to build a four-class classifier. Results: The resulting multiclass classifier can discern SARS-CoV-2 samples, reaching an accuracy of 91.5%, a mean F1-score of 88.5%, and a SARS-CoV-2 AUC of 94% using only 15 genes as predictors. A biological interpretation of the extracted gene signature reveals relations with processes involved in viral responses. Conclusion: This work proposes a COVID-19 gene signature composed of 15 genes, selected with the ‘minimum Redundancy Maximum Relevance’ feature selection algorithm. The integration of several RNA-Seq datasets was successful, yielding a considerably large number of samples and therefore greater statistical significance than in previous studies. A biological interpretation of the selected genes was also provided.
DSAE-Impute: Learning Discriminative Stacked Autoencoders for Imputing Single-cell RNA-seq Data
Authors: Shengfeng Gan, Huan Deng, Yang Qiu, Mohammed Alshahrani and Shichao Liu
Background: Because of the limited amount of mRNA in a single cell, scRNA-seq data always contain many missing values, which makes it impossible to accurately quantify single-cell RNA expression. This dropout phenomenon prevents the detection of truly expressed genes in some cells and greatly affects downstream analyses of scRNA-seq data, such as cell clustering and cell development trajectory inference. Objective: This research proposes an accurate deep learning method to impute the missing values in scRNA-seq data. DSAE-Impute employs stacked autoencoders to capture gene expression characteristics in the original data with missing values, and combines them with a discriminative correlation matrix between cells to capture global expression features during training, so that missing values can be predicted accurately. Methods: We propose a novel deep learning model based on discriminative stacked autoencoders, named DSAE-Impute, to impute the missing values in scRNA-seq data. DSAE-Impute embeds discriminative cell similarity to refine the feature representation of the stacked autoencoders and learns the expression pattern of scRNA-seq data comprehensively through layer-by-layer training to achieve accurate imputation. Results: We systematically evaluated the performance of DSAE-Impute on simulated and real datasets. The experimental results demonstrate that DSAE-Impute significantly improves downstream analysis and that its imputation results are more accurate than those of other state-of-the-art imputation methods. Conclusion: Extensive experiments show that, compared with other state-of-the-art methods, the imputation results of DSAE-Impute on simulated and real datasets are more accurate and more helpful for downstream analysis.
m5C-HPromoter: An Ensemble Deep Learning Predictor for Identifying 5-methylcytosine Sites in Human Promoters
Authors: Xuan Xiao, Yu-Tao Shao, Zhen-Tao Luo and Wang-Ren Qiu
Aims: This paper aims to identify 5-methylcytosine sites in human promoters. Background: Aberrant DNA methylation patterns are often associated with tumor development. Moreover, hypermethylation inhibits the expression of tumor suppressor genes, and hypomethylation stimulates the expression of certain oncogenes. Most DNA methylation occurs on the CpG islands of gene promoter regions. Objective: A comprehensive assessment of the methylation status of human gene promoter regions is therefore extremely important for understanding cancer pathogenesis and the function of posttranscriptional modification. Methods: This paper constructed three human promoter methylation datasets, which comprise a total of 3 million sample sequences of small cell lung cancer, non-small cell lung cancer, and hepatocellular carcinoma from the Cancer Cell Line Encyclopedia (CCLE) database. Frequency-based One-Hot encoding was used to encode the sample sequences, and an innovative stacking-based ensemble deep learning classifier was applied to build the m5C-HPromoter predictor. Results: Averaged over 10 runs of 5-fold cross-validation, m5C-HPromoter achieved Accuracy (Acc) = 0.9270, Matthews correlation coefficient (MCC) = 0.7234, Sensitivity (Sn) = 0.9123, and Specificity (Sp) = 0.9290. Conclusion: Numerical experiments showed that the proposed m5C-HPromoter greatly improves prediction performance compared with the existing iPromoter-5mC predictor. The primary reason is that frequency-based One-Hot encoding avoids the overly long and sparse features of One-Hot encoding and effectively reflects the sequence features of DNA sequences. The second reason is that the combination of upsampling and downsampling successfully addresses the class imbalance problem. The third reason is that the stacking-based ensemble deep learning model compensates for the shortcomings of the individual models while combining their strengths. The user-friendly web server m5C-HPromoter is freely accessible to the public at http://121.36.221.79/m5C-HPromoter or http://bioinfo.jcu.edu.cn/m5C-HPromoter, and the predictor program is available at https://github.com/liujin66/m5C-HPromoter.
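For readers less familiar with the four reported metrics, the sketch below computes them from binary confusion-matrix counts using the standard definitions. The function name and the example counts are hypothetical and are unrelated to the paper's data.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, sensitivity, specificity and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sn = tp / (tp + fn)  # sensitivity: fraction of true sites that are recovered
    sp = tn / (tn + fp)  # specificity: fraction of non-sites that are rejected
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a margin is zero; 0.0 is a common convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sn": sn, "Sp": sp, "MCC": mcc}


# Hypothetical counts, chosen only to exercise the formulas.
scores = binary_metrics(tp=90, fp=20, tn=80, fn=10)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is often reported alongside Sn and Sp for imbalanced datasets such as methylation site prediction.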
Promising Novel Biomarkers and Candidate Drugs or Herbs in Osteoarthritis: Evidence from Bioinformatics Analysis of High-throughput Data
Authors: Linghui Qiao, Jie Han, Guancheng Wang, Tao Yuan and Yanglin Gu
Background: Osteoarthritis (OA) is the most common joint disease. The goal of this study was to find changes in gene signatures between normal knee joints and OA tissue samples and to look for prospective gene targets for OA. Methods: The gene expression profiles GSE12021, GSE51588, and GSE55457 were downloaded from the Gene Expression Omnibus (GEO). A total of 64 samples (40 OA and 24 normal control (NC) samples) were used. The limma package was used to find differentially expressed genes (DEGs) in OA versus NC. Functional annotation and protein-protein interaction (PPI) network construction were performed for OA-specific DEGs. Finally, candidate drugs and herbs with the potential to treat OA were predicted using the DGIdb and TCMIO databases. Results: A total of 19 upregulated and 27 downregulated DEGs between OA and NC samples were identified. DEGs such as PTN, COMP, NELL1, and MN1 showed a significant correlation with OA and are expected to become new biomarkers. Cellular senescence, positive regulation of ossification, and vascular endothelial growth factor (VEGF) were significantly enriched for OA-specific DEGs. In the cellular component analysis, DEGs were also found to be highly enriched in the cytosol. We identified a total of 68 drugs or molecular compounds that show promise for reversing OA-related DEGs. Honeycomb and cinnamon oil may have potential for treating OA. Conclusion: Our findings suggest new biomarkers that can be used to diagnose OA. Furthermore, we sought drugs and traditional Chinese medicines that may slow the progression of OA. This research may improve the identification and treatment of this intractable chronic disease.
Distance-based Support Vector Machine to Predict DNA N6-methyladenine Modification
Authors: Haoyu Zhang, Quan Zou, Ying Ju, Chenggang Song and Dong Chen
Background: DNA N6-methyladenine (6mA) plays an important role in the restriction-modification system, which protects against invasion by foreign DNA. Experimental methods for detecting it are time-consuming and costly, and computational methods have emerged as an alternative. Support vector machine theory has received extensive attention in bioinformatics because of its solid theoretical foundation and many good characteristics. Objective: General machine learning methods include an important feature extraction step. This work omits that step and replaces it with an easy-to-obtain sequence distance matrix to obtain better results. Methods: First, sequence alignment was used to obtain a similarity matrix. Then, a novel transformation turned the similarity matrix into a distance matrix. Next, the similarity-distance matrix was made positive semi-definite so that it could be used as a kernel matrix. Finally, the LIBSVM software was applied to solve the support vector machine. Results: In five-fold cross-validation on rice and mouse data, the model achieved accuracy rates of 92.04% and 96.51%, respectively. This shows that the DB-SVM method has clear advantages over traditional machine learning methods. Meanwhile, the model achieved 0.943, 0.982, and 0.818 accuracy; 0.944, 0.982, and 0.838 Matthews correlation coefficient; and 0.942, 0.982, and 0.840 F1 scores for the rice, M. musculus, and cross-species genome datasets, respectively. Conclusion: These outcomes show that the model outperforms iIM-CNN and csDMA, which are among the most recent methods for predicting DNA 6mA modification.
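The Methods steps can be sketched roughly as follows. The abstract does not specify the transformation or how positive semi-definiteness is enforced, so everything here is an assumption made for illustration: a normalized similarity turned into a distance, an exponential kernel on that distance, and a diagonal shift (justified by the Gershgorin circle theorem) as a simple stand-in for the authors' procedure. All function names and the toy matrix are hypothetical.

```python
import math


def similarity_to_distance(sim):
    """Assumed transform: d_ij = 1 - s_ij / sqrt(s_ii * s_jj), so d_ii = 0."""
    n = len(sim)
    return [[1.0 - sim[i][j] / math.sqrt(sim[i][i] * sim[j][j]) for j in range(n)]
            for i in range(n)]


def distance_to_kernel(dist, gamma=1.0):
    """Assumed kernel: k_ij = exp(-gamma * d_ij)."""
    return [[math.exp(-gamma * d) for d in row] for row in dist]


def shift_to_psd(kernel):
    """Add a constant to the diagonal until every row is diagonally dominant.

    For a symmetric matrix this places all Gershgorin discs in [0, inf),
    so the shifted matrix is positive semi-definite.
    """
    n = len(kernel)
    shift = 0.0
    for i in range(n):
        off_diag = sum(abs(kernel[i][j]) for j in range(n) if j != i)
        shift = max(shift, off_diag - kernel[i][i])
    return [[kernel[i][j] + (shift if i == j else 0.0) for j in range(n)]
            for i in range(n)]


# Hypothetical symmetric alignment scores for three sequences.
toy_similarity = [[10.0, 8.0, 2.0],
                  [8.0, 10.0, 3.0],
                  [2.0, 3.0, 10.0]]

distance = similarity_to_distance(toy_similarity)
kernel = shift_to_psd(distance_to_kernel(distance))
```

The diagonal shift is conservative and needs no linear algebra library; clipping negative eigenvalues is a common alternative when one is available. A matrix prepared this way could then be handed to an SVM solver that accepts precomputed kernels.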