Volume 20, Issue 2
  • ISSN: 1574-8936
  • E-ISSN: 2212-392X

Abstract

Background

5-methylcytosine (5mC) is one of the most prevalent epigenetic alterations in all three kingdoms of life and plays a part in a wide range of biological functions. Although experimental techniques detect epigenetic alterations more effectively, they are time- and cost-intensive. Artificial intelligence-based approaches have been used to overcome these obstacles.

Aim

This study aimed to develop a machine learning (ML)-based predictor for the detection of 5mC sites in Poaceae.

Objective

The objective of this study was to evaluate machine learning and deep learning models for the prediction of 5mC sites in rice.

Methods

In this study, DNA sequences were vectorized using three distinct feature sets: Oligo Nucleotide Frequencies (k = 2), Mono-nucleotide Binary Encoding, and Chemical Properties of Nucleotides. Two deep learning models, long short-term memory (LSTM) and bidirectional LSTM (Bi-LSTM), and nine machine learning models (random forest, gradient boosting, naïve Bayes, regression tree, k-nearest neighbour, support vector machine, AdaBoost, multiple logistic regression, and artificial neural network) were investigated. In addition, bootstrap resampling was used to build more efficient models, along with a hybrid feature selection module for dimensionality reduction and removal of irrelevant features from the vector space.
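
As an illustration only, the three feature sets named above can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the function names, the one-hot column order, and the three-bit chemical property table (ring structure, functional group, hydrogen bonding) are assumptions based on encodings commonly used for nucleotide sequences, and the GB5mCPred package may differ in every one of these details.

```python
from itertools import product

NUCLEOTIDES = "ACGT"

# Assumed one-hot layout in the order A, C, G, T.
BINARY = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}

# Assumed chemical property table: (purine ring, amino group, weak H-bond).
CHEMICAL = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "T": (0, 0, 1),
}


def dinucleotide_frequencies(seq):
    """Relative frequency of each of the 16 overlapping 2-mers (k = 2)."""
    kmers = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - 1):
        kmer = seq[i:i + 2]
        if kmer in counts:  # skip 2-mers containing ambiguous bases
            counts[kmer] += 1
            total += 1
    return [counts[k] / total if total else 0.0 for k in kmers]


def binary_encoding(seq):
    """Concatenate the 4-bit one-hot code of every position."""
    return [bit for base in seq for bit in BINARY[base]]


def chemical_encoding(seq):
    """Concatenate the 3-bit chemical property code of every position."""
    return [value for base in seq for value in CHEMICAL[base]]


def encode(seq):
    """Join the three feature sets into one vector (A/C/G/T input only)."""
    seq = seq.upper()
    return (
        dinucleotide_frequencies(seq)
        + binary_encoding(seq)
        + chemical_encoding(seq)
    )
```

For a sequence of length n containing only A, C, G, and T, this sketch yields a vector of 16 + 4n + 3n values, which is the kind of vector space a feature selection step would then reduce.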

Results

Random Forest achieved the highest accuracy (92.6%), specificity (86.41%), and MCC (0.84). Gradient Boosting obtained the highest sensitivity (96.85%). The Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) showed that the best three models for accurate prediction of 5mC sites in rice were Random Forest, Gradient Boosting, and Support Vector Machine. We developed an R package, ‘GB5mCPred,’ which is available on CRAN (https://cran.r-project.org/web/packages/GB5mcPred/index.html). A user-friendly prediction server based on this algorithm is also available (http://cabgrid.res.in:5474/).
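
To make the ranking step concrete, the following is a minimal sketch of the standard TOPSIS procedure in Python. It assumes vector normalisation, user-supplied criterion weights, and Euclidean distances; the abstract does not state which weights or normalisation the authors used, and the model names and metric values in the example are hypothetical placeholders, not results from this study.

```python
import math


def topsis(matrix, weights, benefit):
    """Return one closeness score in [0, 1] per row (alternative).

    matrix  : rows are alternatives, columns are criteria
    weights : one weight per criterion
    benefit : True where larger is better, False where smaller is better
    """
    n_cols = len(weights)

    # Step 1: vector-normalise each column, then apply the weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_cols)]
    weighted = [
        [weights[j] * row[j] / norms[j] if norms[j] else 0.0 for j in range(n_cols)]
        for row in matrix
    ]

    # Step 2: ideal and anti-ideal points, one value per criterion.
    columns = list(zip(*weighted))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(columns)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(columns)]

    # Step 3: relative closeness to the ideal point.
    scores = []
    for row in weighted:
        d_pos = math.dist(row, ideal)
        d_neg = math.dist(row, anti)
        denom = d_pos + d_neg
        scores.append(d_neg / denom if denom else 0.0)
    return scores


# Hypothetical example: three placeholder models scored on accuracy,
# sensitivity, and specificity (illustrative numbers only).
example_models = ["model_a", "model_b", "model_c"]
example_matrix = [
    [0.90, 0.92, 0.88],
    [0.85, 0.95, 0.80],
    [0.80, 0.82, 0.78],
]
example_scores = topsis(example_matrix, [1 / 3, 1 / 3, 1 / 3], [True, True, True])
example_ranking = sorted(zip(example_models, example_scores), key=lambda p: -p[1])
```

A score near 1 means a model lies close to the ideal point on every criterion, so models with nearly equal scores are ranked as near-equivalent.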

Conclusion

With nearly equal TOPSIS scores, Random Forest, Gradient Boosting, and Support Vector Machine emerged as the best three models. A likely explanation lies in their architectural design: they are gradual learning models that can capture 5mC sites more accurately than the other learning models.

DOI: 10.2174/0115748936285544231221113226
2024-04-08
2025-06-23

Supplements

Supplementary material is available on the publisher’s website along with the published article.