GenRepAI: Utilizing Artificial Intelligence to Identify Repeats in Genomic Suffix Trees

Abstract

Background

The human genome is densely populated with repetitive DNA sequences that play crucial roles in genomic function and structure but are also implicated in more than 40 human diseases. Identifying and characterizing these repeats is a significant computational challenge because the size and complexity of the genome overwhelm traditional algorithms.

Methods

To address these challenges, we propose GenRepAI, a deep learning framework for navigating and analyzing genomic suffix trees. GenRepAI combines supervised machine learning classifiers, trained on labeled repeat annotations, with unsupervised anomaly detection to identify novel repeat sequences. The models are built on convolutional neural networks (CNNs), long short-term memory networks (LSTMs), and vision transformers, and are used to classify and annotate repeats within the human genome.
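As a rough illustration of the kind of repeat candidates a suffix-based index can expose before any learned model is applied, the sketch below finds repeated substrings in a short DNA string. It is not GenRepAI's implementation: it assumes a naive suffix array (a close relative of the suffix tree) as a stand-in, and the names `suffix_array`, `lcp_array`, and `find_repeats` are hypothetical helpers introduced only for this example.

```python
def suffix_array(text):
    """Naive suffix array: suffix start positions in lexicographic order.

    Sorting full suffix slices is fine for a toy example but far too slow
    and memory-hungry for a real genome.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[k] is the length of the common prefix of suffixes sa[k-1] and sa[k]."""
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        lcp[k] = n
    return lcp


def find_repeats(text, min_len):
    """Map each repeated substring found via adjacent suffixes to its start positions.

    A substring is reported when two lexicographically adjacent suffixes share
    a prefix of at least min_len characters. Positions include overlapping
    occurrences.
    """
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    candidates = {text[sa[k]:sa[k] + lcp[k]] for k in range(1, len(sa)) if lcp[k] >= min_len}

    repeats = {}
    for sub in candidates:
        positions = []
        pos = text.find(sub)
        while pos != -1:
            positions.append(pos)
            pos = text.find(sub, pos + 1)  # step by one to allow overlaps
        repeats[sub] = positions
    return repeats


if __name__ == "__main__":
    for sub, positions in sorted(find_repeats("ACGTACGTTTACGT", 4).items()):
        print(sub, positions)
```

In a pipeline like the one described above, candidates of this kind would presumably be the input that a classifier labels or an anomaly detector flags; that hand-off is an assumption of this sketch, not a detail given in the abstract.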

Results

GenRepAI is designed to comprehensively profile repeats that underlie various neurological diseases, allowing researchers to identify pathogenic expansions. The framework will integrate into existing genomic analysis pipelines, with the capability to screen patient genomes and highlight potential causal variants for further validation.

Conclusion

GenRepAI is intended to become a foundational tool in genomics, leveraging artificial intelligence to enhance the characterization of repetitive sequences. It promises significant advances in the molecular diagnosis of repeat expansion disorders and a deeper understanding of genomic structure and function, with broad applications in personalized medicine.

/content/journals/cbio/10.2174/0115748936303435240702112205
2024-07-09
2025-01-31