Volume 20, Issue 6
  • ISSN: 1574-8936
  • E-ISSN: 2212-392X

Abstract

Background

The human genome is densely populated with repetitive DNA sequences that play crucial roles in genomic function and structure but are also implicated in over 40 human diseases. Identifying and characterizing these repeats is computationally challenging because the size and complexity of the genome overwhelm traditional algorithms.

Methods

To address these challenges, we propose GenRepAI, a deep learning framework that navigates and analyzes genomic suffix trees. GenRepAI combines supervised machine learning classifiers, trained on labeled datasets of repeat annotations, with unsupervised anomaly detection to identify novel repeat sequences. The models are built on convolutional neural networks (CNNs), long short-term memory networks (LSTMs), and vision transformers, which classify and annotate repeats within the human genome.
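
As a rough illustration of the suffix-based indexing mentioned above, the sketch below finds the longest repeated substring in a short DNA string. It is not GenRepAI's implementation: it swaps in a naively built suffix array (a compact relative of the suffix tree) for brevity, and the function names `suffix_array`, `lcp_array`, and `longest_repeat` are hypothetical. Genome-scale work would need a linear-time construction.

```python
def suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix text.
    # Acceptable only for short illustrative sequences.
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    # lcp[k] is the length of the common prefix shared by the suffixes
    # starting at sa[k - 1] and sa[k]; lcp[0] is 0 by convention.
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        lcp[k] = n
    return lcp


def longest_repeat(text):
    # The longest substring occurring at least twice (overlaps allowed)
    # corresponds to the largest entry of the LCP array.
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    if not lcp:
        return ""
    k = max(range(len(lcp)), key=lcp.__getitem__)
    return text[sa[k]:sa[k] + lcp[k]]


if __name__ == "__main__":
    # A CAG trinucleotide run, used here purely as a toy input.
    print(longest_repeat("CAGCAGCAGT"))
```

In a pipeline of the kind the abstract describes, candidate repeats located this way would presumably be passed on to the classifiers; that hand-off is an assumption here, not a detail given in the source.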

Results

GenRepAI is designed to comprehensively profile repeats that underlie various neurological diseases, allowing researchers to identify pathogenic expansions. The framework will integrate into existing genomic analysis pipelines, with the capability to screen patient genomes and highlight potential causal variants for further validation.

Conclusion

GenRepAI is set to become a foundational tool in genomics, leveraging artificial intelligence to enhance the characterization of repetitive sequences. It promises significant advancements in the molecular diagnosis of repeat expansion disorders and contributes to a deeper understanding of genomic structure and function, with broad applications in personalized medicine.

/content/journals/cbio/10.2174/0115748936303435240702112205
2025-07-01
2025-05-28