Exploring Coding Sequence Length Distributions Across Taxonomic Kingdoms Based on Maximum Information Principle

Abstract

Background

Genetic information about organisms' traits is encoded and stored in deoxyribonucleic acid (DNA) sequences. How genomes store this information has long been a fundamental question for geneticists and biophysicists.

Objective

The objective of this study was to investigate the distribution of coding sequence (CDS) lengths in species genomes across different kingdoms.

Methods

In this study, we applied the maximum entropy principle and the gamma distribution model to a comprehensive dataset comprising virus, archaea, bacteria, and eukaryote species.
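The abstract does not state which constraints were imposed, so the following is only a sketch of the standard textbook link between the two methods named above: the gamma distribution is the maximum-entropy density on $x > 0$ when the mean of $x$ and the mean of $\ln x$ are fixed. The symbols $x$ (CDS length), $\mu$, $\nu$, $\lambda_i$, $k$ (shape), and $\theta$ (scale) are notation assumed here, not taken from the paper.

```latex
% Maximize the entropy of the length density p(x), x > 0,
% subject to normalization and two assumed moment constraints.
\begin{align}
  \max_{p}\; H[p] &= -\int_0^\infty p(x)\,\ln p(x)\,dx \\
  \text{s.t.}\quad \int_0^\infty p(x)\,dx &= 1,\qquad
  \int_0^\infty x\,p(x)\,dx = \mu,\qquad
  \int_0^\infty \ln x\;p(x)\,dx = \nu .
\end{align}
% Setting the functional derivative of the Lagrangian to zero gives
\begin{equation}
  p(x) = \exp\!\left(-\lambda_0 - \lambda_1 x - \lambda_2 \ln x\right)
       = C\,x^{-\lambda_2}\,e^{-\lambda_1 x}.
\end{equation}
% Identifying k = 1 - \lambda_2 and \theta = 1/\lambda_1 and normalizing:
\begin{equation}
  p(x) = \frac{x^{k-1}\,e^{-x/\theta}}{\Gamma(k)\,\theta^{k}},
  \qquad
  \mathbb{E}[x] = k\theta,\quad
  \operatorname{Var}[x] = k\theta^{2},\quad
  \text{skewness} = \frac{2}{\sqrt{k}} .
\end{equation}
```

Under this form the skewness is positive for every $k > 0$, and for a fixed shape $k$ a larger scale $\theta$ gives a larger mean length.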

Results

Our results revealed distinct patterns in CDS length distributions among kingdoms. CDS lengths exhibited a right-skewed distribution in all kingdoms, with length preferences varying among them. Eukaryotes displayed bimodal distributions, with CDSs longer than those of prokaryotes. Fitting the gamma distribution model revealed differences in shape and scale parameters among kingdoms, with eukaryotes exhibiting larger scale parameters, indicating longer CDSs. Additionally, analysis of moments highlighted the complexity of eukaryotic genomes relative to prokaryotes.
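As a rough illustration of the kind of analysis summarized above, the sketch below fits a gamma distribution to a list of CDS lengths and reports its skewness. It is not the authors' pipeline: the fit uses the method of moments (the abstract does not say which estimator was used), the data are synthetic draws rather than real genomes, and the function names and parameter values are hypothetical.

```python
import random
import statistics


def fit_gamma_moments(lengths):
    """Method-of-moments gamma fit; returns (shape k, scale theta).

    Uses mean = k * theta and variance = k * theta**2.
    """
    mean = statistics.fmean(lengths)
    var = statistics.pvariance(lengths, mu=mean)
    shape = mean * mean / var
    scale = var / mean
    return shape, scale


def sample_skewness(lengths):
    """Third standardized moment; positive values mean a right-skewed sample."""
    mean = statistics.fmean(lengths)
    sd = statistics.pstdev(lengths, mu=mean)
    return sum(((x - mean) / sd) ** 3 for x in lengths) / len(lengths)


# Synthetic stand-ins for CDS lengths in nucleotides. The shape and scale
# values below are illustrative assumptions, not results from the study.
rng = random.Random(0)
prokaryote_like = [rng.gammavariate(2.0, 450.0) for _ in range(20000)]
eukaryote_like = [rng.gammavariate(2.0, 700.0) for _ in range(20000)]

for name, data in [("prokaryote-like", prokaryote_like),
                   ("eukaryote-like", eukaryote_like)]:
    k, theta = fit_gamma_moments(data)
    print(f"{name}: shape={k:.2f} scale={theta:.1f} "
          f"mean={k * theta:.0f} skewness={sample_skewness(data):.2f}")
```

A single gamma component cannot capture the bimodal eukaryotic distributions mentioned above; a mixture model would be needed for that, which this sketch does not attempt.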

Conclusion

These results deepen our understanding of genome evolution and provide valuable insights for biological research.

/content/journals/cbio/10.2174/0115748936355149250108083111
2025-01-30
2025-05-05
