20D-Dynamic Representation of Protein Sequences Combined with K-means Clustering

Abstract

Objective

The objective of this research is to demonstrate that alignment-free bioinformatics approaches are effective tools for analyzing the similarity and dissimilarity of protein sequences. All numerical parameters representing sequences are expressed analytically, ensuring precision, clarity, and efficient processing, even for large datasets and long sequences. Additionally, a novel approach for identifying previously unknown virus strains is introduced.

Methods

A novel approach is proposed that integrates our newly developed method, the 20D-Dynamic Representation of Protein Sequences, with the K-means clustering algorithm. The sequences are represented as clouds of material points in a 20-dimensional space (20D-dynamic graphs), whose spatial distribution is unique to each protein sequence. The numerical parameters, referred to as descriptors in molecular similarity theory, represent quantities characteristic of dynamic systems and serve as input data for the K-means clustering algorithm.
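
The pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions the abstract does not state: the order of the amino-acid axes, the walk rule (one unit step along the residue's axis, with a unit mass left at each position), the choice of descriptors (centre of mass plus mean squared distance from it), and a plain Lloyd-style K-means with a deterministic farthest-point initialisation. All function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed axis order; the abstract does not fix one
AXIS = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def dynamic_graph(sequence):
    """Return the cloud of unit-mass points traced by a walk in 20D space."""
    position = [0] * 20
    points = []
    for residue in sequence:
        position[AXIS[residue]] += 1  # assumed rule: one unit step per residue
        points.append(tuple(position))
    return points


def descriptors(points):
    """Centre of mass (20 values) plus mean squared distance from it (1 value)."""
    n = len(points)
    centre = [sum(p[k] for p in points) / n for k in range(20)]
    spread = sum(
        sum((p[k] - centre[k]) ** 2 for k in range(20)) for p in points
    ) / n
    return centre + [spread]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def initial_centroids(vectors, k):
    """Farthest-point initialisation, chosen here only to keep the sketch deterministic."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def k_means(vectors, k, iterations=100):
    """Plain Lloyd iteration; returns one cluster label per input vector."""
    centroids = initial_centroids(vectors, k)
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy sequences, invented for illustration only.
toy = ["AAAAAC", "AAAACA", "WWWWWY", "WWWWYW"]
toy_labels = k_means([descriptors(dynamic_graph(s)) for s in toy], k=2)
```

On these toy inputs the two alanine-rich strings fall in one cluster and the two tryptophan-rich strings in the other. Real use would require the descriptors defined in the full paper and, most likely, an established K-means implementation.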

Results

Examples of the application of the approach are presented, including projections of the 20D-dynamic graphs onto 3D spaces, which serve as a visual tool for comparing sequences. Cluster plots produced by the proposed method are also provided for the analyzed sequences.
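
A projection of a 20D-dynamic graph onto a 3D space can be sketched as follows. This is a hedged illustration and not the authors' code: it assumes the same unit-step walk as in a simple 20D representation and assumes that a 3D projection keeps the coordinates of three chosen amino-acid axes. The function name and the default axes ("A", "G", "L") are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet of the 20 standard residues


def project_walk(sequence, axes=("A", "G", "L")):
    """Trace the walk, keeping only the coordinates of three chosen residue axes."""
    counts = {aa: 0 for aa in axes}
    projected = []
    for residue in sequence:
        if residue not in AMINO_ACIDS:
            raise ValueError(f"unexpected residue: {residue!r}")
        if residue in counts:
            counts[residue] += 1  # the point moves only along a kept axis
        # a point is recorded at every step, even when the projection stands still
        projected.append(tuple(counts[aa] for aa in axes))
    return projected
```

The resulting list of (x, y, z) tuples can be passed to any 3D scatter-plot routine to compare sequences visually.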

Conclusion

It has been demonstrated that the 20D-Dynamic Representation of Protein Sequences, combined with the K-means clustering algorithm, successfully classifies subtypes of influenza A virus strains.

/content/journals/cchts/10.2174/0113862073359729250220131623
2025-02-26
2025-03-28
