PredPVP: A Stacking Model for Predicting Phage Virion Proteins Based on Feature Selection Methods

Abstract

Background

Phage therapy holds broad promise as a novel therapeutic method. Phage virion proteins (PVP) recognize the host and bind to its surface receptors, which makes them highly significant for developing antimicrobial drugs against infectious diseases caused by bacteria. In recent years, several machine-learning-based PVP predictors have been developed, but they usually train the learner on a single feature. In contrast, higher-dimensional feature representations tend to contain more potential sequence information.

Methods

In this work, we construct PredPVP, a stacking model for PVP prediction that combines multiple features with feature selection methods. Specifically, each sequence is first encoded using seven features. Three feature selection methods were then applied to this high-dimensional feature representation to remove redundant features, and the selected features were integrated with eight machine learning algorithms. Finally, the probability features and class features (PCFs) generated by the 24 base models were fed into logistic regression (LR) to train the final model.
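The stacking idea described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation (that is in the linked repository). The abstract does not name the seven encodings, the three selectors, or the eight algorithms, so everything here is a hypothetical stand-in: two simple selectors (`f_score_rank`, `variance_rank`), two simple learners (`fit_centroid`, `fit_knn`), and made-up toy data. For brevity the base models are scored on their own training data, whereas a real pipeline would typically use cross-validated predictions to build the PCFs.

```python
import math

def sigmoid(z):
    z = max(-50.0, min(50.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

# --- Stand-in feature selectors: each returns feature indices, best first ---
def f_score_rank(X, y):
    scores = []
    for j in range(len(X[0])):
        pos = [x[j] for x, t in zip(X, y) if t == 1]
        neg = [x[j] for x, t in zip(X, y) if t == 0]
        mp, mn = sum(pos) / len(pos), sum(neg) / len(neg)
        spread = sum((v - mp) ** 2 for v in pos) + sum((v - mn) ** 2 for v in neg)
        scores.append((mp - mn) ** 2 / (spread + 1e-9))
    return sorted(range(len(scores)), key=lambda j: -scores[j])

def variance_rank(X, y):
    scores = []
    for j in range(len(X[0])):
        col = [x[j] for x in X]
        m = sum(col) / len(col)
        scores.append(sum((v - m) ** 2 for v in col) / len(col))
    return sorted(range(len(scores)), key=lambda j: -scores[j])

# --- Stand-in base learners: each returns a function x -> P(positive) ---
def fit_centroid(X, y):
    d = len(X[0])
    cp = [sum(x[j] for x, t in zip(X, y) if t == 1) / y.count(1) for j in range(d)]
    cn = [sum(x[j] for x, t in zip(X, y) if t == 0) / y.count(0) for j in range(d)]
    def predict(x):
        dp = sum((a - b) ** 2 for a, b in zip(x, cp))
        dn = sum((a - b) ** 2 for a, b in zip(x, cn))
        return sigmoid(dn - dp)  # closer to the positive centroid -> higher score
    return predict

def fit_knn(X, y, k=3):
    def predict(x):
        dist = lambda i: sum((a - b) ** 2 for a, b in zip(x, X[i]))
        nearest = sorted(range(len(X)), key=dist)[:k]
        return sum(y[i] for i in nearest) / k  # fraction of positive neighbours
    return predict

# --- Meta-learner: logistic regression fitted by batch gradient descent ---
def fit_logistic(Z, y, lr=0.5, epochs=500):
    w = [0.0] * (len(Z[0]) + 1)  # w[0] is the bias
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for z, t in zip(Z, y):
            err = sigmoid(w[0] + sum(wi * zi for wi, zi in zip(w[1:], z))) - t
            grad[0] += err
            for j, zj in enumerate(z):
                grad[j + 1] += err * zj
        w = [wi - lr * g / len(Z) for wi, g in zip(w, grad)]
    return lambda z: sigmoid(w[0] + sum(wi * zi for wi, zi in zip(w[1:], z)))

def build_stack(X, y, selectors, learners, k_features=2):
    base = []  # one (kept feature indices, fitted model) pair per selector x learner
    for select in selectors:
        keep = select(X, y)[:k_features]
        Xs = [[x[j] for j in keep] for x in X]
        for fit in learners:
            base.append((keep, fit(Xs, y)))
    def pcf(x):
        feats = []
        for keep, model in base:
            p = model([x[j] for j in keep])
            feats += [p, 1.0 if p >= 0.5 else 0.0]  # probability + class feature
        return feats
    meta = fit_logistic([pcf(x) for x in X], y)
    return pcf, meta

# Made-up toy data: 4 numeric features per "sequence"; label 1 = PVP, 0 = non-PVP
X = [[0.9, 5.0, 0.8, 1.0], [0.8, -5.0, 0.7, 1.0], [1.0, 4.0, 0.9, 1.0], [0.7, -4.0, 0.6, 1.0],
     [0.1, 5.0, 0.2, 1.0], [0.2, -5.0, 0.3, 1.0], [0.0, 4.0, 0.1, 1.0], [0.3, -4.0, 0.4, 1.0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

pcf, meta = build_stack(X, y, [f_score_rank, variance_rank], [fit_centroid, fit_knn])
print(len(pcf(X[0])))  # 2 selectors x 2 learners x (probability + class) = 8 PCFs
print(meta(pcf(X[0])), meta(pcf(X[4])))  # final scores for one positive, one negative
```

With three selectors and eight learners in place of the stand-ins, the same loop would yield 24 base models and 48 PCFs per sequence, matching the counts given in the abstract.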

Results

On the independent test set, PredPVP outperforms existing predictors, achieving an AUC of 93.4%.
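For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by pairwise comparison; the function name `roc_auc` and the labels and scores are made up for illustration and are unrelated to the PredPVP results.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predictions: label 1 = PVP, 0 = non-PVP
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(labels, scores)
print(auc)  # 8 of the 9 positive/negative pairs are ordered correctly
```

The pairwise form is quadratic in the number of examples, which is fine for illustration; rank-based formulas give the same value more efficiently on large test sets.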

Conclusion

We expect PredPVP to serve as a tool for large-scale PVP recognition, providing a new route to the development of novel antimicrobials and accelerating their application in actual treatment. The datasets and source code used in this study are available at https://github.com/caoqian23/PredPVP.

/content/journals/cbio/10.2174/0115748936330198240924110742
2024-10-28
2024-11-23
Article Type: Research Article
Keywords: Feature selection; Machine learning; Stacking learning; Phage virion protein