NaturePred: A Tool for Revolutionizing Natural Product Classification with Artificial Intelligence

Abstract

Background

The identification and classification of natural products are vital to drug discovery and bioactive compound exploration. Traditional methods are laborious and time-consuming, which motivates AI-based tools that can make accurate predictions.

Objectives

This paper presents NaturePred, a user-friendly tool that predicts the class of natural products and calculates eight physicochemical properties of protein sequences. It targets five distinct classes of natural product biosynthetic gene clusters (BGCs): Polyketide Synthases (PKS), Non-ribosomal Peptide Synthetases (NRPS), Ribosomally Synthesized and Post-Translationally Modified Peptides (RiPPs), Terpenes, and PKS-NRPS Hybrids. To keep multi-class predictions reliable, it applies a 90% confidence score threshold.

Method

NaturePred offers three input options: a single protein sequence, a CSV file, or a GenBank (.gbk) file. It uses a pipeline that combines a Natural Language Processing model based on TF-IDF (Term Frequency-Inverse Document Frequency) with a Logistic Regression classifier. A class is predicted only if the confidence score exceeds 90%; otherwise, the tool returns “None of the above class”. Evaluation on unseen data from the MIBiG database shows high accuracy (~96%) in assigning BGCs.
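The featurization and decision rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not NaturePred's implementation: the abstract does not say how sequences are tokenized, so the overlapping 3-mer "words", the plain tf × idf weighting, and the names `kmers`, `tfidf`, and `decide` are all hypothetical. The class probabilities passed to `decide` stand in for the output of a trained Logistic Regression classifier, which is not reproduced here.

```python
import math
from collections import Counter

THRESHOLD = 0.90                      # confidence cut-off stated in the abstract
FALLBACK = "None of the above class"  # label returned below the cut-off


def kmers(seq, k=3):
    """Split a protein sequence into overlapping k-mers (assumed tokenization)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document (plain tf * idf)."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return vectors


def decide(probabilities, threshold=THRESHOLD):
    """Return the top class only if its probability exceeds the threshold."""
    label, score = max(probabilities.items(), key=lambda kv: kv[1])
    return label if score > threshold else FALLBACK


# Toy usage: two made-up sequences and made-up classifier probabilities.
vectors = tfidf([kmers("MKTAYIAK"), kmers("MKTLLVAG")])
confident = decide({"PKS": 0.95, "NRPS": 0.03, "RiPPs": 0.02})
uncertain = decide({"PKS": 0.60, "NRPS": 0.40})
```

A term shared by every document (here `MKT`) receives zero weight, which is the usual effect of the inverse document frequency factor. The strict `>` comparison mirrors the wording "exceeds 90%".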

Results

NaturePred provides accurate predictions with high confidence scores, demonstrating reliability across different datasets. It calculates eight physicochemical properties of protein sequences, offering valuable insights for further analysis.
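The abstract does not name the eight physicochemical properties, so the sketch below only illustrates how sequence-level properties of this kind are typically computed. The three properties chosen here (length, aromaticity as the fraction of F, W, and Y residues, and GRAVY as the mean Kyte-Doolittle hydropathy) are assumptions, as is the function name `summarize`; the sketch also assumes the input contains only the 20 standard amino acids.

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def aromaticity(seq):
    """Fraction of aromatic residues (F, W, Y) in the sequence."""
    seq = seq.upper()
    return sum(seq.count(a) for a in "FWY") / len(seq)


def gravy(seq):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    seq = seq.upper()
    return sum(KYTE_DOOLITTLE[a] for a in seq) / len(seq)


def summarize(seq):
    """Collect the illustrative properties for one non-empty sequence."""
    if not seq:
        raise ValueError("sequence must not be empty")
    return {
        "length": len(seq),
        "aromaticity": aromaticity(seq),
        "gravy": gravy(seq),
    }


# Toy usage on a made-up peptide.
properties = summarize("FWYA")
```

Libraries such as Biopython provide comparable calculations for protein sequences; the hand-written version above is kept only so that the example has no external dependencies.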

Conclusion

NaturePred's integrated features, including versatile input options, accurate predictions, and physicochemical property calculations, make it an indispensable tool in natural product research. By addressing classification challenges, NaturePred facilitates drug discovery and bioactive compound exploration. The tool is available at http://login1.cabgrid.res.in:5101/.

/content/journals/cp/10.2174/0115701646322417241101055512
2025-01-01
2025-04-11