PredART: Uncertainty-quantified Machine Learning Prediction of Androgen Receptor Agonists Overcoming Imbalanced Dataset

Abstract

Aim

This study aims to develop and validate a machine learning-based model for the accurate prediction of androgen receptor (AR) agonistic toxicity, addressing the challenges posed by data imbalance in existing predictive models.

Background

Anomalous agonistic activity of the androgen receptor is a known major indicator of reproductive toxicity, which can lead to prostate cancer. Machine learning-based models have been developed for the rapid prediction of such agonists. However, the existing models have exhibited biased learning outcomes and low sensitivity due to the imbalance in the available training data. In the early screening process of drug discovery, low sensitivity caused by data imbalance can hinder the detection of potentially toxic compounds.

Objective

The objective of this study is to develop a machine learning prediction model that classifies whether a drug candidate is an androgen receptor agonist, with more balanced performance than existing models.

Methods

PredART is a bootstrap-aggregated k-nearest neighbor model for the balanced prediction of androgen receptor agonistic toxicity. It was trained on 381 active and 8,089 inactive compounds, represented by their structural features.
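The idea of bootstrap-aggregated k-nearest neighbor classification on imbalanced data can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes that each bootstrap bag is class-balanced by drawing as many inactives as actives, that compounds are represented as sets of "on" fingerprint bits, and that neighbors are ranked by Tanimoto similarity. The function names, the toy compounds, and the values of `k` and `n_estimators` are all hypothetical.

```python
import random
from statistics import mean, pstdev


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' feature bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def knn_predict(train, query, k=3):
    """Fraction of the k most similar training compounds labeled active (1)."""
    ranked = sorted(train, key=lambda item: tanimoto(item[0], query), reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


def balanced_bootstrap(actives, inactives, rng):
    """Draw a class-balanced bag with replacement (assumed balancing scheme)."""
    n = len(actives)
    bag = [(x, 1) for x in rng.choices(actives, k=n)]
    bag += [(x, 0) for x in rng.choices(inactives, k=n)]
    return bag


def bagged_knn(actives, inactives, query, n_estimators=25, k=3, seed=0):
    """Return the mean and standard deviation of the per-bag k-NN outputs."""
    rng = random.Random(seed)
    outputs = [
        knn_predict(balanced_bootstrap(actives, inactives, rng), query, k)
        for _ in range(n_estimators)
    ]
    return mean(outputs), pstdev(outputs)


# Toy, made-up data: 4 "actives" share bits {1, 2, 3}; 40 "inactives" share {5, 6, 7}.
toy_actives = [frozenset({1, 2, 3, i}) for i in range(10, 14)]
toy_inactives = [frozenset({5, 6, 7, i}) for i in range(20, 60)]

active_like = bagged_knn(toy_actives, toy_inactives, frozenset({1, 2, 3, 99}))
inactive_like = bagged_knn(toy_actives, toy_inactives, frozenset({5, 6, 7, 98}))
print(active_like, inactive_like)
```

Because every bag contains equal numbers of actives and inactives, no single k-NN classifier is dominated by the majority class, and the spread of the per-bag outputs is available as a by-product.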

Result

In this work, we propose an advanced model that combines the bootstrap aggregating algorithm with machine learning binary classifiers to identify androgen receptor-based reproductive toxicity while avoiding biased prediction results. The optimal model, which uses k-nearest neighbor classifiers, achieved an accuracy of 0.831, a positive predictive value (PPV) of 0.882, a sensitivity of 0.625, a specificity of 0.951, and a Matthews correlation coefficient (MCC) of 0.633 on external test data, demonstrating a significant improvement in sensitivity compared to the previous study and achieving balanced learning. Furthermore, the standard deviation among the outputs of the classifiers can be calculated and used as a prediction uncertainty; employing this uncertainty as a screening metric to select reliable predictions further enhanced the model's performance.
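The reported metrics all derive from a binary confusion matrix, and a short worked example shows how they relate. The counts below are hypothetical: the abstract does not report the confusion matrix, and these values were chosen only because they happen to reproduce the reported figures to three decimals. The function name `binary_metrics` is likewise an illustrative choice.

```python
from math import sqrt


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(tp, fn, fp, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_div(tp + tn, tp + fn + fp + tn),
        "ppv": safe_div(tp, tp + fp),
        "sensitivity": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "mcc": safe_div(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical counts (NOT taken from the paper): 24 actives, 41 inactives.
metrics = binary_metrics(tp=15, fn=9, fp=2, tn=39)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With a class ratio like this, accuracy and specificity can stay high even when many actives are missed, which is why sensitivity and MCC are the more informative figures for imbalanced data.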

Conclusion

Built on the bootstrap aggregating algorithm, our prediction model effectively addressed data imbalance, and various machine learning and deep learning classifiers were evaluated as a benchmark. Additionally, by quantifying uncertainty, the model provides an intuitive assessment of prediction reliability during large-scale screening.
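Using the ensemble spread as a reliability filter can be sketched in a few lines. This is an illustration under stated assumptions, not the published procedure: the population standard deviation is used, and the threshold `max_std`, the decision `cutoff`, the compound identifiers, and the per-classifier outputs are all made up for the example.

```python
from statistics import mean, pstdev


def screen_by_uncertainty(ensemble_outputs, max_std=0.2, cutoff=0.5):
    """Keep only predictions whose per-classifier outputs agree closely.

    ensemble_outputs maps a compound id to the list of outputs produced by
    the individual classifiers in the ensemble. Returns a dict mapping each
    retained id to (predicted_label, mean_score, std).
    """
    kept = {}
    for compound_id, outputs in ensemble_outputs.items():
        score = mean(outputs)
        spread = pstdev(outputs)  # assumed definition of prediction uncertainty
        if spread <= max_std:
            kept[compound_id] = (int(score >= cutoff), score, spread)
    return kept


# Made-up ensemble outputs for three hypothetical compounds.
example_outputs = {
    "cmpd_A": [0.9, 1.0, 0.8, 1.0],  # classifiers agree: likely agonist
    "cmpd_B": [0.1, 0.9, 0.2, 0.8],  # classifiers disagree: high uncertainty
    "cmpd_C": [0.0, 0.1, 0.0, 0.1],  # classifiers agree: likely non-agonist
}
reliable = screen_by_uncertainty(example_outputs)
for compound_id, (label, score, spread) in reliable.items():
    print(f"{compound_id}: label={label} mean={score:.3f} std={spread:.3f}")
```

In a large-scale screen, compounds that fail the filter (here `cmpd_B`) would be flagged for follow-up rather than assigned a label with unwarranted confidence.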

/content/journals/cbio/10.2174/0115748936355551241220190451
2025-01-02
2025-05-09
