The Use of Gene Expression Profiling to Predict Molecular Subtypes of Breast Cancer by a New Machine Learning Algorithm: Random Forest
- 02 Mar 2024
- 10 Jul 2024
- 14 Oct 2024
Abstract
Breast cancer [BC] is one of the main causes of cancer-related mortality in women. The malignancy has four molecular subtypes, and the efficacy of adjuvant therapy differs among them. Gene expression profiles provide valuable information for patients whose prognosis is not clear from clinical markers and immunohistochemistry.
In this study, we aim to predict the molecular subtypes of BC from a gene expression dataset of BC patients and normal samples, using six well-known machine-learning techniques.
Two microarray datasets, [GSE45827] and [GSE140494], were downloaded from the Gene Expression Omnibus [GEO] database. These datasets comprise 21 normal-tissue samples together with primary invasive breast cancer samples from a cohort analysis [57 basal, 36 HER2, 56 Luminal A, and 66 Luminal B]. We applied the following classifiers: AdaBoost, Random Forest [RF], Artificial Neural Network [ANN], Naïve Bayes [NB], Classification and Regression Tree [CART], and Linear Discriminant Analysis [LDA].
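As an illustration of one of the classifiers listed above, the following is a minimal sketch of a Gaussian Naïve Bayes classifier in plain Python. The Gaussian variant, the function names, the gene count, and the toy expression values are all assumptions made for this example; the abstract does not state which NB variant or implementation was used, and none of the numbers come from the study data.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-6):
    """Estimate a log-prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)

    model = {}
    n_total = len(X)
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # var_floor keeps the variance strictly positive for constant features.
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / n_total), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log-posterior for one sample."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )
    return max(model, key=log_posterior)


# Toy, made-up expression values for three hypothetical genes (not study data).
X_train = [
    [2.1, 7.9, 5.0], [1.9, 8.2, 5.2], [2.3, 8.0, 4.9],  # labelled "normal"
    [6.8, 3.1, 5.1], [7.2, 2.8, 4.8], [7.0, 3.3, 5.3],  # labelled "basal"
]
y_train = ["normal"] * 3 + ["basal"] * 3

nb_model = fit_gaussian_nb(X_train, y_train)
pred_normal = predict_gaussian_nb(nb_model, [2.0, 8.1, 5.1])
pred_basal = predict_gaussian_nb(nb_model, [7.1, 3.0, 5.0])
```

In practice a library implementation would normally be preferred for thousands of probes; the sketch only shows the per-class mean and variance estimation that the method relies on.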
The results show that the RF and NB classifiers outperform the other models in predicting the BC subtype. The RF shows superior performance, with an accuracy ranging between 0.89 and 1.0, in contrast to its competitor NB, which has an average accuracy of 0.91. Our approach perfectly discriminates unaffected cases [normal] from carcinoma; in this case, the RF provides perfect prediction with zero errors. Additionally, we used principal component analysis [PCA], DHWT low-frequency, and DHWT high-frequency to perform dimensionality reduction on the numerous gene expression values. With this data reduction, the LDA achieves up to a 95% improvement in performance. Moreover, feature selection yielded the best performance, which was recorded by the RF with a classification accuracy of 98%.
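The wavelet-based reduction mentioned above can be sketched as follows, assuming that DHWT stands for the discrete Haar wavelet transform, which the abstract does not spell out. The function names, the padding rule for odd-length inputs, and the toy expression vector are hypothetical and chosen only for illustration. Each Haar step halves the number of values, producing a low-frequency (averaging) band and a high-frequency (differencing) band.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_step(values):
    """One level of the orthonormal Haar transform: (low band, high band)."""
    values = list(values)
    if len(values) % 2:
        # Assumption: pad odd-length input by repeating the last value.
        values.append(values[-1])
    pairs = list(zip(values[0::2], values[1::2]))
    low = [(a + b) / SQRT2 for a, b in pairs]   # scaled pairwise sums
    high = [(a - b) / SQRT2 for a, b in pairs]  # scaled pairwise differences
    return low, high


def haar_reduce(values, levels=1, band="low"):
    """Apply `levels` Haar steps to the low band and return the chosen band."""
    approx, detail = list(values), []
    for _ in range(levels):
        approx, detail = haar_step(approx)
    return approx if band == "low" else detail


# Toy, made-up expression vector with eight values (not study data).
expression = [4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 1.0, 3.0]

low, high = haar_step(expression)
reduced_low = haar_reduce(expression, levels=2, band="low")
reduced_high = haar_reduce(expression, levels=2, band="high")
```

With `levels=2`, the eight input values are reduced to two features per band, so a classifier trained on either band sees a quarter of the original dimensions.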
Overall, we provide a successful framework that leads to shorter computation times and smaller ML models, which is especially valuable where memory and time constraints are critical.