Comprehensive Analysis of Oversampling Techniques for Addressing Class Imbalance Employing Machine Learning Models

Abstract

Background

Unbalanced datasets present a significant challenge in machine learning, often leading to biased models that favor the majority class. Recent oversampling techniques like SMOTE, Borderline SMOTE, and ADASYN attempt to mitigate these issues. This study investigates these techniques in conjunction with machine learning models like SVM, Decision Tree, and Logistic Regression. The results reveal critical challenges such as noise amplification and overfitting, which we address by refining the oversampling approaches to improve model performance and generalization.

Aim

To address the challenge of unbalanced datasets, the minority class is oversampled to bring it into balance with the majority class. Oversampling techniques such as SMOTE (Synthetic Minority Oversampling Technique), Borderline SMOTE, and ADASYN (Adaptive Synthetic Sampling) are used in this work.
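The core idea behind SMOTE-style oversampling can be sketched in a few lines. The following is a minimal illustration, not the implementation used in this study: it creates each synthetic point by interpolating between a minority sample and one of its nearest minority neighbours. The function name `smote_sketch`, its parameters, and the toy data are assumptions made for the example; Borderline SMOTE and ADASYN differ mainly in which minority samples they choose as starting points.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Hypothetical helper for illustration only; not the study's code.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority class with two features (made-up values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6, k=2)
```

Because every synthetic point lies on a segment between two real minority samples, noisy or mislabeled minority samples are also propagated, which is one way the noise amplification mentioned in the Background can arise.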

Objective

To perform a comprehensive analysis of various oversampling methods for addressing the class imbalance issue using ML methods.

Method

The proposed methodology uses BERT, which removes the need for a separate pre-processing step. Oversampling techniques proposed in the literature are used to balance the data, followed by feature extraction and then text classification using ML algorithms. Experiments are performed with the ML classification algorithms Decision Tree (DT), Logistic Regression (LR), Support Vector Machine (SVM), and Random Forest (RF).

Result

The results show an improvement for SVM with Borderline SMOTE, which achieves an accuracy of 71.9% and an MCC value of 0.53.
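For readers unfamiliar with the two reported metrics, the sketch below shows how accuracy and the Matthews correlation coefficient (MCC) are computed from a binary confusion matrix. It assumes the binary case for simplicity; the helper name `accuracy_and_mcc` and the counts are invented for illustration and are not the study's data.

```python
import math


def accuracy_and_mcc(tp, tn, fp, fn):
    """Compute accuracy and MCC from binary confusion-matrix counts.

    Hypothetical helper for illustration only.
    """
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc


# A reasonably balanced classifier (made-up counts).
acc, mcc = accuracy_and_mcc(tp=40, tn=45, fp=5, fn=10)

# A degenerate classifier that always predicts the majority class
# on a 90/10 imbalanced set (made-up counts).
acc_majority, mcc_majority = accuracy_and_mcc(tp=0, tn=90, fp=0, fn=10)
```

The second call illustrates why MCC is a useful companion to accuracy on imbalanced data: always predicting the majority class yields high accuracy while MCC stays at zero.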

Conclusion

The suggested method supports the development of fairer and more effective ML models by addressing the basic issue of class imbalance.

/content/journals/rascs/10.2174/0126662558347788241127051934
2024-12-10
2025-01-13
  • Article Type:
    Research Article
Keywords: ADASYN ; SMOTE ; Oversampling ; Borderline SMOTE ; unbalanced data ; machine learning ; SVM