Volume 19, Issue 2
  • ISSN: 2666-2558
  • E-ISSN: 2666-2566

Abstract

Background

Unbalanced datasets present a significant challenge in machine learning, often leading to biased models that favor the majority class. Recent oversampling techniques like SMOTE, Borderline SMOTE, and ADASYN attempt to mitigate these issues. This study investigates these techniques in conjunction with machine learning models like SVM, decision tree, and logistic regression. The results reveal critical challenges such as noise amplification and overfitting, which we address by refining the oversampling approaches to improve model performance and generalization.

Aim

To address the challenge of unbalanced datasets, the minority class is oversampled to balance it against the majority class. The oversampling techniques used in this work are SMOTE (Synthetic Minority Oversampling Technique), Borderline SMOTE, and ADASYN (Adaptive Synthetic Sampling).
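
As a rough illustration of the idea shared by these techniques, the sketch below shows the core SMOTE step: a synthetic minority point is placed at a random position on the line segment between a minority sample and one of its k nearest minority neighbours. The function name `smote_sample`, its parameters, and the toy data are assumptions made for this example; they are not taken from the study.

```python
import math
import random


def smote_sample(minority, k=5, n_new=10, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Hypothetical helper for illustration only. `minority` is a list of
    equal-length numeric feature vectors and must hold at least two points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Find the k nearest minority neighbours of the base point.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Interpolate at a random position between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two features per sample (made-up values).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sample(minority, k=2, n_new=6)
```

Because every synthetic point lies between two existing minority samples, a noisy minority sample can spread its noise to the new points, which is one way the noise amplification mentioned in the Background can arise.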

Objective

To perform a comprehensive analysis of oversampling methods for handling the class imbalance problem with ML classifiers.

Methods

The proposed methodology uses BERT, which removes the need for a separate pre-processing step. Oversampling techniques proposed in the literature are used to balance the data, followed by feature extraction and text classification. Experiments are performed with four ML classification algorithms: decision tree (DT), logistic regression (LR), support vector machine (SVM), and random forest (RF).
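
Of the oversampling techniques used, Borderline SMOTE differs from plain SMOTE in which minority samples it interpolates from. A minimal sketch of that selection step, following the published description of Borderline SMOTE rather than this study's code, is given below: each minority sample is labelled by how many of its m nearest neighbours in the whole training set belong to the majority class, and only the "danger" samples are then used as seeds for synthesis. The function name, parameters, and toy data are hypothetical.

```python
import math


def classify_minority(samples, labels, minority_label, m=5):
    """Label each minority sample 'safe', 'danger', or 'noise'.

    Hypothetical helper for illustration only. A sample is 'noise' if all
    of its m nearest neighbours are majority, 'danger' if at least half
    (but not all) are majority, and 'safe' otherwise.
    """
    categories = {}
    for i, (x, y) in enumerate(zip(samples, labels)):
        if y != minority_label:
            continue
        # Indices of the m nearest neighbours among all other samples.
        order = sorted(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: math.dist(x, samples[j]),
        )[:m]
        n_majority = sum(labels[j] != minority_label for j in order)
        if n_majority == len(order):
            categories[i] = "noise"
        elif n_majority >= len(order) / 2:
            categories[i] = "danger"
        else:
            categories[i] = "safe"
    return categories


# Toy data (made-up): label 1 is the minority class, label 0 the majority.
samples = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),   # minority cluster
    (2.0, 2.0), (2.1, 2.0), (2.0, 2.1), (2.1, 2.1),   # majority cluster
    (2.05, 2.05),                                     # minority inside majority
    (1.5, 1.5),                                       # minority near the border
]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
categories = classify_minority(samples, labels, minority_label=1, m=3)
```

With these toy values, the four clustered minority points come out as safe, the point surrounded by majority samples as noise, and the point near the class boundary as danger.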

Results

The results show an improvement for SVM combined with Borderline SMOTE, which reaches an accuracy of 71.9% and an MCC (Matthews correlation coefficient) value of 0.53.
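
To make the two reported metrics concrete, the sketch below computes accuracy and MCC from the four counts of a binary confusion matrix. It assumes the binary form of MCC; the abstract does not state whether the task is binary or multi-class, and the counts used here are invented for illustration, not taken from the study.

```python
import math


def accuracy_and_mcc(tp, tn, fp, fn):
    """Return (accuracy, MCC) for binary confusion-matrix counts.

    Hypothetical helper for illustration only. MCC is defined as 0.0 here
    when any marginal total is zero, which is a common convention.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc


# A classifier that always predicts the majority class on a 90/10 split.
acc_majority, mcc_majority = accuracy_and_mcc(tp=0, tn=90, fp=0, fn=10)

# A classifier that recovers some of the minority class on the same split.
acc_mixed, mcc_mixed = accuracy_and_mcc(tp=6, tn=80, fp=10, fn=4)
```

In the first case accuracy is 0.9 while MCC is 0.0, which is why MCC is a useful companion to accuracy on imbalanced data: it rewards a model only when it does well on both classes.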

Conclusion

The suggested method supports the development of fairer and more effective ML models by addressing the basic issue of class imbalance.

/content/journals/rascs/10.2174/0126662558347788241127051934
2024-12-10
2026-02-22