Enhancing Arabic Web Article Classification through Label Noise Reduction and Class-Balanced Training

Mohammed Ahmed Aledresi
Akram Alsubari

Abstract

We present an applied, data-centric study on improving large-scale Arabic web-article classification by detecting and correcting label noise in a 451K-article corpus. Building on a fine-tuned AraBERTv02 baseline, we develop an automated noise-detection and relabelling pipeline that combines model confidence, MC-dropout ensemble agreement, and probabilistic flagging inspired by Confident Learning. In experiments on a stratified, reduced subset of the corpus (train = 50k, validation = 5k, test = 5k), we automatically relabelled 880 training examples and retrained the classifier. On the held-out 5k test set, the model trained on cleaned labels improved Macro-F1 from 0.9367 (baseline) to 0.9503. The gain is statistically significant under McNemar's test (p ≈ 1.9e-5), with a bootstrap 95% CI for Δ Macro-F1 of [0.0082, 0.0190]. We analyse per-class gains, error modes, and the cost trade-offs of automated relabelling versus manual review, and we release anonymized preprocessing scripts and experiment artifacts to support reproducibility. Our results show that conservative automatic relabelling, combined with uncertainty estimation, can yield meaningful, reproducible improvements for large-scale Arabic article classification.
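
The abstract names three signals for relabelling (model confidence, MC-dropout agreement, and Confident-Learning-style flagging) but does not give the exact decision rule. The following is a minimal sketch of one way such signals could be combined conservatively, not the authors' implementation. The function name `flag_for_relabel`, the thresholds `conf_threshold` and `agree_threshold`, and the per-class threshold definition are all assumptions made for illustration.

```python
import numpy as np

def flag_for_relabel(mc_probs, given_labels, conf_threshold=0.95, agree_threshold=0.9):
    """Illustrative conservative relabelling rule (hypothetical, not from the paper).

    mc_probs:     array of shape (T, N, C), softmax outputs from T stochastic
                  (dropout-enabled) forward passes over N examples and C classes.
    given_labels: array of shape (N,), the possibly noisy integer labels.
    Returns (relabel_mask, proposed_labels), both of shape (N,).
    """
    mc_probs = np.asarray(mc_probs, dtype=float)
    given_labels = np.asarray(given_labels)

    # Signal 1: confidence of the averaged MC-dropout prediction.
    mean_probs = mc_probs.mean(axis=0)       # (N, C)
    proposed = mean_probs.argmax(axis=1)     # (N,)
    confidence = mean_probs.max(axis=1)      # (N,)

    # Signal 2: fraction of passes whose argmax matches the proposed label.
    votes = mc_probs.argmax(axis=2)          # (T, N)
    agreement = (votes == proposed[None, :]).mean(axis=0)

    # Signal 3: Confident-Learning-style per-class threshold, taken here as the
    # mean predicted probability of class k over examples currently labelled k.
    # Classes with no labelled examples keep a threshold of 1.0 (rarely met).
    n_classes = mean_probs.shape[1]
    class_thresh = np.ones(n_classes)
    for k in range(n_classes):
        members = given_labels == k
        if members.any():
            class_thresh[k] = mean_probs[members, k].mean()
    cl_flag = confidence >= class_thresh[proposed]

    # Relabel only when every signal agrees that the given label is wrong.
    relabel_mask = (
        (proposed != given_labels)
        & (confidence >= conf_threshold)
        & (agreement >= agree_threshold)
        & cl_flag
    )
    return relabel_mask, proposed
```

Requiring all three signals to agree trades recall for precision: fewer examples are relabelled, but each change is less likely to introduce a new error. That matches the conservative stance described in the abstract, although the thresholds actually used would need to be taken from the released artifacts.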

Article Details

How to Cite
Aledresi, M. A., & Alsubari, A. (2026). Enhancing Arabic Web Article Classification through Label Noise Reduction and Class-Balanced Training. International Journal on Advanced Electrical and Computer Engineering, 15(1S), 110–118. Retrieved from https://journals.mriindia.com/index.php/ijaece/article/view/1347
Section
Articles
