MAAD: A Multi-Label Arabic Dataset for Transformer-Based News Summarization and Classification

Marwah Yahya Al-Nahari
Ayedh Abdulaziz Mohsen
Nada Abdu Al-Humidi
Akram Alsubari

Abstract

This paper presents the Multi-Label Arabic Articles Dataset (MAAD), a corpus of 602,792 news articles from six prominent Arabic media outlets covering ten subject areas. To address the lack of high-quality Arabic datasets for deep learning, MAAD underwent extensive pre-processing: noise filtering, duplicate elimination using hashing with cosine similarity, linguistic normalization, and topic validation through LDA modeling and expert review, which achieved 95% categorization accuracy. In contrast to conventional single-task corpora, the multi-label structure of MAAD allows several NLP tasks to be carried out simultaneously. To evaluate the dataset, four transformer models (ArabicT5, AraBART, mT5, and GPT) were fine-tuned within a single text-to-text framework for both classification and abstractive summarization. Standard metrics were used in the evaluation: accuracy, precision, recall, and F1-score for classification, and ROUGE-1, ROUGE-2, ROUGE-L, and BLEU for summarization. ArabicT5 outperformed all comparative models, achieving ROUGE-1 = 0.90, ROUGE-2 = 0.90, ROUGE-L = 0.90, BLEU = 0.81, and classification accuracy = 0.98 with consistent scores (Precision, Recall, F1 ≥ 0.95). Furthermore, a human evaluation of the generated summaries produced high ratings for Fluency (4.86) and Adequacy (4.35), supporting the model's ability to generate coherent and semantically accurate Arabic text. These results indicate that language-specific pretraining greatly enhances model performance on the intricate morphology and syntax of Arabic. Consequently, MAAD serves as a robust foundation and a practical resource for advancing Arabic Natural Language Processing (NLP), with direct applications in improving the precision of automated journalism and in optimizing news processing and content aggregation workflows.
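
The duplicate-elimination step mentioned above (hashing combined with cosine similarity) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the whitespace tokenization, the simplified normalization, and the 0.9 similarity threshold are all assumptions chosen for illustration. Exact duplicates are caught by a SHA-256 fingerprint of the normalized text, and near-duplicates by cosine similarity between bag-of-words vectors.

```python
import hashlib
import math
import re
from collections import Counter


def normalize(text: str) -> str:
    # Simplified stand-in for real Arabic normalization (assumption):
    # collapse whitespace and lowercase.
    return re.sub(r"\s+", " ", text).strip().lower()


def fingerprint(text: str) -> str:
    # Hash of the normalized text; identical articles share a fingerprint.
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(articles, threshold=0.9):
    # threshold is a hypothetical value, not one reported in the paper.
    seen_hashes = set()
    kept = []  # pairs of (original text, term-frequency vector)
    for text in articles:
        digest = fingerprint(text)
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        vector = Counter(normalize(text).split())
        if any(cosine(vector, other) >= threshold for _, other in kept):
            continue  # near-duplicate of an article already kept
        seen_hashes.add(digest)
        kept.append((text, vector))
    return [text for text, _ in kept]
```

The pairwise comparison is quadratic in the number of kept articles, so a corpus of several hundred thousand articles would in practice need an approximate method such as locality-sensitive hashing; the sketch only shows the two-stage idea.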

Article Details

How to Cite
Al-Nahari, M. Y., Mohsen, A. A., Al-Humidi, N. A., & Alsubari, A. (2026). MAAD: A Multi-Label Arabic Dataset for Transformer-Based News Summarization and Classification. International Journal on Advanced Electrical and Computer Engineering, 15(1S), 42–61. Retrieved from https://journals.mriindia.com/index.php/ijaece/article/view/1342