AI Automatic Pronunciation Mistake Detector

Pratham Singh; Rishikesh Singh; Siddhi Sankhe; Narendra Prajapati; Manasi Churi

doi:10.65521/ijacte.v15i1.2626

Authors

Pratham Singh, Dept. of Computer Engineering, Shree L.R. Tiwari College of Engineering
Rishikesh Singh, Dept. of Computer Engineering, Shree L.R. Tiwari College of Engineering
Siddhi Sankhe, Dept. of Computer Engineering, Shree L.R. Tiwari College of Engineering
Narendra Prajapati, Dept. of Computer Engineering, Shree L.R. Tiwari College of Engineering
Manasi Churi, Dept. of Computer Engineering, Shree L.R. Tiwari College of Engineering

Keywords:

Automated Pronunciation Evaluation; Phoneme-Level Assessment; Multilingual ASR; Grapheme-to-Phoneme Conversion; Dynamic Time Warping

Abstract

Automated pronunciation evaluation plays a major role in computer-assisted language learning. It is used most widely for English, but also for many other languages. However, effective multilingual pronunciation assessment systems are not yet fully developed, particularly for Indic languages, which have complex character and phonetic systems. Most existing systems rely on word-level scoring or limited acoustic models, which restricts phoneme-level assessment and makes it hard to accommodate linguistic diversity. In addition, recognition errors from the ASR system reduce the accuracy of the scoring process. This paper proposes a framework for phoneme-level pronunciation assessment for English, Hindi, and Marathi. The system integrates the Whisper ASR model, word-level timestamp extraction, grapheme-to-phoneme conversion, Dynamic Time Warping (DTW) for robust word alignment, and phoneme-level Levenshtein distance scoring. A schwa deletion module is included for the Devanagari-script languages (Hindi and Marathi); it is designed to eliminate the impact of schwa characters on pronunciation scores.
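As an illustration of the phoneme-level Levenshtein distance scoring named above, the following is a minimal sketch and not the authors' implementation. The function names `phoneme_edit_distance` and `phoneme_score`, the unit edit costs, and the normalization by the longer sequence length are all assumptions made for this example; the abstract does not specify how the distance is turned into a score.

```python
from typing import Sequence


def phoneme_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences (unit costs assumed)."""
    # prev[j] holds the distance between the first i-1 reference phonemes
    # and the first j hypothesis phonemes.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # reference phoneme missing in hyp
            insertion = curr[j - 1] + 1  # extra phoneme in hyp
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def phoneme_score(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Map the edit distance to a score in [0, 1]; 1.0 means identical.

    Normalizing by the longer sequence is an assumption of this sketch.
    """
    if not ref and not hyp:
        return 1.0
    distance = phoneme_edit_distance(ref, hyp)
    return max(0.0, 1.0 - distance / max(len(ref), len(hyp)))


# Hypothetical IPA sequences for an expected and a spoken word.
expected = ["h", "ɛ", "l", "oʊ"]
spoken = ["h", "a", "l", "oʊ"]
print(phoneme_score(expected, spoken))
```

Working on phoneme tokens rather than characters means a multi-character IPA symbol such as "oʊ" counts as a single unit when it is substituted, inserted, or deleted.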

The framework follows a modular three-tier structure: a browser-based audio capture interface, a Flask-based REST API backend, and an extensible AI processing core built on interface-driven model abstractions. Audio signals are normalized and resampled before ASR inference to improve consistency across recording conditions, while DTW-based word alignment over a word distance matrix improves robustness against ASR variability. The experimental results show word alignment that remains stable under recognition noise, and phoneme-level accuracy scores that consistently discriminate between words of varying pronunciation quality. Word-level categorization and IPA visualization provide actionable pronunciation feedback for multilingual learning scenarios.
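The DTW-based word alignment over a word distance matrix can be sketched as follows. This is a hedged illustration under stated assumptions, not the framework's code: the helper names `word_distance` and `dtw_align` are hypothetical, and the choice of normalized character-level edit distance as the word distance is an assumption, since the abstract does not say how the matrix entries are computed.

```python
from typing import List, Tuple


def word_distance(a: str, b: str) -> float:
    """Normalized character edit distance in [0, 1] (assumed word metric)."""
    if not a and not b:
        return 0.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j - 1] + (ca != cb), prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1] / max(len(a), len(b))


def dtw_align(expected: List[str], recognized: List[str]) -> List[Tuple[int, int]]:
    """Return the DTW warping path as (expected_index, recognized_index) pairs."""
    n, m = len(expected), len(recognized)
    inf = float("inf")
    # Word distance matrix between every expected and recognized word.
    cost = [[word_distance(e, r) for r in recognized] for e in expected]
    # Accumulated cost with a padded first row and column.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1]
            )
    # Backtrack from the end, preferring the diagonal step on ties.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(acc[i - 1][j - 1], i - 1, j - 1),
                 (acc[i - 1][j], i - 1, j),
                 (acc[i][j - 1], i, j - 1)]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return path


# A misrecognized word ("quik") still aligns with its expected counterpart.
print(dtw_align(["the", "quick", "brown", "fox"], ["the", "quik", "brown", "fox"]))
```

Because DTW allows one expected word to pair with several recognized words (and vice versa), an inserted or split word in the ASR output shifts the path locally instead of misaligning every word that follows.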

International Journal on Advanced Computer Theory and Engineering
