A Thorough Literature Review on Automatic Speaker Diarization Employing Machine Learning and Deep Learning Methodologies

Sayyada Sara Banu; Ratnadeep R. Deshmukh; Jaypalsing N. Kayte

doi:10.65521/ijaece.v15i1S.1366

Authors

Sayyada Sara Banu, Dept. of CS and Information Technology, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad (MH), India
Ratnadeep R. Deshmukh, Dept. of CS and Information Technology, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad (MH), India
Jaypalsing N. Kayte, AI Lead, Tech Mahindra Ltd., Hi-Tech City, Hyderabad, Telangana, India

DOI:

https://doi.org/10.65521/ijaece.v15i1S.1366

Keywords:

Speaker Diarization; Neural Speaker Embeddings; Self-Supervised Speech Models; End-to-End Diarization; Speech Representation Learning

Abstract

Automatic Speaker Diarization (ASD) is the process of partitioning an audio recording into speaker-homogeneous regions, determining "who spoke when" without prior knowledge of the speakers' identities. It is a necessary component of meeting transcription, conversational analytics, broadcast media indexing, forensic audio processing, call-center monitoring, and modern human-computer interaction systems. Over the past twenty years, diarization research has moved from traditional statistical approaches, such as Gaussian Mixture Models (GMMs) built on MFCC features and Bayesian Information Criterion (BIC) segmentation, to more advanced representation learning methods such as i-vectors and Probabilistic Linear Discriminant Analysis (PLDA). Later advances in deep learning produced robust neural embeddings such as x-vectors and ECAPA-TDNN, which made it much easier to distinguish speakers in difficult acoustic conditions. The most recent Self-Supervised Learning (SSL) models, such as Wav2Vec 2.0, HuBERT, and WavLM, have set new standards by learning strong speech representations from unlabeled audio. Complementary methods, including End-to-End Neural Diarization (EEND), UIS-RNN, and VB-HMM re-segmentation, have improved the handling of overlapping speech and the temporal refinement of segment boundaries.

This review offers a thorough examination of recent advances, evaluating the strengths and weaknesses of prominent diarization methods, identifying persistent research gaps, and outlining prospective directions for building accurate, multilingual, and real-time speaker diarization systems.
