A Thorough Literature Review on Automatic Speaker Diarization Employing Machine Learning and Deep Learning Methodologies

Sayyada Sara Banu; Ratnadeep R. Deshmukh; Jaypalsing N. Kayte

doi:10.65521/ijaece.v15i1S.1366

Authors

Sayyada Sara Banu, Dept. of CS and Information Technology, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad (MH), India
Ratnadeep R. Deshmukh, Dept. of CS and Information Technology, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad (MH), India
Jaypalsing N. Kayte, AI Lead, Tech Mahindra Ltd., Hi-Tech City, Hyderabad, Telangana, India

DOI:

https://doi.org/10.65521/ijaece.v15i1S.1366

Keywords:

Speaker Diarization; Neural Speaker Embeddings; Self-Supervised Speech Models; End-to-End Diarization; Speech Representation Learning

Abstract

Automatic Speaker Diarization (ASD) is the task of partitioning an audio recording into speaker-homogeneous regions, determining "who spoke when" without prior knowledge of the speakers' identities. It is an essential component of meeting transcription, conversational analytics, broadcast media indexing, forensic audio processing, call-center monitoring, and modern human-computer interaction systems. Over the past twenty years, diarization research has moved from traditional statistical approaches, such as Gaussian Mixture Models (GMMs) built on MFCC features and Bayesian Information Criterion (BIC) segmentation, to more advanced representation learning methods such as i-vectors and Probabilistic Linear Discriminant Analysis (PLDA). Subsequent advances in deep learning produced robust neural embeddings such as x-vectors and ECAPA-TDNN, which made speakers much easier to distinguish in challenging acoustic conditions. The most recent Self-Supervised Learning (SSL) models, such as Wav2Vec 2.0, HuBERT, and WavLM, have set new standards by learning robust speech representations from unlabeled audio. Complementary methods, including End-to-End Neural Diarization (EEND), UIS-RNN, and VB-HMM re-segmentation, have improved the handling of overlapping speech and the temporal refinement of speaker boundaries.

This review offers a thorough examination of recent advances, evaluating the strengths and weaknesses of prominent diarization methods, identifying persistent research gaps, and outlining future directions for accurate, multilingual, and real-time speaker diarization systems.
