Overcoming Unimodal Challenges: A Survey of Multi-Modal Fusion for Mobile Interfaces

Harsh Wanjari; Danish Ayub Gaus; Uday Bhoyar; Komal K. Gehani; Sumit Prasad

doi:10.65521/intjournalrecadvengtech.v14i3s.1685

Authors

Harsh Wanjari, Department of Computer Engineering, St. Vincent Pallotti College of Engineering and Technology, Nagpur, India
Danish Ayub Gaus, Department of Computer Engineering, St. Vincent Pallotti College of Engineering and Technology, Nagpur, India
Uday Bhoyar, Department of Computer Engineering, St. Vincent Pallotti College of Engineering and Technology, Nagpur, India
Komal K. Gehani, Department of Computer Engineering, St. Vincent Pallotti College of Engineering and Technology, Nagpur, India
Sumit Prasad, Department of Computer Engineering, St. Vincent Pallotti College of Engineering and Technology, Nagpur, India

DOI:

https://doi.org/10.65521/intjournalrecadvengtech.v14i3s.1685

Keywords:

Human-Computer Interaction (HCI); Multimodal Interaction; Head Pose Estimation; Gaze Tracking; Voice Commands; Deep Learning; Mobile Accessibility; Midas Touch

Abstract

The proliferation of mobile devices has spurred the development of interaction paradigms that extend beyond traditional touch input, serving both users with motor impairments and situations that require hands-free operation. This paper presents a comprehensive survey of the primary modalities for hands-free mobile interaction: head pose estimation, eye-gaze tracking, and voice command recognition. We conduct a comparative analysis of the algorithmic evolution within each modality, tracing the progression from classical computer vision techniques to modern deep learning architectures. For head pose estimation, we evaluate the trade-offs between landmark-based and landmark-free methods, with a focus on lightweight models suitable for on-device deployment. For eye-gaze tracking, we compare model-based and appearance-based approaches, highlighting the critical role of large-scale datasets in achieving robustness. For voice, we analyze the performance characteristics of on-device versus cloud-based speech recognition and the architectural necessity of low-power keyword spotting. Finally, we examine multimodal fusion as a remedy for inherent unimodal challenges, most notably the "Midas Touch problem," in which a gaze-driven interface cannot tell whether the user is merely looking at a target or intends to select it. By synthesizing findings from across the field, this survey provides a structured overview of the state of the art and identifies key considerations for designing the next generation of effective and accessible hands-free systems.

