Overcoming Unimodal Challenges: A Survey of Multi-Modal Fusion for Mobile Interfaces
DOI:
https://doi.org/10.65521/intjournalrecadvengtech.v14i3s.1685Keywords:
Abstract
The proliferation of mobile devices has spurred the development of interaction paradigms that extend beyond traditional touch inputs, catering to users with motor impairments and situations requiring hands-free operation. This paper presents a comprehensive survey of the primary modalities for hands-free mobile interaction: head pose estimation, eye-gaze tracking, and voice command recognition. We conduct a comparative analysis of the algorithmic evolution within each modality, tracing the progression from classical computer vision techniques to modern deep learning architectures. For head pose estimation, we evaluate the trade-offs between landmark-based and landmark-free methods, with a focus on lightweight models suitable for on-device deployment. For eye-gaze tracking, we compare model-based and appearance-based approaches, highlighting the critical role of large-scale datasets in achieving robustness. For voice, we analyze the performance characteristics of on-device versus cloud-based speech recognition and the architectural necessity of low-power keyword spotting. Furthermore, we analyze the synergistic potential of multimodal fusion as a solution to inherent unimodal challenges, most notably the "Midas Touch problem." By synthesizing findings from across the field, this survey provides a structured overview of the state of the art and identifies key considerations for designing the next generation of effective and accessible hands-free systems.
Downloads
Downloads
Published
How to Cite
Issue
Section
License

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.