Spatio-Temporal Deepfake Detection: A CNN-RNN Hybrid Approach for Image and Video Forgery Identification
Abstract
The proliferation of AI-generated synthetic media, commonly known as deepfakes, poses a serious threat to digital security, public trust, and information integrity. Although numerous detection frameworks exist, many generalize poorly to unseen manipulation methods, offer little interpretability, and perform badly on high-quality forgeries. This paper presents a spatio-temporal deepfake detection system that combines Convolutional Neural Networks (CNNs) for spatial artifact extraction with Recurrent Neural Networks (RNNs) using Long Short-Term Memory (LSTM) units for modeling temporal inconsistencies across frames. The system applies MTCNN-based face extraction followed by VGG16/VGG19 backbone networks for per-frame analysis. It is trained and evaluated on the Deepfake Detection Challenge (DFDC) dataset, with careful class balancing to counter dataset bias and the misleadingly high accuracy that a skewed class distribution can produce. Experimental results indicate that the hybrid spatio-temporal approach outperforms single-modality baselines and detects forgeries robustly across varying lighting conditions, compression levels, and face resolutions. The proposed framework lays the foundation for a real-time multimodal deepfake authentication system that integrates audio-visual synchronization analysis.
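The class-balancing concern raised in the abstract can be illustrated with a minimal sketch. It assumes random undersampling of the majority class, which is one common balancing strategy and not necessarily the one used in this work. The clip names, the 80/20 split, and the helper functions `undersample` and `accuracy` are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def undersample(samples, label_of, seed=0):
    """Randomly drop items from larger classes so that every class
    ends up with as many items as the smallest class."""
    rng = random.Random(seed)
    by_label = {}
    for sample in samples:
        by_label.setdefault(label_of(sample), []).append(sample)
    target = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical toy set: 80 fake clips and 20 real clips (illustrative ratio).
clips = [(f"clip_{i:03d}", "FAKE" if i < 80 else "REAL") for i in range(100)]
labels = [label for _, label in clips]

# A degenerate "detector" that always answers FAKE looks good on skewed data.
baseline_imbalanced = accuracy(["FAKE"] * len(clips), labels)

# After undersampling, the same detector drops to chance level.
balanced = undersample(clips, label_of=lambda clip: clip[1])
balanced_labels = [label for _, label in balanced]
baseline_balanced = accuracy(["FAKE"] * len(balanced), balanced_labels)

print(Counter(labels), baseline_imbalanced)
print(Counter(balanced_labels), baseline_balanced)
```

On the skewed toy set, the always-FAKE predictor reaches 0.8 accuracy without learning anything. On the balanced subset it falls to 0.5, which is why accuracy figures are easier to interpret when they are reported on balanced data.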