Predictive Maintenance Using Deep Reinforcement Learning in Cloud Infrastructure Management
Abstract
The exponential growth of cloud computing infrastructures has intensified the need for intelligent predictive maintenance to ensure reliability and minimize downtime. This paper presents a novel Deep Reinforcement Learning (DRL)-based framework for predictive maintenance in cloud infrastructure management. The proposed system integrates telemetry-based data acquisition, deep policy learning, and orchestration-driven action execution to create an adaptive, self-healing maintenance ecosystem. Using telemetry data from simulated multi-node cloud environments, the DRL agent learns optimal maintenance policies that minimize failure risk while reducing operational costs. Comparative analysis against traditional models—Random Forest, LSTM, and Q-Learning—demonstrates the superior performance of the DRL approach, achieving 96.3% fault prediction accuracy, 42.1% downtime reduction, and 39.5% maintenance cost savings. The framework’s closed-loop architecture enables continuous learning and dynamic optimization, ensuring proactive fault mitigation and resource efficiency. Results highlight the framework’s scalability, adaptability, and real-time decision-making capability, confirming its potential to revolutionize predictive maintenance in cloud systems. Future work will extend the model to multi-agent and federated settings for distributed predictive intelligence in hybrid cloud environments.
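The closed-loop cycle the abstract describes (observe telemetry, choose a maintenance action, apply it, learn from the outcome) can be sketched in miniature. The paper's agent learns a deep policy network; for a compact, dependency-free illustration the sketch below substitutes a tabular Q-learner over a discretized node-health state. The environment dynamics, action set, and reward values are all hypothetical placeholders, not the paper's simulation.

```python
import random

N_STATES = 5  # discretized node-health levels: 0 = healthy .. 4 = failing
ACTIONS = ["no_op", "light_service", "full_maintenance"]  # hypothetical action set

def step(state, action, rng):
    """Hypothetical environment: maintenance restores health, inaction degrades it.
    Returns (next_state, reward); rewards trade off failure risk against
    maintenance cost, mirroring the objective stated in the abstract."""
    if action == 2:                       # full maintenance: reset health, high cost
        return 0, -5.0
    if action == 1:                       # light service: partial repair, moderate cost
        return max(0, state - 2), -2.0
    # no-op: stochastic wear; heavy penalty if the node reaches the failing state
    nxt = min(N_STATES - 1, state + (1 if rng.random() < 0.5 else 0))
    return nxt, (-50.0 if nxt == N_STATES - 1 else 0.0)

def train(episodes=500, alpha=0.2, gamma=0.95, eps=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * len(ACTIONS) for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        for _ in range(50):               # one maintenance horizon per episode
            if rng.random() < eps:        # epsilon-greedy exploration
                a = rng.randrange(len(ACTIONS))
            else:
                a = max(range(len(ACTIONS)), key=lambda i: q[s][i])
            s2, r = step(s, a, rng)
            # standard Q-learning temporal-difference update
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
    return q

q = train()
# The learned policy should intervene rather than idle on a near-failing node.
policy = [ACTIONS[max(range(len(ACTIONS)), key=lambda i: q[s][i])] for s in range(N_STATES)]
print(policy)
```

The DRL framework in the paper replaces the Q-table with a neural network conditioned on raw telemetry features, which is what allows it to scale beyond a handful of discretized states to multi-node cloud environments.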