Self-Adaptive AI for Task Scheduling in Heterogeneous Computing
Abstract
The trend toward heterogeneous computing, in which CPUs, GPUs, and FPGAs work in concert, is not only about raw power but about building viable systems for applications such as AI training and data-intensive research. Yet scheduling tasks across these heterogeneous resources remains a challenging problem. Static techniques such as HEFT are vulnerable to unexpected and varying conditions; dynamic heuristics such as Min-Min generally converge to near-optimal solutions but lack true adaptability. Studies (e.g., [47, 56]) have shown that such approaches can incur performance inefficiencies exceeding 20% when workloads fluctuate, calling the robustness of existing methods into question. Reinforcement Learning (RL), despite its flaws, offers the prospect of learning on the fly rather than assuming stable conditions. DRL-based schedulers have reportedly reduced average completion times on large-scale GPU clusters to below 320 ms, compared with over 400 ms for HEFT. This difference translates into fewer wasted cycles and a lower cost per joule. However, RL models pose their own challenges: they are expensive to train and tend to lack transparency. This paper contends that self-adaptive AI scheduling is a promising direction, but one that must be pursued carefully and with realistic expectations.
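To make the Min-Min heuristic mentioned above concrete, the following is a minimal sketch of the standard algorithm: at each step it picks the task whose minimum completion time across all machines is smallest, assigns it, and updates that machine's ready time. The function name and the execution-time matrix are illustrative assumptions, not artifacts from the paper.

```python
def min_min_schedule(exec_times):
    """Min-Min scheduling heuristic (illustrative sketch).

    exec_times[t][m] = run time of task t on machine m.
    Returns (schedule, makespan), where schedule is a list of
    (task, machine) assignments in the order they were made.
    """
    n_tasks = len(exec_times)
    n_machines = len(exec_times[0])
    ready = [0.0] * n_machines          # time at which each machine becomes free
    unscheduled = set(range(n_tasks))
    schedule = []
    while unscheduled:
        best = None                     # (completion_time, task, machine)
        for t in unscheduled:
            for m in range(n_machines):
                ct = ready[m] + exec_times[t][m]
                if best is None or ct < best[0]:
                    best = (ct, t, m)
        ct, t, m = best
        ready[m] = ct                   # machine m is busy until the task finishes
        unscheduled.remove(t)
        schedule.append((t, m))
    return schedule, max(ready)
```

For example, with three tasks and two machines, `min_min_schedule([[3, 5], [2, 4], [6, 1]])` first assigns task 2 to machine 1 (completion time 1), then task 1 and task 0 to machine 0, giving a makespan of 5. A static heuristic like this recomputes nothing when runtimes drift, which is exactly the adaptability gap the abstract attributes to such methods.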
Article Details

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.