Presenter Information

Rabia Rabia

Location

https://www.kennesaw.edu/ccse/events/computing-showcase/fa25-cday-program.php

Document Type

Event

Start Date

24-11-2025 4:00 PM

Description

Learning-based schedulers such as Decima can optimize directed acyclic graph (DAG) workloads, yet their robustness under changing workload conditions is not well understood. This project evaluates how a Decima-trained policy transfers across different workload scenarios using an automated training and testing pipeline. Results show that the scheduler generalizes well to a workload with the same job scale, achieving a 1.9% improvement in average job completion time. Performance remains stable under a larger workload, but a shift in arrival pattern leads to an 83.7% increase in completion time and reduced fairness. These findings highlight both the potential and the limitations of learned scheduling policies, emphasizing the need for adaptive methods such as fine-tuning for reliable use in dynamic cluster environments.
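As a rough illustration of how figures like the ones above could be derived, the following minimal sketch computes the percent change in average job completion time between two runs, along with a fairness score. The abstract does not say which fairness metric the project used; Jain's fairness index is assumed here purely for illustration. The function names and sample values are hypothetical and are not taken from the project.

```python
from statistics import mean


def pct_change_avg_jct(baseline_jcts, test_jcts):
    """Percent change in average job completion time relative to the baseline.

    Negative values mean the test run finished jobs faster on average.
    """
    base = mean(baseline_jcts)
    return (mean(test_jcts) - base) / base * 100.0


def jain_index(values):
    """Jain's fairness index (assumed metric): 1.0 when all values are equal."""
    n = len(values)
    total = sum(values)
    return total * total / (n * sum(v * v for v in values))


# Hypothetical completion times in seconds, for illustration only.
baseline = [100.0, 120.0, 80.0]
shifted = [150.0, 200.0, 100.0]

change = pct_change_avg_jct(baseline, shifted)
fairness = jain_index(shifted)
print(f"change in average JCT: {change:+.1f}%")
print(f"Jain's index on shifted run: {fairness:.3f}")
```

Under this convention, a reported "1.9% improvement" would correspond to a change of about -1.9, and an "83.7% increase" to about +83.7.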


GRP-1231 Evaluating Generalization and Adaptation of Learning-Based Schedulers for Directed Acyclic Graph Workloads
