Laminar: A Probe-First Scheduling Paradigm with Deterministic Runtime Survival
Abstract
In exascale-oriented GPU clusters, rigid-topology jobs leave behind a fragmented post-landing ecology in which long-resident workloads and highly transient tasks compete for unstable residual capacity. Existing centralized, hierarchical, and local-first decentralized schedulers incur growing coordination and retry-amplification costs in this regime and typically stop their explicit responsibility at execution start, leaving runtime survival to indiscriminate host-level OOM heuristics. We present Laminar, a decentralized probe-first, execute-later scheduling paradigm that keeps hot-path control-plane work near O(1) through Zone-level probabilistic flow splitting, bounded in-Zone probing by persistent lightweight agents, and node-local arbitration. Laminar further introduces Airlock, a bounded node-local runtime-survival layer that converts severe memory pressure into an ordered policy of suspension, in-situ recovery, bounded secondary re-addressing, or reclamation. By enforcing priority-ordered survival under pressure, Laminar enables lifecycle-aware scheduling that preserves high-value long-resident work and operates closer to physical saturation without relying on protocol-level overcommitment.
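The probe-first, execute-later flow described above can be pictured with a toy sketch. This is not Laminar's implementation: the abstract gives no algorithmic detail, so every name here (`Node`, `try_admit`, `pick_zone`, `probe_then_place`, `probe_budget`, the per-Zone weights, and the use of free memory as the admission criterion) is a hypothetical stand-in chosen for illustration. It only shows how one weighted draw over Zones, a bounded number of probes, and a node-local accept/reject decision might fit together.

```python
import random
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_gb: float  # residual memory this node currently reports (assumed metric)

    def try_admit(self, need_gb: float) -> bool:
        # Node-local arbitration (hypothetical rule): the node decides alone,
        # using only its own state, and reserves capacity if it accepts.
        if need_gb <= self.free_gb:
            self.free_gb -= need_gb
            return True
        return False


def pick_zone(zone_weights: dict, rng: random.Random) -> str:
    # Probabilistic flow splitting (assumed form): a single weighted draw
    # over Zones, with no global queue or cluster-wide scan.
    names = list(zone_weights)
    weights = [zone_weights[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]


def probe_then_place(zones: dict, zone_weights: dict, need_gb: float,
                     probe_budget: int, rng: random.Random):
    """Probe at most `probe_budget` nodes in one Zone; place on the first that accepts.

    Returns (zone_name, node_name) on success, or None if every probe is rejected.
    """
    zone = pick_zone(zone_weights, rng)
    nodes = zones[zone]
    # Bounded in-Zone probing: the number of contacted nodes is capped,
    # so per-task control-plane work does not grow with cluster size.
    for node in rng.sample(nodes, k=min(probe_budget, len(nodes))):
        if node.try_admit(need_gb):
            return zone, node.name
    return None


if __name__ == "__main__":
    rng = random.Random(0)
    zones = {"z0": [Node("a", 4.0), Node("b", 16.0)]}
    print(probe_then_place(zones, {"z0": 1.0}, need_gb=8.0, probe_budget=2, rng=rng))
```

In this sketch, a task that is rejected by all probed nodes simply returns `None`; what the real system does at that point (retry, re-split, or hand off) is not stated in the abstract and is left open here.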