Learning Service Slowdown using Observational Data
Abstract
Being able to identify service slowdowns is crucial to many operational problems. We study how to use observational congestion data to learn service slowdown in a multi-server system that uses adaptive congestion control mechanisms. We show that a commonly used summary statistic that relies on the marginal congestion measured at individual servers can be highly inaccurate in the presence of adaptive congestion control. We propose a new statistic based on potential routing actions, and show that it provides a much more robust signal for server slowdown in these settings. Unlike the marginal statistic, the potential-action statistic aims to detect changes in the routing actions, and is able to uncover slowdowns even when they are not reflected in marginal congestion. Our results highlight the complexity of performing observational statistical analysis for service systems in the presence of adaptive congestion control. They also suggest that practitioners may want to combine multiple, orthogonal statistics to achieve reliable slowdown detection.