Bandit Online Optimization Over the Permutahedron
Abstract
The permutahedron is the convex polytope whose vertex set consists of the vectors $(\pi(1),\dots,\pi(n))$ for all permutations (bijections) $\pi$ over $\{1,\dots,n\}$. We study a bandit game in which, at each step $t$, an adversary chooses a hidden weight vector $s_t$, and a player chooses a vertex $\pi_t$ of the permutahedron and suffers an observed loss of $\sum_{i=1}^{n} \pi_t(i)\, s_t(i)$. A previous algorithm, CombBand of Cesa-Bianchi et al. (2009), guarantees a regret of $O(n\sqrt{T \log n})$ for a time horizon of $T$. Unfortunately, CombBand requires at each step an approximation of an $n$-by-$n$ matrix permanent, to an accuracy that must improve as $T$ grows. The resulting total running time is superlinear in $T$, which makes the algorithm impractical for large time horizons. We provide an algorithm with regret $O(n^{3/2}\sqrt{T})$ and total time complexity $O(n^3 T)$. The ideas are a combination of CombBand and a recent algorithm by Ailon (2013) for online optimization over the permutahedron in the full-information setting. The technical core is a bound on the variance of the "pseudo loss" of the Plackett-Luce noisy sorting process. The bound is obtained by establishing positive semi-definiteness of a family of 3-by-3 matrices generated from rational functions of exponentials of 3 parameters.
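To make the game concrete, the following is a minimal Python sketch of a single round: it draws a vertex of the permutahedron from a Plackett-Luce model and evaluates the loss $\sum_{i=1}^{n} \pi_t(i)\, s_t(i)$. The function names, the score vector `w`, and the convention that the first item drawn receives position 1 are assumptions made for illustration. The abstract names the Plackett-Luce process but does not specify how the algorithm parameterizes or updates it, so this should not be read as the paper's algorithm.

```python
import math
import random


def permutahedron_loss(pi, s):
    """Return sum_i pi(i) * s(i) for a vertex pi, given as a permutation of 1..n."""
    if sorted(pi) != list(range(1, len(s) + 1)):
        raise ValueError("pi must be a permutation of 1..n with n = len(s)")
    return sum(p * w for p, w in zip(pi, s))


def sample_plackett_luce(w, rng):
    """Draw a permutation from a Plackett-Luce model with scores exp(w[i]).

    Items are drawn one at a time without replacement, each with probability
    proportional to exp(w[i]) among the items still remaining. The k-th item
    drawn is assigned position k (an assumed convention).
    """
    remaining = list(range(len(w)))
    pi = [0] * len(w)
    for position in range(1, len(w) + 1):
        weights = [math.exp(w[i]) for i in remaining]
        chosen = rng.choices(remaining, weights=weights, k=1)[0]
        pi[chosen] = position
        remaining.remove(chosen)
    return pi


# One round of the game with hypothetical values.
rng = random.Random(0)
s_t = [0.5, 0.1, 0.9]            # hidden weight vector chosen by the adversary
pi_t = sample_plackett_luce([0.0, 0.0, 0.0], rng)  # equal scores: uniform draw
loss_t = permutahedron_loss(pi_t, s_t)  # the only feedback the player observes
```

In the bandit setting the player sees only `loss_t`, never `s_t` itself, which is why a variance bound on an estimated ("pseudo") loss is needed.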