Sequence graphs realizations and ambiguity in language models
Abstract
Several popular language models represent local contexts in an input text x as bags of words. Such representations are naturally encoded by a sequence graph whose vertices are the distinct words occurring in x, with edges representing the (ordered) co-occurrence of two words within a sliding window of size w. However, this compressed representation is not generally bijective: some sequence graphs are ambiguous, admitting several realizations as a sequence, while others admit no realization at all. In this paper, we study the realizability and ambiguity of sequence graphs from a combinatorial and algorithmic point of view. We consider the existence and enumeration of realizations of a sequence graph under multiple settings: window size w, presence/absence of graph orientation, and presence/absence of weights (multiplicities). When w = 2, we provide polynomial-time algorithms for realizability and enumeration in all cases except the undirected/weighted setting, where we show the #P-hardness of enumeration. For w ≥ 3, we prove the hardness of all variants, even when w is considered as a constant, with the notable exception of the undirected/unweighted case, for which we propose XP algorithms for both problems, tight due to a corresponding W[1]-hardness result. We conclude with an integer program formulation to solve the realizability problem, and a dynamic programming algorithm to solve the enumeration problem on instances of moderate size. This work leaves open the membership in NP of both problems, a non-trivial question due to the existence of minimum realizations whose size is exponential in the instance encoding.
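As an illustration of the construction described in the abstract, the following is a minimal Python sketch, not the authors' implementation. The function name `sequence_graph` and the exact window convention are assumptions: here a window of size w is taken to link each word to the next w - 1 words, so that w = 2 connects consecutive words, and edge weights count how many such pairs occur.

```python
from collections import Counter

def sequence_graph(words, w=2, directed=True):
    """Build a (hypothetical) sequence graph from a list of words.

    Assumed convention: positions i < j are linked when j - i <= w - 1.
    Returns the vertex set and a Counter mapping each edge to its weight.
    """
    edges = Counter()
    for i, u in enumerate(words):
        # Look ahead at the next w - 1 positions inside the window.
        for j in range(i + 1, min(i + w, len(words))):
            v = words[j]
            # Undirected edges are stored with sorted endpoints.
            key = (u, v) if directed else tuple(sorted((u, v)))
            edges[key] += 1
    return set(words), edges

# Two different sequences that yield the same graph, i.e. an ambiguous graph.
g1 = sequence_graph("a b a".split(), w=2, directed=True)
g2 = sequence_graph("b a b".split(), w=2, directed=True)
print(g1 == g2)
```

Under this convention the sequences "a b a" and "b a b" produce identical graphs, which gives a small instance of the ambiguity the abstract refers to: one graph, several realizations.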