Sequential Document Representations and Simplicial Curves

Abstract

The popular bag-of-words assumption represents a document as a histogram of word occurrences. While computationally efficient, such a representation cannot retain any sequential information. We present a continuous and differentiable sequential document representation that goes beyond the bag-of-words assumption yet remains efficient and effective. This representation employs smooth curves in the multinomial simplex to account for sequential information. We discuss the representation and its geometric properties and demonstrate its applicability to the task of text classification.
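
To make the contrast concrete, the following is a minimal sketch of the idea, not the construction used in the paper. The abstract does not say how the smooth curves are built, so the Gaussian position weighting, the `bandwidth` and `smoothing` parameters, and the function names `bag_of_words` and `simplicial_curve` are all illustrative assumptions. The sketch shows two documents with identical word histograms whose position-dependent histograms, each a point in the multinomial simplex, trace different curves.

```python
import math
from collections import Counter


def bag_of_words(tokens, vocab):
    """Normalized histogram of word occurrences: one point in the simplex."""
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocab)
    return [counts[w] / total for w in vocab]


def simplicial_curve(tokens, vocab, positions, bandwidth=0.2, smoothing=0.01):
    """Hypothetical sketch: one local word histogram per position mu in [0, 1].

    A token at normalized position t adds Gaussian weight
    exp(-(t - mu)^2 / (2 * bandwidth^2)) to the histogram at mu.
    The weighting scheme is an assumption, not taken from the abstract.
    Every token must appear in vocab.
    """
    n = len(tokens)
    index = {w: i for i, w in enumerate(vocab)}
    curve = []
    for mu in positions:
        # Start from a small constant so every coordinate stays positive.
        hist = [smoothing] * len(vocab)
        for j, tok in enumerate(tokens):
            t = (j + 0.5) / n  # normalized position of token j
            hist[index[tok]] += math.exp(-((t - mu) ** 2) / (2 * bandwidth ** 2))
        total = sum(hist)
        curve.append([h / total for h in hist])  # renormalize onto the simplex
    return curve


doc_a = "the cat chased the dog".split()
doc_b = "the dog chased the cat".split()
vocab = sorted(set(doc_a))

bow_a = bag_of_words(doc_a, vocab)
bow_b = bag_of_words(doc_b, vocab)

positions = [0.0, 0.5, 1.0]
curve_a = simplicial_curve(doc_a, vocab, positions)
curve_b = simplicial_curve(doc_b, vocab, positions)

print(bow_a == bow_b)      # True: word order is lost in the histogram
print(curve_a == curve_b)  # False: the curves keep positional information
```

In this sketch, a very large `bandwidth` makes every local histogram approach the (smoothed) global one, so the curve collapses toward a single point and the representation falls back to bag of words; a small `bandwidth` keeps more positional detail.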
