Semantic insurance pricing with large language models
Abstract
Classical actuarial pricing models, such as the generalized linear model, are valued for transparency and ease of governance, but they capture interactions among risk factors only when these are supplied through explicit feature engineering. We study whether embeddings from a pre-trained large language model, computed from a natural-language description of each policyholder, can replace hand-crafted features as inputs to a standard actuarial pricing model, taking Poisson claim-frequency regression as the main example. The language model is used only to construct deterministic embedding covariates; pricing is performed by a standard generalized linear model. Using French motor third-party liability data, the embedding-based model outperforms the classical generalized linear model, especially when data are scarce, whereas at larger sample sizes the comparison is model- and dimension-dependent. Insurance-specific fine-tuning further improves the embeddings, and a prompt-sensitivity diagnostic shows that the pipeline reacts to any appended out-of-template field, making controlled prompts a governance requirement.
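To make the pipeline described in the abstract concrete, the following is a minimal sketch of its three stages: a fixed text template per policyholder, a deterministic embedding, and a Poisson claim-frequency regression with a log link and exposure offset. It is not the paper's implementation. The function `toy_embed` is a hypothetical hash-based stand-in for a language-model embedding (it carries no semantic information), the field names and the six policy records are invented for illustration, and the model is fitted by plain gradient ascent rather than a production GLM routine.

```python
import hashlib
import math

def describe(policy):
    """Render a record with a fixed template (the 'controlled prompt')."""
    return (f"Driver aged {policy['driver_age']} with a "
            f"{policy['vehicle_age']}-year-old {policy['fuel']} vehicle "
            f"in a {policy['area']} area.")

def toy_embed(text, dim=4):
    """HYPOTHETICAL stand-in for an LLM embedding: deterministic, hash-based."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i] / 255.0 - 0.5 for i in range(dim)]

def fit_poisson_glm(X, y, exposure, lr=0.05, n_iter=2000):
    """Poisson GLM, log link, log-exposure offset; gradient ascent on the
    mean log-likelihood. beta[0] is the intercept."""
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for xi, yi, ei in zip(X, y, exposure):
            eta = beta[0] + sum(b * v for b, v in zip(beta[1:], xi))
            resid = yi - ei * math.exp(eta)  # observed minus expected claims
            grad[0] += resid
            for j, v in enumerate(xi):
                grad[j + 1] += resid * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def predict_frequency(beta, x):
    """Expected claims per unit of exposure."""
    return math.exp(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))

# Illustrative, made-up records (NOT taken from the French MTPL data).
policies = [
    {"driver_age": 22, "vehicle_age": 1, "fuel": "diesel", "area": "urban"},
    {"driver_age": 45, "vehicle_age": 8, "fuel": "petrol", "area": "rural"},
    {"driver_age": 31, "vehicle_age": 3, "fuel": "petrol", "area": "urban"},
    {"driver_age": 67, "vehicle_age": 12, "fuel": "diesel", "area": "rural"},
    {"driver_age": 53, "vehicle_age": 5, "fuel": "diesel", "area": "urban"},
    {"driver_age": 38, "vehicle_age": 2, "fuel": "petrol", "area": "rural"},
]
claims = [1, 0, 0, 1, 0, 0]
exposure = [1.0, 0.5, 1.0, 0.8, 1.0, 0.6]  # policy-years

texts = [describe(p) for p in policies]
X = [toy_embed(t) for t in texts]          # embedding covariates
beta = fit_poisson_glm(X, claims, exposure)
fitted = [predict_frequency(beta, x) for x in X]

# Prompt-sensitivity check: an out-of-template field changes the covariates.
perturbed = toy_embed(texts[0] + " Favourite colour: blue.")
```

Keeping the embedding step deterministic and the pricing step a standard GLM mirrors the separation the abstract emphasises: the text template is the only place where input wording can move the covariates, which is why appending an out-of-template field (as in `perturbed`) is the natural thing to test.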