Open Problems in Constitutional Preference Reconstruction

Robert Mullins


Abstract

Pairwise preference data is widely used for training and evaluating language models (e.g., RLHF), but each datapoint records a choice, not the rationale behind it. Methods such as Inverse Constitutional AI (ICAI) attempt to improve interpretability by compressing datasets into short "constitutions" of natural-language principles. We argue this framing is under-specified: a flat list of principles is not yet an executable decision rule, because it leaves principle composition implicit. We use the pairwise setting as a testbed to empirically characterize three open problems in constitutional methods. First, principle quality is hard to measure: coverage and accuracy are useful but incomplete proxies for end-to-end reconstruction. Second, composition is ambiguous: holding principles fixed, different executors (LLM judge versus majority vote) agree only 73% of the time. Third, constitutions differ between LLMs: cross-model vote agreement is 73%, whereas intra-model agreement is 81%. Across PRISM, AlpacaEval, and Chatbot Arena, we show that principle refinement (ICAI+) may be a first step toward ameliorating these problems: inter-executor agreement rises to 78%, and transparent executors match LLM judge accuracy (66% vs. 67%). Our results highlight that constitutions should be evaluated as constitution-executor systems, with implications for LLMs-as-a-judge broadly.
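To make the composition problem concrete, here is a minimal Python sketch of one transparent executor (majority vote) and of an inter-executor agreement rate. It is an illustration under stated assumptions, not the paper's implementation. It assumes each principle casts a vote of "A", "B", or abstains (None) on a pair, and it takes agreement to be the fraction of pairs on which two executors pick the same response. The function names, the abstention rule, the tie-break, and the toy judge outputs are all hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence

# Hypothetical encoding: each principle votes "A", "B", or None (not applicable).
Vote = Optional[str]


def majority_vote(votes: Sequence[Vote], tie_break: str = "A") -> str:
    """Compose per-principle votes into one pairwise decision by simple majority.

    Abstentions (None) are ignored. Ties, including the case where every
    principle abstains, fall back to `tie_break` (an assumed rule).
    """
    counts = Counter(v for v in votes if v is not None)
    if counts["A"] > counts["B"]:
        return "A"
    if counts["B"] > counts["A"]:
        return "B"
    return tie_break


def agreement(decisions_x: Sequence[str], decisions_y: Sequence[str]) -> float:
    """Fraction of pairs on which two executors choose the same response."""
    if len(decisions_x) != len(decisions_y) or not decisions_x:
        raise ValueError("need two non-empty decision lists of equal length")
    same = sum(x == y for x, y in zip(decisions_x, decisions_y))
    return same / len(decisions_x)


# Toy usage: the same three principles executed two ways on four pairs.
principle_votes = [
    ["A", "A", None],   # pair 1: majority for A
    ["B", None, "B"],   # pair 2: majority for B
    ["A", "B", None],   # pair 3: tie, resolved by tie_break
    [None, None, "B"],  # pair 4: single applicable vote for B
]
vote_decisions = [majority_vote(v) for v in principle_votes]
judge_decisions = ["A", "B", "B", "B"]  # made-up LLM-judge outputs
print(vote_decisions)                              # ['A', 'B', 'A', 'B']
print(agreement(vote_decisions, judge_decisions))  # 0.75
```

The tie on pair 3 is exactly where an implicit composition rule matters: a different tie-break or an LLM judge reading the same principles can reach a different decision, which lowers agreement even though the constitution is unchanged.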
