Open Problems in Constitutional Preference Reconstruction
Abstract
Pairwise preference data is widely used for training and evaluating language models (e.g., RLHF), but each datapoint records a choice, not the rationale behind it. Methods such as Inverse Constitutional AI (ICAI) attempt to improve interpretability by compressing datasets into short "constitutions" of natural-language principles. We argue this framing is under-specified: a flat list of principles is not yet an executable decision rule because it leaves principle composition implicit. We use the pairwise setting as a testbed to empirically characterize three open problems in constitutional methods. First, principle quality is hard to measure: coverage and accuracy are useful but incomplete proxies for end-to-end reconstruction. Second, composition is ambiguous: holding principles fixed, different executors (LLM judge versus majority vote) agree only 73% of the time. Third, constitutions differ between LLMs: cross-model vote agreement is 73%, whereas intra-model agreement is 81%. Across PRISM, AlpacaEval, and Chatbot Arena, we show that principle refinement (ICAI+) may be a first step towards ameliorating these problems: inter-executor agreement rises to 78%, and transparent executors match LLM judge accuracy (66% vs. 67%). Our results highlight that constitutions should be evaluated as constitution-executor systems, with implications for LLM-as-a-judge methods broadly.
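To make the composition problem concrete, the following is a minimal sketch of one transparent executor (majority vote over per-principle votes) and of an inter-executor agreement rate. The vote encoding (+1 for response A, -1 for response B, 0 for abstain), the tie handling, the function names, and the toy data are all assumptions made for illustration; the abstract does not specify how the paper implements either executor or computes agreement.

```python
from typing import Optional, Sequence


def majority_vote(votes: Sequence[int]) -> Optional[str]:
    """Combine per-principle votes on one response pair.

    Assumed encoding: +1 favors response A, -1 favors response B,
    0 means the principle abstains. Returns None on a tie.
    """
    total = sum(votes)
    if total > 0:
        return "A"
    if total < 0:
        return "B"
    return None  # tie, or every principle abstained


def agreement_rate(first: Sequence[Optional[str]],
                   second: Sequence[Optional[str]]) -> float:
    """Fraction of pairs on which two executors return the same decision."""
    if len(first) != len(second):
        raise ValueError("decision lists must have the same length")
    if not first:
        raise ValueError("decision lists must be non-empty")
    matches = sum(a == b for a, b in zip(first, second))
    return matches / len(first)


# Toy data (hypothetical): four response pairs, three principles each.
votes_per_pair = [[1, 1, -1], [-1, 0, -1], [1, -1, 0], [0, 0, 1]]
vote_decisions = [majority_vote(v) for v in votes_per_pair]

# Hypothetical decisions from a second executor, e.g. an LLM judge
# given the same principles as a prompt.
judge_decisions = ["A", "A", "B", "A"]

print(vote_decisions)
print(agreement_rate(vote_decisions, judge_decisions))
```

Even in this toy setting, the same principles yield different decisions depending on how ties and abstentions are resolved, which is one reason a list of principles alone does not determine a decision rule.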