VerifyThisBench: Generating Code, Specifications, and Proofs All at Once

Abstract

Large language models (LLMs) have demonstrated remarkable progress in code generation, but many existing benchmarks are approaching saturation and offer little guarantee on the trustworthiness of the generated programs. To improve visibility into model reasoning on formal correctness, we introduce VerifyThisBench, a new benchmark that evaluates end-to-end program verification from natural language descriptions: models must (i) extract formal specifications, (ii) implement in a verification-aware language, and (iii) construct machine checkable proofs. Our evaluation reveals that even state-of-the-art (SOTA) models, such as o3-mini, achieve a pass rate of less than 4%, with many outputs failing to compile. To isolate sources of difficulty, we further propose VerifyThisBenchXS, a relaxed variant in which partial implementations or proofs are provided. Across nine models and seven verification tools on both benchmarks, we observe consistent gains with feedback-driven refinement, but overall pass rates remain low, underscoring substantial gaps in formal reasoning. We release the benchmark and the unified evaluation environment to catalyze the verification capabilities for future models.

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…