Human vs Machine Mathematical Difficulty on Project Euler: An Experimental Analysis

Abstract

We study how the effort and success probability of frontier AI systems scale with human difficulty on problems from Project Euler, an online platform of computational mathematics problems. Our dataset, from the MathArena benchmark, consists of 3840 attempts across 50 problems and 26 model configurations, with problem difficulty measured by the site's public human solve times. Motivated by a proposal of Timothy Gowers, we test a power-law relation tmachine = a · thumanb between generated-token cost per successful answer and human time, and find b < 1 for 20 of the 25 models with usable fits, including the strongest base models; this operationalization therefore does not support an earlier prediction that machines scale worse than humans with difficulty. We also investigate whether success probability on the tested problems can be modeled by a simple exponential decay psuccess = ec thuman, predicting a linear relation between psuccess and thuman. Using a binning approach for data aggregation we find moderate empirical support (median bin-level R2 = 0.92 across the 22 best-covered configurations) for this model. Following METR, we also fit logistic success curves and extract 50\% task-length horizons h50; the strongest configurations in our 20 April 2026 snapshot reach roughly 2.5--4.3 hours on our fastest-five human baseline, with a log-linear fit through the state-of-the-art frontier giving a descriptive doubling time of about 75~days for the SOTA h50.

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…