The Isabelle Community Benchmark

Abstract

Choosing hardware for theorem proving is no simple task: automated provers are highly complex and optimized programs, often utilizing a parallel computation model, and there is little prior research on the hardware impact on prover performance. To alleviate the problem for Isabelle, we initiated a community benchmark where the build time of HOL-Analysis is measured. On 54 distinct CPUs, a total of 669 runs with different Isabelle configurations were reported by Isabelle users. Results range from 107s to over 11h. We found that current consumer CPUs performed best, with an optimal number of 8 to 16 threads, largely independent of heap memory. As for hardware parameters, CPU base clock affected multi-threaded execution most with a linear correlation of 0.37, whereas boost frequency was the most influential parameter for single-threaded runs (correlation coefficient 0.55); cache size played no significant role. When comparing our benchmark scores with popular high-performance computing benchmarks, we found a strong linear relationship with Dolfyn (R2 = 0.79) in the single-threaded scenario. Using data from the 3DMark CPU Profile consumer benchmark, we created a linear model for optimal (multi-threaded) Isabelle performance. When validating, the model has an average R2-score of 0.87; the mean absolute error in the final model corresponds to a wall-clock time of 46.6s. With a dataset of true median values for the 3DMark, the error improves to 37.1s.

0

Turn this paper into a lesson

ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…