Accurate Chemistry Collection: Coupled cluster atomization energies for broad chemical space
Abstract
Accurate thermochemical data with sub-chemical accuracy (within 1 kcal mol⁻¹ of the empirical ground truth) are essential for advancing computational chemistry methods. However, existing datasets that reach this level of accuracy remain limited in size or scope. This hinders the development of data-driven methods with predictive accuracy across the broad chemical space of closed-shell, neutral molecules. Here we present the Microsoft Research Accurate Chemistry Collection (MSR-ACC) and its first release, MSR-ACC/TAE25, comprising 73,040 total atomization energies at the CCSD(T)/CBS level obtained with the W1-F12 thermochemical protocol. The dataset is constructed to exhaustively cover the chemical space of closed-shell, charge-neutral, covalently bound equilibrium molecular structures containing up to 5 non-hydrogen atoms drawn from elements up to argon and lacking significant multireference character. The dataset and its canonical train and validation splits are openly available on Zenodo in the QCSchema format under the CDLA Permissive 2.0 license. This first release of MSR-ACC enables data-driven approaches for developing predictive computational chemistry methods with unprecedented accuracy and scope.
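To make the two central quantities concrete, the following minimal sketch shows the standard definition of a total atomization energy (the sum of the isolated-atom energies minus the molecular energy) and a check against the 1 kcal mol⁻¹ threshold named above. The function names and all numerical inputs are hypothetical placeholders chosen for illustration; they are not taken from MSR-ACC/TAE25 and do not reflect how the dataset itself was computed.

```python
# Approximate conversion factor from hartree to kcal/mol.
HARTREE_TO_KCAL_PER_MOL = 627.5095


def total_atomization_energy(molecule_energy, atom_energies):
    """Return the sum of isolated-atom energies minus the molecular energy.

    All energies must be in the same unit; the result is in that unit.
    """
    return sum(atom_energies) - molecule_energy


def within_chemical_accuracy(predicted_kcal, reference_kcal, threshold_kcal=1.0):
    """True if the prediction lies within the threshold of the reference (kcal/mol)."""
    return abs(predicted_kcal - reference_kcal) <= threshold_kcal


if __name__ == "__main__":
    # Placeholder energies in hartree, NOT real values for any molecule.
    tae_hartree = total_atomization_energy(-3.5, [-1.0, -2.0])
    tae_kcal = tae_hartree * HARTREE_TO_KCAL_PER_MOL
    print(f"TAE: {tae_hartree} hartree = {tae_kcal:.2f} kcal/mol")

    # Hypothetical prediction compared with a hypothetical reference value.
    print(within_chemical_accuracy(100.4, 101.0))
```

In this sign convention a bound molecule has a positive atomization energy, so larger values correspond to stronger overall bonding.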