A Multi-Survey Machine-Readable Corpus of Milky Way Globular Cluster Parameters for Retrieval-Augmented Generation Applications
Abstract
We present the Milky Way Globular Cluster Corpus v1.3.1, a unified machine-readable database of fundamental parameters for 174 Milky Way globular clusters assembled from four independent published surveys. Each cluster record integrates photometric, structural, and spectroscopically calibrated metallicity parameters from Harris (1996, 2010 revision), Gaia EDR3 proper motions from Vasiliev & Baumgardt (2021), N-body dynamical masses and orbital parameters from Baumgardt et al. (2023), and mean chemical abundances from the APOGEE DR17 globular cluster Value Added Catalog of Schiavon et al. (2024). The corpus contains 17,438 non-null data points across 174 clusters, stored in JSONL, JSON, and flat CSV formats with consistent native-typed fields (float, int, bool, null), embedded provenance blocks, and a fully documented schema. Survey coverage is 157/174 clusters for Harris photometry, 170/174 for Gaia EDR3 proper motions, 154/174 for Baumgardt N-body dynamics, and 72/174 for APOGEE DR17 chemistry. The corpus was designed as a Retrieval-Augmented Generation (RAG) knowledge base for large language model applications in astrophysics research, following the same multi-survey integration methodology as the Unified Galaxy HI Rotation Curve Corpus (Flynn 2026), and has been validated for structured context injection with instruction-following language models. It is equally suitable for traditional quantitative analyses, including orbit modeling, cluster classification, chemical tagging, and multi-survey cross-validation. The dataset is available on Zenodo, DOI: 10.5281/zenodo.19907766.
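As an illustration of how the JSONL format and the per-survey coverage figures relate, the following is a minimal Python sketch that reads JSONL records and counts how many clusters carry each survey block. The abstract does not give the schema, so the block names (`harris`, `gaia_edr3`, `baumgardt`, `apogee_dr17`), the record layout, and the two sample records are all hypothetical placeholders, not the corpus's actual keys or data.

```python
import io
import json

# Hypothetical block names; the real schema keys may differ.
SURVEY_BLOCKS = ("harris", "gaia_edr3", "baumgardt", "apogee_dr17")


def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def survey_coverage(records, blocks=SURVEY_BLOCKS):
    """Count records whose block for each survey is present and not null."""
    counts = {block: 0 for block in blocks}
    total = 0
    for record in records:
        total += 1
        for block in blocks:
            if record.get(block) is not None:
                counts[block] += 1
    return total, counts


# Two made-up records in the assumed layout (placeholder values, not corpus data).
SAMPLE = io.StringIO(
    '{"name": "Cluster A", "harris": {"feh": -1.0}, "gaia_edr3": {"pmra": 1.0}, '
    '"baumgardt": {"mass": 1.0e5}, "apogee_dr17": null}\n'
    '{"name": "Cluster B", "harris": null, "gaia_edr3": {"pmra": -2.0}, '
    '"baumgardt": null, "apogee_dr17": null}\n'
)

total, counts = survey_coverage(read_jsonl(SAMPLE))
for block, n in counts.items():
    print(f"{block}: {n}/{total} ({100 * n / total:.1f}%)")
```

If the real records follow a similar layout, the same loop run over the full file would be expected to reproduce the stated coverage, which corresponds to roughly 90.2% (157/174), 97.7% (170/174), 88.5% (154/174), and 41.4% (72/174) of clusters.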