Improving reproducibility of cheminformatics workflows with chembl-downloader

Abstract

Many modern cheminformatics workflows derive datasets from ChEMBL, but few of these datasets are published with the code used to generate them. Consequently, their methodologies (e.g., selection, filtering, aggregation) are opaque, reproduction is difficult, and interpretation of results lacks important context. Further, such static datasets quickly become out of date. For example, the current version of ChEMBL is v35 (as of December 2024), but ExCAPE-DB uses v20, Deep Confidence uses v23, the consensus dataset from Isigkeit et al. (2022) uses v28, and Papyrus uses v30. There is therefore a need for tools that provide reproducible bulk access to the latest (or a given) version of ChEMBL, so that researchers can make their derived datasets more transparent, updatable, and trustworthy. This article introduces `chembl-downloader`, a Python package for the reproducible acquisition, access, and manipulation of ChEMBL data through its FTP server. It can be downloaded under the MIT license from https://github.com/cthoyt/chembl-downloader and installed from PyPI with `pip install chembl-downloader`.
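
As a minimal sketch of how a derived dataset might be tied to an explicit ChEMBL version, the example below pairs a query with a small provenance record. It assumes the package's `latest()` and `query()` functions as shown in its README; the SQL text and the helper names `provenance` and `fetch` are hypothetical and only for illustration. The `chembl_downloader` import is deferred into `fetch` because the first call downloads the multi-gigabyte SQLite dump for the chosen version.

```python
import hashlib

# Illustrative query (an assumption, not from the article): a few named
# compounds with their SMILES strings from the ChEMBL SQLite dump.
SQL = """
SELECT md.chembl_id, cs.canonical_smiles
FROM molecule_dictionary AS md
JOIN compound_structures AS cs ON md.molregno = cs.molregno
WHERE md.pref_name IS NOT NULL
LIMIT 5
"""


def provenance(version: str, sql: str) -> dict:
    """Record which ChEMBL version and which exact query produced a dataset."""
    digest = hashlib.sha256(sql.encode("utf-8")).hexdigest()
    return {"chembl_version": version, "sql_sha256": digest}


def fetch(sql: str, version=None):
    """Run `sql` against a pinned ChEMBL version (hypothetical helper).

    If `version` is None, the latest release is looked up once and then
    recorded, so the result can be regenerated against the same version later.
    """
    import chembl_downloader  # deferred: first use triggers a large download

    if version is None:
        version = chembl_downloader.latest()
    df = chembl_downloader.query(sql, version=version)
    return df, provenance(version, sql)
```

Calling `fetch(SQL, version="35")` would return a pandas DataFrame together with a record that can be published alongside the dataset. Passing no version resolves the latest release at run time, and the returned record still states which release was used.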
