Supporting Workflow Reproducibility by Linking Bioinformatics Tools across Papers and Executable Code

Abstract

Motivation: The rapid growth of biological data has intensified the need for transparent, reproducible, and well-documented computational workflows. Clearly connecting the steps of a workflow in its code with their description in a paper would improve workflow comprehension, support reproducibility, and facilitate reuse. Doing so requires linking the bioinformatics tools used in workflow code with their mentions in the published workflow description.

Results: We present CoPaLink, an automated approach that integrates three components: named entity recognition (NER) for identifying tool mentions in scientific text, NER for identifying tool mentions in workflow code, and entity resolution based on word embedding similarity. We propose approaches for all three steps. Evaluated on Nextflow workflows using Sentence-BERT, the individual steps achieve F1-measures of 77 to 90 and the joint pipeline achieves an accuracy of 66. CoPaLink leverages corpora of scientific articles and executable workflow code with curated tool annotations to bridge the gap between narrative descriptions and workflow implementations.

Availability: The code is available at https://gitlab.liris.cnrs.fr/sharefair/copalink-experiments and https://gitlab.liris.cnrs.fr/sharefair/copalink. The corpora are also available: CPL-Article (https://doi.org/10.5281/zenodo.20746904), CPL-Code (https://doi.org/10.5281/zenodo.20746970) and CPL-Gold-Entity-Resolution (https://doi.org/10.5281/zenodo.20746994).
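To make the entity-resolution step concrete, here is a minimal sketch of similarity-based linking between tool names found in code and tool mentions found in a paper. It is not CoPaLink's implementation: the abstract reports using Sentence-BERT embeddings, whereas this sketch substitutes character trigram count vectors so that it runs with the standard library alone. The function names (`char_ngrams`, `cosine`, `link_tools`), the threshold value of 0.3, and the example tool names are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Stand-in 'embedding': counts of character n-grams (NOT Sentence-BERT)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def link_tools(code_tools, paper_tools, threshold=0.3):
    """Link each code-side tool name to its most similar paper-side mention.

    The threshold is a hypothetical cut-off: a name whose best match scores
    below it is left unlinked (mapped to None).
    """
    links = {}
    for code_name in code_tools:
        vec = char_ngrams(code_name)
        scored = [(cosine(vec, char_ngrams(mention)), mention) for mention in paper_tools]
        best_score, best_mention = max(scored, default=(0.0, None))
        links[code_name] = best_mention if best_score >= threshold else None
    return links
```

Calling `link_tools(["samtools_sort", "fastqc", "multiqc"], ["SAMtools", "FastQC", "BWA-MEM"])` would link the first two names to their paper mentions and leave `multiqc` unlinked, since no mention is similar enough. Replacing `char_ngrams` with a real sentence-embedding model would keep the same nearest-match structure while capturing semantic rather than purely surface similarity.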
