Language corpora for the Dutch medical domain

Abstract

Background: Dutch medical corpora are scarce, limiting NLP development.
Methods: We translated English datasets, identified medical text in generic corpora, and extracted open Dutch medical resources.
Results: The resulting corpus comprises 35 billion tokens across the medical domain in about 100 million documents, freely available on Hugging Face.
Conclusion: This work establishes the first large-scale Dutch medical language corpus for pre-training and downstream NLP tasks.
