Do\u{g}al Dil \.I\c{s}lemede Tokenizasyon Standartlar{\i} ve \"Ol\c{c}\"um\"u: T\"urk\c{c}e \"Uzerinden B\"uy\"uk Dil Modellerinin Kar\c{s}{\i}la\c{s}t{\i}rmal{\i} Analizi

Savaş Yıldırım

doi:10.1109/SIU66497.2025.11112220

Dogal Dil \.Islemede Tokenizasyon Standartlar ve \"Olc\"um\"u: T\"urkce \"Uzerinden B\"uy\"uk Dil Modellerinin Karslastrmal Analizi

Abstract

Tokenization is a fundamental preprocessing step in Natural Language Processing (NLP), significantly impacting the capability of large language models (LLMs) to capture linguistic and semantic nuances. This study introduces a novel evaluation framework addressing tokenization challenges specific to morphologically-rich and low-resource languages such as Turkish. Utilizing the Turkish MMLU (TR-MMLU) dataset, comprising 6,200 multiple-choice questions from the Turkish education system, we assessed tokenizers based on vocabulary size, token count, processing time, language-specific token percentages (\%TR), and token purity (\%Pure). These newly proposed metrics measure how effectively tokenizers preserve linguistic structures. Our analysis reveals that language-specific token percentages exhibit a stronger correlation with downstream performance (e.g., MMLU scores) than token purity. Furthermore, increasing model parameters alone does not necessarily enhance linguistic performance, underscoring the importance of tailored, language-specific tokenization methods. The proposed framework establishes robust and practical tokenization standards for morphologically complex languages.

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Or compile a full topic from this idea

Discussion (0)

Sign in to join the discussion.

Loading comments…