Small Language Models Like Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas

Bunzeck B, Duran D, Schade L, Zarrieß S (2024)
arXiv:2410.01487.

Preprint | English
 
Abstract / Note
Current language models use subword-based tokenization algorithms like Byte Pair Encoding, which put their validity as models of linguistic representations into question. In this paper, we explore the potential of tokenization-free, phoneme- and grapheme-based language models. We demonstrate that small models based on the Llama architecture can achieve strong linguistic performance on standard syntactic and novel lexical/phonetic benchmarks when trained with character-level vocabularies. We further show that phoneme-based models without any graphemic biases almost match grapheme-based models in standard tasks and novel evaluations. Our findings suggest a promising direction for creating more linguistically plausible language models that are better suited for computational studies of language acquisition and processing.
Year of publication
2024
Journal title
arXiv:2410.01487
Page URI
https://pub.uni-bielefeld.de/record/2993396

Cite

Bunzeck B, Duran D, Schade L, Zarrieß S. Small Language Models Like Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas. arXiv:2410.01487. 2024.
Bunzeck, B., Duran, D., Schade, L., & Zarrieß, S. (2024). Small Language Models Like Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas. arXiv:2410.01487
Bunzeck, Bastian, Duran, Daniel, Schade, Leonie, and Zarrieß, Sina. 2024. “Small Language Models Like Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas”. arXiv:2410.01487.
Bunzeck, B., Duran, D., Schade, L., and Zarrieß, S. (2024). Small Language Models Like Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas. arXiv:2410.01487.
Bunzeck, B., et al., 2024. Small Language Models Like Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas. arXiv:2410.01487.
B. Bunzeck, et al., “Small Language Models Like Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas”, arXiv:2410.01487, 2024.
Bunzeck, B., Duran, D., Schade, L., Zarrieß, S.: Small Language Models Like Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas. arXiv:2410.01487. (2024).
Bunzeck, Bastian, Duran, Daniel, Schade, Leonie, and Zarrieß, Sina. “Small Language Models Like Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas”. arXiv:2410.01487 (2024).
Sources

arXiv: 2410.01487
