Small Language Models Like Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas
Bunzeck B, Duran D, Schade L, Zarrieß S (2024)
arXiv:2410.01487.
Preprint | English
Institution
Faculty of Linguistics and Literary Studies > Department of Linguistics
SFB 1646 “Linguistic Creativity in Communication” > Area A: Linguistic Creativity and the Sign: Productivity, Variability, Originality > A02: Creating Novel Phonetic Representations in Different Communicative Environments
Abstract / Note
Current language models use subword-based tokenization algorithms like Byte
Pair Encoding, which put their validity as models of linguistic representations
into question. In this paper, we explore the potential of tokenization-free,
phoneme- and grapheme-based language models. We demonstrate that small models
based on the Llama architecture can achieve strong linguistic performance on
standard syntactic and novel lexical/phonetic benchmarks when trained with
character-level vocabularies. We further show that phoneme-based models without
any graphemic biases almost match grapheme-based models in standard tasks and
novel evaluations. Our findings suggest a promising direction for creating more
linguistically plausible language models that are better suited for
computational studies of language acquisition and processing.
Year of publication
2024
Journal title
arXiv:2410.01487
Page URI
https://pub.uni-bielefeld.de/record/2993396
Cite
Bunzeck B, Duran D, Schade L, Zarrieß S. Small Language Models Like Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas. arXiv:2410.01487. 2024.
Bunzeck, B., Duran, D., Schade, L., & Zarrieß, S. (2024). Small Language Models Like Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas. arXiv:2410.01487
Bunzeck, Bastian, Duran, Daniel, Schade, Leonie, and Zarrieß, Sina. 2024. “Small Language Models Like Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas”. arXiv:2410.01487.
Bunzeck, B., Duran, D., Schade, L., and Zarrieß, S. (2024). Small Language Models Like Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas. arXiv:2410.01487.
Bunzeck, B., et al., 2024. Small Language Models Like Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas. arXiv:2410.01487.
B. Bunzeck, et al., “Small Language Models Like Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas”, arXiv:2410.01487, 2024.
Bunzeck, B., Duran, D., Schade, L., Zarrieß, S.: Small Language Models Like Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas. arXiv:2410.01487. (2024).
Bunzeck, Bastian, Duran, Daniel, Schade, Leonie, and Zarrieß, Sina. “Small Language Models Like Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas”. arXiv:2410.01487 (2024).