Small Language Models Also Work With Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas

Bunzeck B, Duran D, Schade L, Zarrieß S (2025)
In: Proceedings of the 31st International Conference on Computational Linguistics. Rambow O, Wanner L, Apidianaki M, Al-Khalifa H, Eugenio BD, Schockaert S (Eds); Abu Dhabi, UAE: Association for Computational Linguistics: 6039-6048.

Conference paper | English
 
Editor(s)
Rambow, Owen; Wanner, Leo; Apidianaki, Marianna; Al-Khalifa, Hend; Di Eugenio, Barbara; Schockaert, Steven
Abstract / Note
Recent work investigates whether LMs learn human-like linguistic generalizations and representations from developmentally plausible amounts of data. Yet, the basic linguistic units processed in these LMs are determined by subword-based tokenization, which limits their validity as models of learning at and below the word level. In this paper, we explore the potential of tokenization-free, phoneme- and grapheme-based language models. We demonstrate that small models based on the Llama architecture can achieve strong linguistic performance on standard syntactic and novel lexical/phonetic benchmarks when trained with character-level vocabularies. We further show that phoneme-based models almost match grapheme-based models in standard tasks and novel evaluations. Our findings suggest a promising direction for creating more linguistically plausible language models that are better suited for computational studies of language acquisition and processing.
Year of publication
2025
Proceedings title
Proceedings of the 31st International Conference on Computational Linguistics
Page(s)
6039-6048
Conference
COLING 2025
Conference location
Abu Dhabi, UAE
Page URI
https://pub.uni-bielefeld.de/record/3000275

Cite

Bunzeck B, Duran D, Schade L, Zarrieß S. Small Language Models Also Work With Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas. In: Rambow O, Wanner L, Apidianaki M, Al-Khalifa H, Eugenio BD, Schockaert S, eds. Proceedings of the 31st International Conference on Computational Linguistics. Abu Dhabi, UAE: Association for Computational Linguistics; 2025: 6039-6048.
All files are available under the following license(s):
Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
Full text(s)
Access Level
OA Open Access
Last uploaded
2025-01-21T05:56:39Z
MD5 checksum
38d44fcef5e404b07a1389d7e735b245


