Establishing the difficulty of test items is an essential part of the language assessment development process. However, traditional item calibration methods are often time-consuming and difficult to scale. To address this, recent research has explored natural language processing (NLP) approaches for automatically predicting item difficulty from text. This paper investigates the use of transformer models to predict the difficulty of second language (L2) English vocabulary test items that have multilingual prompts. We introduce an extended version of the British Council’s Knowledge-based Vocabulary Lists (KVL), containing 6,768 English words paired with difficulty scores and question prompts written in Spanish, German, and Mandarin Chinese.
Using this dataset for fine-tuning, we explore various transformer-based architectures. Our findings show that a multilingual model jointly trained on all L1 subsets of the KVL achieves the best results, with analysis suggesting that the model is able to learn global patterns of cross-linguistic influence on target word difficulty. This study establishes a foundation for NLP-based item difficulty estimation using the KVL dataset, providing actionable insights for the development of multilingual test items.
Lucy Skidmore, Mariano Felice and Karen J. Dunn. (2025). Transformer Architectures for Vocabulary Test Item Difficulty Prediction. In Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025), Vienna, Austria. Association for Computational Linguistics.
Read the paper: Transformer architectures for vocabulary test item difficulty prediction (Adobe PDF)
View and download the dataset (ZIP file)