BEA 2026 Shared Task
Vocabulary Difficulty Prediction for English Learners
Motivation
Vocabulary is a crucial aspect of language knowledge, shaping what learners can understand and produce. Establishing the difficulty of vocabulary is therefore essential for creating level-appropriate content and developing valid, reliable assessment instruments. However, determining word difficulty still relies on labour-intensive processes involving expert judgment and costly pretesting, which limits scalability and slows innovation. As learning and assessment increasingly rely on digital platforms, the need for more efficient and scalable solutions is more pressing than ever.
While previous shared tasks have explored related problems such as Complex Word Identification (Paetzold and Specia, 2016; Yimam et al., 2018), Lexical Complexity Prediction (Shardlow et al., 2021) and Lexical Simplification (Shardlow et al., 2024), they were not designed with English language learners in mind and did not explore the influence of the learner’s L1 on L2 vocabulary difficulty. What is more, BEA has not hosted a language learning challenge since the Grammatical Error Correction shared task in 2019, leaving a significant gap at a time when advances in AI have transformed what is possible in educational NLP.
As a result, we believe the time is right for an L1-Aware Vocabulary Difficulty Prediction shared task, and BEA 2026 is the ideal venue to host it. This task would not only establish a common benchmark for researchers but also serve as a critical testbed to evaluate how well state-of-the-art NLP models perform on a problem that has traditionally required psychometric calibration methods. The findings from this shared task will play a crucial role in the development of AI-powered solutions for item writing, content generation, adaptive testing, and personalised vocabulary learning, laying the foundation for the next generation of language learning and assessment systems.
Task description
The BEA 2026 shared task aims to advance research into vocabulary difficulty prediction for learners of English with diverse L1 backgrounds, an essential step towards custom content creation, computer-adaptive testing and personalised learning. In a context where traditional item calibration methods have become a bottleneck for the implementation of digital learning and assessment systems, we believe predictive NLP models can provide a more scalable, cost-effective solution.
The goal of this shared task is to build regression models that predict the difficulty of English words given a learner's L1. This new shared task offers a novel, multidimensional perspective on vocabulary modelling that has not been explored in previous work. To this end, we will use the British Council's Knowledge-based Vocabulary Lists (KVL), a multilingual dataset with psychometrically calibrated difficulty scores. We believe this unique dataset is not only an invaluable contribution to the NLP community but also a powerful resource that will enable in-depth investigations into how linguistic features, L1 background and contextual cues influence vocabulary difficulty.
Tracks
The shared task includes the following two tracks:
- Closed track: Systems may use only the training data provided for the corresponding L1.
- Open track: Systems may combine data from different L1s in any way they choose, in addition to using any other publicly available training data.
Within each track, participants can submit predictions for as many of the three L1s as they wish: German (DE), Spanish (ES) and Mandarin (CN). Teams that submit predictions for all three L1s will also be evaluated on overall cross-L1 performance.
Baseline systems will be made available for comparison during development.
Data
The data for the shared task will be taken from the recently released 'Extended KVL Dataset for NLP' which was presented at BEA 2025 (Skidmore et al., 2025). This dataset is an adaptation of the British Council’s Knowledge-based Vocabulary Lists (KVL) (Schmitt et al., 2021, 2024), which were initially developed to collate difficulty rankings of English vocabulary for learners with L1 backgrounds of Spanish, German and Mandarin.
To create the lists, the productive English language word knowledge of over 100,000 learners was assessed using items designed to test form-based recall of individual lemmas in a translation format (cf. Laufer and Goldstein, 2004).
Below is an example test item in Spanish, where learners were required to input the remainder of the target English word 'house' (the German and Mandarin versions had similar, yet distinct prompts):
L1 context: Vivo en una casa grande que tiene tres dormitorios. ('I live in a big house that has three bedrooms.')
L1 source word: casa
Partial-spelling clue: h _ _ _ _
From approximately 3.3 million test responses, difficulty estimates were derived separately for each L1 background, applying random-person random-item (RPRI) Rasch models (De Boeck, 2008) built within a generalised linear mixed model (GLMM) framework (Dunn, 2024). Further detail on the estimation of difficulty values for the KVL can be found in Schmitt et al. (2024).
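As a simplified sketch of this modelling approach (the notation here is ours, not taken verbatim from the cited papers), an RPRI Rasch model treats the log-odds of a correct response as the difference between a person ability and an item difficulty, with both modelled as random effects:

```latex
\Pr(y_{pi} = 1) \;=\; \frac{\exp(\theta_p - \beta_i)}{1 + \exp(\theta_p - \beta_i)},
\qquad
\theta_p \sim \mathcal{N}(0, \sigma_\theta^2),
\qquad
\beta_i \sim \mathcal{N}(\mu_\beta, \sigma_\beta^2)
```

where $y_{pi}$ is learner $p$'s response to item $i$, $\theta_p$ is the learner's ability, and $\beta_i$ is the item's difficulty. The estimated item difficulties $\beta_i$ correspond (up to any reported transformation) to the GLMM difficulty scores released with the dataset.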
Links to the datasets used in the shared task will be shared soon:
- Training data: 6091 items per L1 (6091 x 3 = 18273 instances).
- Validation data: 677 items per L1 (677 x 3 = 2031 instances).
- Test data
All the data used in the shared task is released for public use under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). Each dataset contains unique English vocabulary test items with prompts in three L1s: Spanish, German and Mandarin. Each dataset is provided as a set of three CSV files, one for each L1, which include the following data columns:
- item_id: An ID number from 1 to 6,768. Items with the same item_id across different L1 files are parallel (i.e., refer to the same English target word).
- subset: Indicates whether the data is part of the ‘train’, ‘dev’ or ‘test’ splits used by Skidmore et al. (2025), for comparison.
- L1: The L1 of the prompt (‘es’ for Spanish, ‘de’ for German, or ‘cn’ for Mandarin).
- en_target_word: The English target word.
- en_target_pos: The part of speech of the English target word.
- en_target_clue: A partial-spelling clue of the English target word.
- L1_source_word: The corresponding L1 source word(s).
- L1_context: The L1 contextualising prompt.
- GLMM_score: The GLMM difficulty estimate for the vocabulary test item, as reported by Schmitt et al. (2024).
- difficulty_value: The linearly scaled transformation of the GLMM score, as described by Skidmore et al. (2025).
The target variable for prediction is the GLMM_score.
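To illustrate the expected data format, the following minimal sketch builds a toy frame with the columns listed above and fits a deliberately naive length-based regression baseline. The column names follow the description above; the rows themselves and the feature choice are invented for illustration and are not part of the actual KVL data.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy rows mimicking the CSV schema described above; all values are
# invented for illustration, not taken from the real dataset.
rows = [
    {"item_id": 1, "subset": "train", "L1": "es", "en_target_word": "house",
     "en_target_pos": "noun", "en_target_clue": "h _ _ _ _",
     "L1_source_word": "casa",
     "L1_context": "Vivo en una casa grande que tiene tres dormitorios.",
     "GLMM_score": -1.2},
    {"item_id": 2, "subset": "train", "L1": "es", "en_target_word": "window",
     "en_target_pos": "noun", "en_target_clue": "w _ _ _ _ _",
     "L1_source_word": "ventana",
     "L1_context": "Abre la ventana, por favor.",
     "GLMM_score": -0.4},
    {"item_id": 3, "subset": "train", "L1": "es", "en_target_word": "although",
     "en_target_pos": "conjunction", "en_target_clue": "a _ _ _ _ _ _ _",
     "L1_source_word": "aunque",
     "L1_context": "Salimos aunque llovia.",
     "GLMM_score": 0.9},
    {"item_id": 4, "subset": "train", "L1": "es",
     "en_target_word": "nevertheless", "en_target_pos": "adverb",
     "en_target_clue": "n _ _ _ _ _ _ _ _ _ _ _",
     "L1_source_word": "sin embargo",
     "L1_context": "Estaba cansado; sin embargo, siguio trabajando.",
     "GLMM_score": 1.7},
]
df = pd.DataFrame(rows)

# Naive baseline: predict the GLMM_score from target-word length alone.
X = df["en_target_word"].str.len().to_frame("length")
y = df["GLMM_score"]
model = LinearRegression().fit(X, y)
preds = model.predict(X)
print(preds.round(2))
```

In the closed track, such a model would be trained separately on the single provided file for each L1; in the open track, the three files could be concatenated or combined with external resources.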