BEA 2026 Shared Task

Vocabulary Difficulty Prediction for English Learners

Motivation

Vocabulary is a crucial aspect of language knowledge, shaping what learners can understand and produce. Establishing the difficulty of vocabulary is therefore essential for creating level-appropriate content and developing valid, reliable assessment instruments. However, determining word difficulty still relies on labour-intensive processes involving expert judgment and costly pretesting, which limits scalability and slows innovation. As learning and assessment increasingly rely on digital platforms, the need for more efficient and scalable solutions is more pressing than ever.

While previous shared tasks have explored related problems such as Complex Word Identification (Paetzold and Specia, 2016; Yimam et al., 2018), Lexical Complexity Prediction (Shardlow et al., 2021) and Lexical Simplification (Shardlow et al., 2024), they were not designed with English language learners in mind and did not explore the influence of the learner’s L1 on L2 vocabulary difficulty. What is more, BEA has not hosted a language learning challenge since the Grammatical Error Correction shared task in 2019, leaving a significant gap at a time when advances in AI have transformed what is possible in educational NLP.

As a result, we believe the time is right for an L1-Aware Vocabulary Difficulty Prediction shared task, and BEA 2026 is the ideal venue to host it. This task would not only establish a common benchmark for researchers but also serve as a critical testbed for evaluating how well state-of-the-art NLP models perform on a problem that has traditionally required psychometric calibration methods. The findings from this shared task will play a crucial role in the development of AI-powered solutions for item writing, content generation, adaptive testing and personalised vocabulary learning, laying the foundation for the next generation of language learning and assessment systems.

Task description

The BEA 2026 shared task aims to advance research into vocabulary difficulty prediction for learners of English with diverse L1 backgrounds, an essential step towards custom content creation, computer-adaptive testing and personalised learning. In a context where traditional item calibration methods have become a bottleneck for the implementation of digital learning and assessment systems, we believe predictive NLP models can provide a more scalable, cost-effective solution.

The goal of this shared task is to build regression models that predict the difficulty of English words for learners with a given L1. We believe this new shared task offers a novel, multidimensional perspective on vocabulary modelling that has not been explored in previous work. To this end, we will use the British Council's Knowledge-based Vocabulary Lists (KVL), a multilingual dataset with psychometrically calibrated difficulty scores. This unique dataset is not only an invaluable contribution to the NLP community but also a powerful resource that will enable in-depth investigations into how linguistic features, L1 background and contextual cues influence vocabulary difficulty.

Tracks

The shared task includes the following two tracks:

  • Closed track: Systems may only use the training data provided for the corresponding L1.
  • Open track: Systems may combine data from different L1s in any way they choose, in addition to using any other publicly available training data.

Within each track, participants can submit predictions for any or all of the three L1s: German (DE), Spanish (ES) and Mandarin (CN). Teams that submit predictions for all three L1s will also be evaluated on overall cross-L1 performance.

Baseline systems will be made available for comparison during development.

Data

The data for the shared task will be taken from the recently released 'Extended KVL Dataset for NLP' which was presented at BEA 2025 (Skidmore et al., 2025). This dataset is an adaptation of the British Council’s Knowledge-based Vocabulary Lists (KVL) (Schmitt et al., 2021, 2024), which were initially developed to collate difficulty rankings of English vocabulary for learners with L1 backgrounds of Spanish, German and Mandarin.

To create the lists, the productive English language word knowledge of over 100,000 learners was assessed using items designed to test form-based recall of individual lemmas in a translation format (cf. Laufer and Goldstein, 2004).

Below is an example test item in Spanish, where learners were required to input the remainder of the target English word 'house' (the German and Mandarin versions had similar, yet distinct, prompts):

Vivo en una casa grande que tiene tres dormitorios. [I live in a big house that has three bedrooms.]

casa

h _ _ _ _

From approximately 3.3 million test responses, difficulty estimates were derived separately for each L1 background, applying random-person random-item (RPRI) Rasch models (De Boeck, 2008) built within a generalised linear mixed model (GLMM) framework (Dunn, 2024). Further detail on the estimation of difficulty values for the KVL can be found in Schmitt et al. (2024).
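As a point of reference, the canonical RPRI Rasch model treats both person ability and item difficulty as random effects. A schematic form, following De Boeck (2008), is given below; the exact specification used for the KVL, including any explanatory extensions, is described in Schmitt et al. (2024) and Dunn (2024):

    \operatorname{logit} P(y_{pi} = 1) = \theta_p - \beta_i, \qquad
    \theta_p \sim \mathcal{N}(0, \sigma_\theta^2), \quad
    \beta_i \sim \mathcal{N}(\mu_\beta, \sigma_\beta^2)

where y_{pi} indicates whether learner p answered item i correctly, and the estimated item effects \beta_i yield the per-item difficulty values.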

Links to the datasets used in the shared task will be shared soon:

  • Training data: 6,091 items per L1 (6,091 × 3 = 18,273 instances).
  • Validation data: 677 items per L1 (677 × 3 = 2,031 instances).
  • Test data

All the data used in the shared task is available for public use according to the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). Each dataset contains unique English vocabulary test items with prompts in three L1s: Spanish, German and Mandarin. Each dataset is provided as a set of three CSV files, one for each L1, which include the following data columns:

  • item_id: An ID number from 1 to 6,768. Items with the same item_id across different L1 files are parallel (i.e., refer to the same English target word).
  • subset: Indicates whether the data is part of the ‘train’, ‘dev’ or ‘test’ splits used by Skidmore et al. (2025), for comparison.
  • L1: The L1 of the prompt (‘es’ for Spanish, ‘de’ for German, or ‘cn’ for Mandarin).
  • en_target_word: The English target word.
  • en_target_pos: The part of speech of the English target word.
  • en_target_clue: A partial-spelling clue of the English target word.
  • L1_source_word: The corresponding L1 source word(s).
  • L1_context: The L1 contextualising prompt.
  • GLMM_score: The GLMM difficulty estimate for the vocabulary test item, as reported by Schmitt et al. (2024).
  • difficulty_value: A linearly scaled transformation of the GLMM score, as described by Skidmore et al. (2025).

The target variable for prediction is the GLMM_score.
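As an illustration, the per-L1 CSV files can be loaded and aligned with standard tooling. A minimal sketch follows; the file names are hypothetical placeholders until the official links are released, and the column names follow the description above:

    import pandas as pd

    # Hypothetical file names; the actual download links will be
    # announced with the data release.
    l1_codes = ["es", "de", "cn"]
    train = {l1: pd.read_csv(f"kvl_train_{l1}.csv") for l1 in l1_codes}

    # One row per test item; GLMM_score is the regression target.
    features = ["en_target_word", "en_target_pos", "en_target_clue",
                "L1_source_word", "L1_context"]
    X = train["es"][features]
    y = train["es"]["GLMM_score"]

    # Items sharing an item_id across L1 files refer to the same English
    # target word, so frames can be aligned for cross-L1 analysis
    # (open track only).
    merged = train["es"].merge(train["de"], on="item_id",
                               suffixes=("_es", "_de"))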

Submission

Participating teams may submit up to three 'runs' per track and L1, allowing them to evaluate different system configurations. Submissions must be a plain text file with the predicted difficulty of each word on a new line, preserving the order of the test set. Submissions must be sent by email to vocabularychallenge@britishcouncil.org by the submission deadline.
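For instance, a minimal sketch of the expected submission format; the dummy values and file name here are purely illustrative:

    # One predicted difficulty per line, in the same order as the test set.
    # Replace the dummy values with your model's outputs; the file name
    # is illustrative only.
    predictions = [0.12, -1.53, 0.87]

    with open("teamname_closed_es_run1.txt", "w", encoding="utf-8") as f:
        for score in predictions:
            f.write(f"{score}\n")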

Evaluation

Submissions will be evaluated using Spearman's rank correlation coefficient (ρ), which measures the monotonic relationship between predicted and actual difficulty rankings. This metric is particularly well suited to psychometric tasks, where the relative ordering of items is more informative than their absolute values. Unlike Root Mean Squared Error (RMSE), commonly used in previous work, Spearman's ρ is bounded, more intuitive and easier to interpret (Skidmore et al., 2025). RMSE will still be reported for completeness and to enable comparison with prior work. Evaluation scripts will be provided to participants alongside the training data.
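The official evaluation scripts will be released with the training data; in the meantime, both metrics can be computed with standard libraries, as in this sketch (the gold and predicted values are illustrative):

    import numpy as np
    from scipy.stats import spearmanr

    # Illustrative gold and predicted difficulty values.
    y_true = np.array([0.40, -1.20, 0.90, 0.10])
    y_pred = np.array([0.50, -0.80, 1.10, -0.20])

    rho, _ = spearmanr(y_true, y_pred)  # primary metric
    rmse = float(np.sqrt(np.mean((y_pred - y_true) ** 2)))  # for comparison with prior work
    print(f"Spearman's rho: {rho:.3f}  RMSE: {rmse:.3f}")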

Results

Results will be announced here once the shared task has concluded.

Important dates

All deadlines are 11:59pm UTC-12 (anywhere on Earth).

20 January: Training data release

20 March: Test data release

27 March: System submissions from teams due

3 April: Announcement of evaluation results by the organisers

24 April: System papers due

1 May: Paper reviews returned

12 May: Final camera-ready submissions

2-3 July: BEA 2026 workshop at ACL

Organisers

Mariano Felice (British Council)

Lucy Skidmore (British Council)

The British Council is the United Kingdom's international organisation for cultural relations and educational opportunities, with over 90 years of experience promoting English language learning and assessment worldwide. Operating in more than 100 countries, it is recognised as a global leader in English education and a founding partner of IELTS, one of the world's most trusted language proficiency tests.

References

Paul De Boeck. 2008. Random item IRT models. Psychometrika, 73(4):533–559.

Karen J. Dunn. 2024. Random-item Rasch models and explanatory extensions: A worked example using L2 vocabulary test item responses. Research Methods in Applied Linguistics, 3(3):100143.

Batia Laufer and Zahava Goldstein. 2004. Testing vocabulary knowledge: Size, strength, and computer adaptiveness. Language Learning, 54(3):399–436.

Gustavo Paetzold and Lucia Specia. 2016. SemEval 2016 task 11: Complex word identification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 560–569, San Diego, California. Association for Computational Linguistics.

Norbert Schmitt, Karen Dunn, Barry O’Sullivan, Laurence Anthony, and Benjamin Kremmel. 2021. Introducing Knowledge-based Vocabulary Lists (KVL). TESOL Journal, 12(4).

Norbert Schmitt, Karen Dunn, Barry O’Sullivan, Laurence Anthony, and Benjamin Kremmel. 2024. Knowledge-based Vocabulary Lists. University of Toronto Press, Toronto.

Matthew Shardlow, Fernando Alva-Manchego, Riza Batista-Navarro, Stefan Bott, Saul Calderon Ramirez, Rémi Cardon, Thomas François, Akio Hayakawa, Andrea Horbach, Anna Hülsing, Yusuke Ide, Joseph Marvin Imperial, Adam Nohejl, Kai North, Laura Occhipinti, Nelson Peréz Rojas, Nishat Raihan, Tharindu Ranasinghe, Martin Solis Salazar, and 3 others. 2024. The BEA 2024 shared task on the multilingual lexical simplification pipeline. In Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024), pages 571–589, Mexico City, Mexico. Association for Computational Linguistics.

Matthew Shardlow, Richard Evans, Gustavo Henrique Paetzold, and Marcos Zampieri. 2021. SemEval-2021 task 1: Lexical complexity prediction. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), pages 1–16, Online. Association for Computational Linguistics.

Lucy Skidmore, Mariano Felice, and Karen Dunn. 2025. Transformer architectures for vocabulary test item difficulty prediction. In Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025), pages 160–174, Vienna, Austria. Association for Computational Linguistics.

Seid Muhie Yimam, Chris Biemann, Shervin Malmasi, Gustavo Paetzold, Lucia Specia, Sanja Štajner, Anaïs Tack, and Marcos Zampieri. 2018. A report on the complex word identification shared task 2018. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 66–78, New Orleans, Louisiana. Association for Computational Linguistics.