BEA 2026 Shared Task

Vocabulary Difficulty Prediction for English Learners

Motivation

Vocabulary is a crucial aspect of language knowledge, shaping what learners can understand and produce. Establishing the difficulty of vocabulary is therefore essential for creating level-appropriate content and developing valid, reliable assessment instruments. However, determining word difficulty still relies on labour-intensive processes involving expert judgment and costly pretesting, which limits scalability and slows innovation. As learning and assessment increasingly rely on digital platforms, the need for more efficient and scalable solutions is more pressing than ever.

While previous shared tasks have explored related problems such as Complex Word Identification (Paetzold and Specia, 2016; Yimam et al., 2018), Lexical Complexity Prediction (Shardlow et al., 2021) and Lexical Simplification (Shardlow et al., 2024), they were not designed with English language learners in mind and did not explore the influence of the learner’s L1 on L2 vocabulary difficulty. What is more, BEA has not hosted a language learning challenge since the Grammatical Error Correction shared task in 2019, leaving a significant gap at a time when advances in AI have transformed what is possible in educational NLP.

As a result, we believe the time is right for an L1-Aware Vocabulary Difficulty Prediction shared task, and BEA 2026 is the ideal venue to host it. This task would not only establish a common benchmark for researchers but also serve as a critical testbed to evaluate how well state-of-the-art NLP models perform on a problem that has traditionally required psychometric calibration methods. The findings from this shared task will play a crucial role in the development of AI-powered solutions for item writing, content generation, adaptive testing, and personalised vocabulary learning, laying the foundation for the next generation of language learning and assessment systems.

Task description

The BEA 2026 shared task aims to advance research into vocabulary difficulty prediction for learners of English with diverse L1 backgrounds, an essential step towards custom content creation, computer-adaptive testing and personalised learning. In a context where traditional item calibration methods have become a bottleneck for the implementation of digital learning and assessment systems, we believe predictive NLP models can provide a more scalable, cost-effective solution.

The goal of this shared task is to build regression models that predict the difficulty of English words given a learner's L1. We believe this task offers a novel, multidimensional perspective on vocabulary modelling that has not been explored in previous work. To this end, we will use the British Council's Knowledge-based Vocabulary Lists (KVL), a multilingual dataset with psychometrically calibrated difficulty scores. This unique dataset is not only an invaluable contribution to the NLP community but also a powerful resource that will enable in-depth investigations into how linguistic features, L1 background and contextual cues influence vocabulary difficulty.

The shared task includes the following two tracks:

Closed track

The Closed Track is designed to elicit models that rely primarily on the provided dataset and standard, publicly available NLP tools. The goal is to explore optimal model design under controlled data conditions.

Data: Systems may only use the provided training data for each corresponding L1. You may not add additional training data or combine data from different L1s.

Models, tools and databases: You may use publicly available, 'off-the-shelf' pre-trained transformer models (e.g. BERT, RoBERTa, ELECTRA) and their embeddings, standard NLP tools (e.g. taggers, parsers, spellcheckers) and publicly available linguistic databases (e.g. WordNet, frequency lists) to derive features from the provided training data.

LLMs: Large Language Models are not allowed in this track due to their access to vast external knowledge and potential data leakage.
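To illustrate the kind of system the Closed Track permits, below is a minimal baseline sketch: it derives frozen multilingual BERT embeddings for the L1 context prompts and fits a ridge regressor on the GLMM difficulty scores. The file names are placeholders and the column names follow the data description further down this page; treat it as a starting point rather than a reference implementation.

import pandas as pd
import torch
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")
encoder.eval()

def embed(texts, batch_size=32):
    """Mean-pooled last-layer embeddings for a list of strings."""
    vecs = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = tokenizer(texts[i:i + batch_size], padding=True,
                              truncation=True, return_tensors="pt")
            hidden = encoder(**batch).last_hidden_state        # (B, T, H)
            mask = batch["attention_mask"].unsqueeze(-1)       # (B, T, 1)
            vecs.append((hidden * mask).sum(1) / mask.sum(1))  # mean over tokens
    return torch.cat(vecs).numpy()

train = pd.read_csv("train_es.csv")  # placeholder file name (Spanish L1 training split)
dev = pd.read_csv("dev_es.csv")      # placeholder file name (Spanish L1 development split)

model = Ridge(alpha=1.0).fit(embed(list(train["L1_context"])), train["GLMM_score"])
dev_preds = model.predict(embed(list(dev["L1_context"])))
print("Dev RMSE:", mean_squared_error(dev["GLMM_score"], dev_preds) ** 0.5)

Richer closed-track systems could add features derived from the other columns (e.g. en_target_pos, or frequency counts from a publicly available list) before regression.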

Open track

The Open Track is the 'anything goes' category, allowing for maximum flexibility and the use of external resources. The purpose of this track is to encourage creativity and explore the full potential of current AI technology.

Data: You may use any additional training data. This includes public corpora, proprietary datasets, synthetic data, or combining data from different L1s.

Models, tools and databases: No restrictions.

LLMs: LLMs are allowed in this track.

Within each track, participants can submit predictions for as many L1s as they wish, choosing from German (DE), Spanish (ES) and Mandarin (CN).

Data

The data for the shared task will be taken from the recently released 'Extended KVL Dataset for NLP', which was presented at BEA 2025 (Skidmore et al., 2025). This dataset is an adaptation of the British Council’s Knowledge-based Vocabulary Lists (KVL) (Schmitt et al., 2021, 2024), which were initially developed to collate difficulty rankings of English vocabulary for learners with Spanish, German and Mandarin L1 backgrounds.

To create the lists, the productive English language word knowledge of over 100,000 learners was assessed using items designed to test form-based recall of individual lemmas in a translation format (cf. Laufer and Goldstein, 2004).

Below is an example test item in Spanish, where learners were required to input the remainder of the target English word 'house' (the German and Mandarin versions had similar, yet distinct prompts):

Vivo en una casa grande que tiene tres dormitorios.

casa 

h _ _ _ _

From approximately 3.3 million test responses, difficulty estimates were derived separately for each L1 background by applying random-person random-item (RPRI) Rasch models (De Boeck, 2008) built within a generalised linear mixed model (GLMM) framework (Dunn, 2024). Further detail on the estimation of difficulty values for the KVL can be found in Schmitt et al. (2024).
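As broad background (this is a simplified schematic for illustration, not the exact specification used to produce the KVL estimates), an RPRI Rasch model treats both learner ability and item difficulty as random effects:

\mathrm{logit}\, P(y_{pi} = 1) = \theta_p - \beta_i, \qquad \theta_p \sim \mathcal{N}(0, \sigma_\theta^2), \qquad \beta_i \sim \mathcal{N}(\mu_\beta, \sigma_\beta^2)

where y_{pi} indicates whether learner p answered item i correctly; the estimated item effects \beta_i provide per-item difficulty values of the kind released here.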

                     Number of words   Number of L1s   Total instances
Training data                  6,091               3            18,273
Development data                 677               3             2,031
Test data                        N/A             N/A               N/A

All the data used in the shared task is publicly available under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). Each dataset contains unique English vocabulary test items with prompts in three L1s: Spanish, German and Mandarin. The training, development and test datasets are provided as a set of CSV files, one for each L1, which include the following data columns:

  • item_id: An ID number from 1 to 6,768. Items with the same item_id across different L1 files are parallel (i.e., refer to the same English target word).
  • L1: The L1 of the prompt (‘es’ for Spanish, ‘de’ for German, or ‘cn’ for Mandarin).
  • en_target_word: The English target word.
  • en_target_pos: The part of speech of the English target word.
  • en_target_clue: A partial-spelling clue of the English target word.
  • L1_source_word: The corresponding L1 source word(s).
  • L1_context: The L1 contextualising prompt.
  • GLMM_score: The GLMM difficulty estimate for the vocabulary test item, as reported by Schmitt et al. (2024). This is the target value that will be predicted.

The target variable for prediction is the GLMM_score.
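For illustration, the snippet below loads one of the training files and inspects its columns; the file name is a placeholder (see the repository linked further down for the actual data).

import pandas as pd

train_es = pd.read_csv("train_es.csv")  # placeholder name for the Spanish L1 training file
print(train_es.columns.tolist())
# Expected columns, as described above:
# ['item_id', 'L1', 'en_target_word', 'en_target_pos', 'en_target_clue',
#  'L1_source_word', 'L1_context', 'GLMM_score']
print(train_es.iloc[0])                 # one vocabulary test item and its difficulty score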

Submission

We encourage all prospective participants to join the BEA 2026 Shared Task Google Group, which serves as our primary hub for news, enquiries and community discussion.

To participate in the shared task, teams must submit their system output using the form that will be made available during the test phase.

Participating teams may submit up to three 'runs' per track and L1, allowing them to evaluate different system configurations. Submissions must be provided as a CSV file, with two columns:

  • item_id: the ID of the word in the test set for the corresponding L1, and
  • prediction: the predicted difficulty (GLMM score) of the word.

Please ensure your CSV files contain headers for compatibility with the evaluation script.

Example prediction file:

item_id,prediction
1,0.97
2,1.34
3,0.18
4,1.77
5,0.44
6,1.05
7,0.63
8,1.21
9,1.86
10,0.39
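One simple way to produce a file in this format, assuming your predictions are already in a list aligned with the test item IDs (the file and variable names below are placeholders):

import pandas as pd

test_item_ids = [1, 2, 3]          # item_id values taken from the test file for this L1
predictions = [0.97, 1.34, 0.18]   # your system's predicted GLMM scores, in the same order

submission = pd.DataFrame({"item_id": test_item_ids, "prediction": predictions})
submission.to_csv("run1_es.csv", index=False)  # to_csv keeps the header row by default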

All the data, baseline models and evaluation script can be found at https://github.com/britishcouncil/bea2026st

Evaluation

Submissions will be evaluated using Root Mean Squared Error (RMSE), for consistency with previous work. We will also report Pearson correlation for completeness.

Systems will be ranked based on RMSE, with leaderboards produced for each L1 and track.
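For reference, both reported metrics can be computed as follows (an illustrative sketch; the official evaluation script in the repository above is authoritative):

import numpy as np
from scipy.stats import pearsonr

def evaluate(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # primary metric, used for ranking
    r, _ = pearsonr(y_true, y_pred)                  # reported for completeness
    return rmse, r

print(evaluate([0.97, 1.34, 0.18], [1.05, 1.20, 0.30]))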

Results will be announced here once the test period has ended.

Important dates

All deadlines are 11:59pm UTC-12 (anywhere on Earth).

26 January: Training data release

20 March: Test data release

27 March: System submissions from teams due

3 April: Announcement of evaluation results by the organizers

24 April: System papers due

1 May: Paper reviews returned

12 May: Final camera-ready submissions

2-3 July: BEA 2026 workshop at ACL

Organizers

Mariano Felice (British Council)

Lucy Skidmore (British Council)

Please send any questions to vocabularychallenge@britishcouncil.org or post a new topic in our forum.

The British Council is the United Kingdom's international organisation for cultural relations and educational opportunities, with over 90 years of experience promoting English language learning and assessment worldwide. Operating in more than 100 countries, it is recognised as a global leader in English education and a founding partner of IELTS, one of the world's most trusted language proficiency tests.

References

Paul De Boeck. 2008. Random item IRT models. Psychometrika, 73(4):533–559.

Karen J. Dunn. 2024. Random-item Rasch models and explanatory extensions: A worked example using L2 vocabulary test item responses. Research Methods in Applied Linguistics, 3(3):100143.

Batia Laufer and Zahava Goldstein. 2004. Testing vocabulary knowledge: Size, strength, and computer adaptiveness. Language Learning, 54(3):399–436.

Gustavo Paetzold and Lucia Specia. 2016. SemEval 2016 task 11: Complex word identification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 560–569, San Diego, California. Association for Computational Linguistics.

Norbert Schmitt, Karen Dunn, Barry O’Sullivan, Laurence Anthony, and Benjamin Kremmel. 2021. Introducing Knowledge-based Vocabulary Lists (KVL). TESOL Journal, 12(4).

Norbert Schmitt, Karen Dunn, Barry O’Sullivan, Laurence Anthony, and Benjamin Kremmel. 2024. Knowledge-based Vocabulary Lists. University of Toronto Press, Toronto.

Matthew Shardlow, Fernando Alva-Manchego, Riza Batista-Navarro, Stefan Bott, Saul Calderon Ramirez, Rémi Cardon, Thomas François, Akio Hayakawa, Andrea Horbach, Anna Hülsing, Yusuke Ide, Joseph Marvin Imperial, Adam Nohejl, Kai North, Laura Occhipinti, Nelson Peréz Rojas, Nishat Raihan, Tharindu Ranasinghe, Martin Solis Salazar, and 3 others. 2024. The BEA 2024 shared task on the multilingual lexical simplification pipeline. In Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024), pages 571–589, Mexico City, Mexico. Association for Computational Linguistics.

Matthew Shardlow, Richard Evans, Gustavo Henrique Paetzold, and Marcos Zampieri. 2021. SemEval-2021 task 1: Lexical complexity prediction. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), pages 1–16, Online. Association for Computational Linguistics.

Lucy Skidmore, Mariano Felice, and Karen Dunn. 2025. Transformer architectures for vocabulary test item difficulty prediction. In Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025), pages 160–174, Vienna, Austria. Association for Computational Linguistics.

Seid Muhie Yimam, Chris Biemann, Shervin Malmasi, Gustavo Paetzold, Lucia Specia, Sanja Štajner, Anaïs Tack, and Marcos Zampieri. 2018. A report on the complex word identification shared task 2018. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 66–78, New Orleans, Louisiana. Association for Computational Linguistics.