This report, produced under a Research in Reading Grant, explores how expert English for Academic Purposes (EAP) teachers evaluate reading texts and how their judgements compare with those of automated tools such as Lexile, Coh-Metrix, and ChatGPT. The study focuses on where human and automated assessments align and where they diverge.

As automated tools become more common in reading assessment, it’s important to understand how their evaluations compare with those of experienced teachers. Instead of treating a teacher’s judgement as a single score, this study looks at the many factors that shape expert judgement. The findings show that while automated tools can give a general sense of text difficulty, they often miss key elements that teachers notice. Used without human oversight, these tools can produce misleading judgements about whether a text is suitable for EAP reading tests, and may even undermine test validity.
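
To make concrete what a formula-based "general sense of text difficulty" looks like, the sketch below computes a few surface readability indices with the open-source textstat package. Lexile and Coh-Metrix themselves are proprietary or web-based services, so textstat serves only as an analogue here, and the sample text is invented rather than drawn from the study.

```python
# A minimal sketch (not the study's pipeline): surface readability indices
# computed with the open-source textstat package, as an analogue to
# formula-based measures such as Lexile. The sample text is invented.
import textstat

sample_text = (
    "Photosynthesis converts light energy into chemical energy. "
    "Plants use this process to synthesise glucose from carbon dioxide and water."
)

# These indices rely on surface features such as sentence length and word
# length or frequency; they do not model topic familiarity or abstractness,
# which is exactly where the report finds teachers' judgements diverge.
print("Flesch Reading Ease:    ", textstat.flesch_reading_ease(sample_text))
print("Flesch-Kincaid grade:   ", textstat.flesch_kincaid_grade(sample_text))
print("Dale-Chall score:       ", textstat.dale_chall_readability_score(sample_text))
print("Mean words per sentence:",
      textstat.lexicon_count(sample_text) / textstat.sentence_count(sample_text))
```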

The study also highlights ways Natural Language Processing (NLP) systems could be improved to better reflect expert human judgement.

Key findings:

  • Teachers’ views of grammatical (syntactic) complexity are closely tied to how they perceive vocabulary difficulty. When words are familiar or relevant to the topic, a text feels easier—even if the vocabulary is technically complex. Automated tools often overlook this.
  • Human and automated evaluations align most closely for straightforward 'benchmark' texts, but diverge when texts are abstract, highly subject-specific, or stylistically distinctive.
  • Even with training, ChatGPT has clear limits in accurately and consistently assigning CEFR levels to texts (an illustrative prompt sketch follows this list).
  • The research highlights the need for context-aware, user-focused approaches to automated text analysis and shows how understanding expert teachers’ reasoning can guide the development of better tools in the future.
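
To picture the ChatGPT finding above, the sketch below shows the kind of one-shot CEFR assignment the study probed. It assumes the current OpenAI Python SDK; the model name, prompt wording and temperature setting are illustrative choices, not the study's actual procedure, which is not reproduced here.

```python
# A hedged sketch (not the study's prompts or setup): asking an OpenAI chat
# model to assign a CEFR level to a text. Model name, prompt wording and
# temperature are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def estimate_cefr(text: str, model: str = "gpt-4o-mini") -> str:
    """Return the model's CEFR estimate (A1-C2) for `text`."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # reduces, but does not eliminate, run-to-run variation
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an EAP reading specialist. Assign a single CEFR "
                    "level (A1, A2, B1, B2, C1 or C2) to the text and reply "
                    "with the level only."
                ),
            },
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip()

# Example call on an invented text; repeated calls on the same passage can
# still return different levels, which reflects the consistency limits the
# study reports.
# print(estimate_cefr("Urbanisation has reshaped patterns of labour migration ..."))
```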

Citation

Unaldi, A., & Ateş, B. (2025). Evaluating text suitability for EAP reading assessment: Teachers versus Lexile, Coh-Metrix and ChatGPT. British Council. https://doi.org/10.57884/zass-nw57