Abstract
Meeting the increasing demand for internationally benchmarked English listening exams is difficult, especially in educational settings with limited resources. Within a human–AI collaboration framework, this study investigates the feasibility of using generative artificial intelligence, specifically ChatGPT-4, to support the early development of English listening scripts and test items aligned with the Common European Framework of Reference for Languages (CEFR). Using an exploratory research design, the study generated 20 listening scripts and matching multiple-choice questions across CEFR levels A2, B1, B2, and C1 through an iterative prompt engineering technique, Progressive-Hint Prompting (PHP). The generated materials were examined with Text Inspector's descriptive linguistic metrics (lexical profile, readability, and script length), complemented by qualitative assessments of spoken discourse characteristics, topical coverage, and distractor plausibility. The results show that, when guided by structured prompts and ongoing human evaluation, ChatGPT-4 can perform well as a drafting aid. The created scripts demonstrated systematic linguistic variation across CEFR levels, particularly in lexical range and text complexity. Nevertheless, several drawbacks were noted, including uneven topical distribution, reduced pragmatic naturalness at higher proficiency levels, and inconsistent calibration of spoken discourse features. Item quality required iterative refinement to ensure that distractors were text-based and aligned with assessment criteria. These results imply that iterative human–AI interaction, rather than automated generation alone, determines the quality of AI-generated listening materials. The study highlights the potential of generative AI as a resource-efficient support tool for developing listening assessments while emphasizing the ongoing importance of professional human oversight. Future research should focus on empirical validation with test takers to investigate the efficacy of AI-assisted materials in operational assessment contexts.
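To illustrate the kind of iterative workflow the abstract describes, the sketch below adapts the Progressive-Hint Prompting idea (Zheng et al., 2023) to listening-script drafting with the OpenAI chat API. It is a minimal illustration, not the study's protocol: the model name, prompt wording, word count, number of rounds, and stopping rule are all assumptions introduced here for demonstration.

```python
# Illustrative sketch of Progressive-Hint Prompting (PHP; Zheng et al., 2023)
# applied to CEFR-aligned listening-script drafting. All prompt text, the
# model name, and the stopping rule are assumptions, not the study's protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def draft_listening_script(level: str, topic: str, rounds: int = 3) -> str:
    """Iteratively refine a listening script toward a target CEFR level.

    Each round feeds the previous draft back into the prompt as a 'hint'
    and asks the model to revise it -- the core PHP move, adapted here
    from numeric answers to open-ended text.
    """
    base_task = (
        f"Write a 200-word English listening script at CEFR level {level} "
        f"on the topic '{topic}'. Use spoken discourse features (fillers, "
        "contractions, backchannels) appropriate to that level."
    )
    hint = None
    draft = ""
    for _ in range(rounds):
        prompt = base_task if hint is None else (
            f"{base_task}\n\nHint (your previous draft):\n{hint}\n\n"
            "Revise the draft: keep vocabulary and sentence complexity "
            f"within CEFR {level} and improve pragmatic naturalness."
        )
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        draft = response.choices[0].message.content
        # PHP stops when consecutive answers converge; exact equality is
        # rare for free text, so `rounds` also caps the loop.
        if hint is not None and draft.strip() == hint.strip():
            break
        hint = draft
    return draft


print(draft_listening_script("B1", "booking a hotel room"))
```

In the study's design, each round would additionally pass through human evaluation (e.g., checking lexical profile and distractor plausibility) before the draft is fed back as a hint; the fixed loop above stands in for that human-in-the-loop step.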
Article Details

This work is licensed under a Creative Commons Attribution 4.0 International License.
References
- Aryadoust, V., & Luo, L. (2023). The typology of second language listening constructs: A systematic review. Language Testing, 40(2), 375–409. https://doi.org/10.1177/02655322221126604
- Aryadoust, V., Zakaria, A., & Jia, Y. (2024). Investigating the affordances of OpenAI’s large language model in developing listening assessments. Computers and Education: Artificial Intelligence, 6, 100204. https://doi.org/10.1016/j.caeai.2024.100204
- Clark, H. H., & Fox Tree, J. E. (2002). Using uh and um in spontaneous dialog. Cognition, 84(1), 73–111. https://doi.org/10.1016/S0010-0277(02)00017-3
- Coleman, H., Ahmad, N. F., Hadisantosa, N., Kuchah, K., Lamb, M., & Waskita, D. (2024). Common sense and resistance: EMI policy and practice in Indonesian universities. Current Issues in Language Planning, 25(1), 23–44. https://doi.org/10.1080/14664208.2023.2205792
- Field, J. (2008). Listening in the language classroom. Cambridge University Press.
- Jiang, Y., et al. (2024). Evaluating the critical thinking of large language models: Insights and limitations. Journal of Pacific Rim Psychology. https://doi.org/10.1177/18344909251406111
- McKinley, J., & Rose, H. (2017). The Routledge handbook of English language teaching. Routledge.
- Mead, A. D., & Zhou, C. (2023). Evaluating the quality of AI-generated items for a certification exam. Journal of Applied Testing Technology, 24(Special Issue), 1–14.
- Nasr, N. R., Tu, C.-H., Werner, J., Bauer, T., Yen, C.-J., & Sujo-Montes, L. (2025). Exploring the impact of generative AI ChatGPT on critical thinking in higher education: Passive AI-directed use or human–AI supported collaboration? Education Sciences, 15(9), 1198. https://doi.org/10.3390/educsci15091198
- Nurhayati, N., Setiawaty, P. W., & Nur, S. (2024). EFL teachers’ challenges in designing assessment material for students’ listening skills. ENGLISH FRANCA: Academic Journal of English Language and Education, 8(2), 409–422. https://doi.org/10.29240/ef.v8i2.12053
- OpenAI. (2023). ChatGPT (GPT-4 version) [Large language model]. https://chat.openai.com
- Peláez-Sánchez, I. C., Velarde-Camaqui, D., & Glasserman-Morales, L. D. (2024). The impact of large language models on higher education: Exploring the connection between AI and Education 4.0. Frontiers in Education, 9, 1392091. https://doi.org/10.3389/feduc.2024.1392091
- Richardson, A. (2022). Advances in OpenAI’s GPT-3 applications. International Journal of Artificial Intelligence and Machine Learning in Engineering, 21(1), 742–749.
- Sawaki, Y., Kim, H.-J., & Gentile, C. (2009). Q-matrix construction: Defining the link between constructs and test items in large-scale reading and listening comprehension assessments. Language Assessment Quarterly, 6(3), 190–209. https://doi.org/10.1080/15434300902801917
- Sung, H., Chang, T., & Huang, J. (2015). Factors affecting item difficulty in English listening comprehension tests. Universal Journal of Educational Research, 3(7), 451–459. https://doi.org/10.13189/ujer.2015.030704
- Taylor, L., & Geranpayeh, A. (2011). Assessing listening for academic purposes: Defining and operationalising the test construct. Journal of English for Academic Purposes, 10(2), 89–101. https://doi.org/10.1016/j.jeap.2011.03.002
- Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E. H., Le, Q. V., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS 2022). https://arxiv.org/abs/2201.11903
- Zheng, C., Liu, Z., Xie, E., Li, Z., & Li, Y. (2023). Progressive-hint prompting improves reasoning in large language models. arXiv. https://arxiv.org/abs/2304.09797
