
Abstract

Meeting the growing demand for internationally benchmarked English listening exams is difficult, especially in resource-constrained educational settings. Within a human–AI collaboration framework, this study investigates the feasibility of using generative artificial intelligence, specifically ChatGPT-4, to support the early development of English listening scripts and test items aligned with the Common European Framework of Reference for Languages (CEFR). Using an exploratory design, the study generated 20 listening scripts with matching multiple-choice questions across CEFR levels A2, B1, B2, and C1 through an iterative prompt-engineering technique, Progressive-Hint Prompting (PHP). The materials were examined with Text Inspector's descriptive linguistic metrics (lexical profile, readability, and script length), alongside qualitative assessments of spoken-discourse characteristics, topical coverage, and distractor plausibility. The results show that, when guided by structured prompts and ongoing human evaluation, ChatGPT-4 performs well as a drafting aid: the generated scripts demonstrated systematic linguistic variation across CEFR levels, particularly in lexical range and text complexity. Nevertheless, several drawbacks were noted, including uneven topical distribution, reduced pragmatic naturalness at higher proficiency levels, and inconsistent calibration of spoken-discourse features. Item quality required iterative refinement to ensure that distractors were text-based and aligned with the assessment criteria. These results imply that the quality of AI-generated listening materials is determined by iterative human–AI interaction rather than by automated generation alone. The study emphasizes the continuing need for professional human oversight while highlighting the potential of generative AI as a resource-efficient support tool for developing listening assessments. Future research should focus on empirical validation with test takers to investigate the efficacy of AI-assisted materials in operational assessment contexts.
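The iterative Progressive-Hint Prompting technique named above can be sketched as follows. This is a minimal illustration of the general PHP scheme (feeding the model's previous answer back as a hint until consecutive answers agree), not the authors' actual prompts or pipeline; `ask_model` is a hypothetical stand-in for any chat-completion call.

```python
def progressive_hint_prompt(task, ask_model, max_rounds=3):
    """Sketch of Progressive-Hint Prompting (PHP).

    Each round appends the previous answer as a hint to the prompt and
    re-queries the model, stopping early once two consecutive answers
    match (the PHP convergence criterion).
    """
    hints = []
    answer = None
    for _ in range(max_rounds):
        if hints:
            prompt = task + "\n(Hint: the answer is near " + ", ".join(hints) + ".)"
        else:
            prompt = task
        new_answer = ask_model(prompt)
        if new_answer == answer:  # converged: same answer twice in a row
            break
        answer = new_answer
        hints.append(str(answer))
    return answer
```

In a listening-assessment workflow, the "hint" would carry the previous draft script or item set, and a human reviewer would inspect each round's output before the next iteration, consistent with the human–AI collaboration framework described in the abstract.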

Keywords

Generative AI; ChatGPT-4; CEFR; listening assessment; prompt engineering; human–AI collaboration

Article Details

How to Cite
Wigati, F. A., Hakim, P. K., Pujiawati, N., & Rahmawati, M. (2026). Prompt engineering to CEFR alignment: Investigating generative AI for the creation of English listening assessments. Eduvelop: Journal of English Education and Development, 9(1), 329–339. https://doi.org/10.31605/eduvelop.v9i1.6207

