Abstract
Meeting the increasing demand for internationally benchmarked English listening exams is difficult, especially in educational settings with limited resources. Within a human–AI collaboration framework, this study investigates the feasibility of using generative artificial intelligence, specifically ChatGPT-4, to support the early development of English listening scripts and test items aligned with the Common European Framework of Reference for Languages (CEFR). Using an exploratory research design, the study generated 20 listening scripts and matching multiple-choice questions across CEFR levels A2, B1, B2, and C1 through an iterative prompt engineering technique, Progressive-Hint Prompting (PHP). The generated materials were examined with Text Inspector's descriptive linguistic metrics (lexical profile, readability, and script length), complemented by qualitative assessments of spoken discourse characteristics, topical coverage, and distractor plausibility. The results show that, when guided by structured prompts and ongoing human evaluation, ChatGPT-4 can perform well as a drafting aid. The created scripts demonstrated systematic linguistic variation across CEFR levels, particularly in lexical range and text complexity. Nevertheless, several drawbacks were noted, including uneven topical distribution, reduced pragmatic naturalness at higher proficiency levels, and inconsistent calibration of spoken discourse features. Item quality required iterative refinement to ensure that distractors were text-based and aligned with assessment criteria. These results imply that iterative human–AI interaction, rather than automated generation alone, determines the quality of AI-generated listening materials. The study highlights the potential of generative AI as a resource-efficient support tool for developing listening assessments while emphasizing the ongoing importance of professional human oversight. Future research should focus on empirical validation with test takers to investigate the efficacy of AI-assisted materials in operational assessment contexts.
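To illustrate the kind of iterative workflow the abstract describes, the sketch below adapts the Progressive-Hint Prompting idea (Zheng et al., 2023) to listening-script drafting with the OpenAI chat API. It is a minimal illustration, not the study's protocol: the model name, prompt wording, word count, number of rounds, and stopping rule are all assumptions introduced here for demonstration.

```python
# Illustrative sketch of Progressive-Hint Prompting (PHP; Zheng et al., 2023)
# applied to CEFR-aligned listening-script drafting. All prompt text, the
# model name, and the stopping rule are assumptions, not the study's protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def draft_listening_script(level: str, topic: str, rounds: int = 3) -> str:
    """Iteratively refine a listening script toward a target CEFR level.

    Each round feeds the previous draft back into the prompt as a 'hint'
    and asks the model to revise it -- the core PHP move, adapted here
    from numeric answers to open-ended text.
    """
    base_task = (
        f"Write a 200-word English listening script at CEFR level {level} "
        f"on the topic '{topic}'. Use spoken discourse features (fillers, "
        "contractions, backchannels) appropriate to that level."
    )
    hint = None
    draft = ""
    for _ in range(rounds):
        prompt = base_task if hint is None else (
            f"{base_task}\n\nHint (your previous draft):\n{hint}\n\n"
            "Revise the draft: keep vocabulary and sentence complexity "
            f"within CEFR {level} and improve pragmatic naturalness."
        )
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        draft = response.choices[0].message.content
        # PHP stops when consecutive answers converge; exact equality is
        # rare for free text, so `rounds` also caps the loop.
        if hint is not None and draft.strip() == hint.strip():
            break
        hint = draft
    return draft


print(draft_listening_script("B1", "booking a hotel room"))
```

In the study's design, each round would additionally pass through human evaluation (e.g., checking lexical profile and distractor plausibility) before the draft is fed back as a hint; the fixed loop above stands in for that human-in-the-loop step.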
Article Details

This work is licensed under a Creative Commons Attribution 4.0 International License.
References
- Aryadoust, V., & Luo, L. (2023). The typology of second language listening constructs: A systematic review. Language Testing, 40(2), 375–409. https://doi.org/10.1177/02655322221126604
- Aryadoust, V., Zakaria, A., & Jia, Y. (2024). Investigating the affordances of OpenAI’s large language model in developing listening assessments. Computers and Education: Artificial Intelligence, 6, 100204. https://doi.org/10.1016/j.caeai.2024.100204
- Clark, H. H., & Fox Tree, J. E. (2002). Using uh and um in spontaneous dialog. Cognition, 84(1), 73–111. https://doi.org/10.1016/S0010-0277(02)00017-3
- Coleman, H., Ahmad, N. F., Hadisantosa, N., Kuchah, K., Lamb, M., & Waskita, D. (2024). Common sense and resistance: EMI policy and practice in Indonesian universities. Current Issues in Language Planning, 25(1), 23–44. https://doi.org/10.1080/14664208.2023.2205792
- Field, J. (2008). Listening in the language classroom. Cambridge University Press.
- Jiang, Y., et al. (2024). Evaluating the critical thinking of large language models: Insights and limitations. Journal of Pacific Rim Psychology. https://doi.org/10.1177/18344909251406111
- McKinley, J., & Rose, H. (2017). The Routledge handbook of English language teaching. Routledge.
- Mead, A. D., & Zhou, C. (2023). Evaluating the quality of AI-generated items for a certification exam. Journal of Applied Testing Technology, 24(Special Issue), 1–14.
- Nasr, N. R., Tu, C.-H., Werner, J., Bauer, T., Yen, C.-J., & Sujo-Montes, L. (2025). Exploring the impact of generative AI ChatGPT on critical thinking in higher education: Passive AI-directed use or human–AI supported collaboration? Education Sciences, 15(9), 1198. https://doi.org/10.3390/educsci15091198
- Nurhayati, N., Setiawaty, P. W., & Nur, S. (2024). EFL teachers’ challenges in designing assessment material for students’ listening skills. ENGLISH FRANCA: Academic Journal of English Language and Education, 8(2), 409–422. https://doi.org/10.29240/ef.v8i2.12053
- OpenAI. (2023). ChatGPT (GPT-4 version) [Large language model]. https://chat.openai.com
- Peláez-Sánchez, I. C., Velarde-Camaqui, D., & Glasserman-Morales, L. D. (2024). The impact of large language models on higher education: Exploring the connection between AI and Education 4.0. Frontiers in Education, 9, 1392091. https://doi.org/10.3389/feduc.2024.1392091
- Richardson, A. (2022). Advances in OpenAI’s GPT-3 applications. International Journal of Artificial Intelligence and Machine Learning in Engineering, 21(1), 742–749.
- Sawaki, Y., Kim, H.-J., & Gentile, C. (2009). Q-matrix construction: Defining the link between constructs and test items in large-scale reading and listening comprehension assessments. Language Assessment Quarterly, 6(3), 190–209. https://doi.org/10.1080/15434300902801917
- Sung, H., Chang, T., & Huang, J. (2015). Factors affecting item difficulty in English listening comprehension tests. Universal Journal of Educational Research, 3(7), 451–459. https://doi.org/10.13189/ujer.2015.030704
- Taylor, L., & Geranpayeh, A. (2011). Assessing listening for academic purposes: Defining and operationalising the test construct. Journal of English for Academic Purposes, 10(2), 89–101. https://doi.org/10.1016/j.jeap.2011.03.002
- Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E. H., Le, Q. V., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS 2022). https://arxiv.org/abs/2201.11903
- Zheng, C., Liu, Z., Xie, E., Li, Z., & Li, Y. (2023). Progressive-hint prompting improves reasoning in large language models. arXiv. https://arxiv.org/abs/2304.09797
