Reinforcement Learning-Supervised LLM Question Generation from Educational Textbooks: A Comparative Study of Prompt Engineering and Post-Hoc Filtering

Fardani Annisa Damastuti; Agustinus Bimo Gumelar; Kenan Firmansyah

doi:10.48084/etasr.17900

Authors

Fardani Annisa Damastuti, Department of Creative Multimedia Technology, Electronic Engineering Polytechnic Institute of Surabaya, Surabaya, Indonesia
Agustinus Bimo Gumelar, Department of Informatics, School of Information Technology, Universitas Ciputra, Surabaya, Indonesia
Kenan Firmansyah, Independent Researcher

Volume: 16 | Issue: 3 | Pages: 35162-35170 | June 2026 | https://doi.org/10.48084/etasr.17900

Received: 1 February 2026 | Revised: 4 March 2026 and 16 March 2026 | Accepted: 18 March 2026 | Online: 6 June 2026

Corresponding author: Fardani Annisa Damastuti

Abstract

Large Language Models (LLMs) show promise for generating educational questions from textbook content. However, their outputs still need quality control before they can be used in classrooms. This study investigates how prompt constraint design affects the quality of LLM-generated questions and tests whether post-hoc filtering can improve it further. A total of 566 questions were generated from Indonesian elementary school textbooks using GPT-3.5-turbo and Gemini 2.0-flash under three prompt constraint levels (strict, medium, and loose). The experimental results indicate that prompt engineering is the most influential factor: strict prompts achieved 97.9% answer findability, whereas loose prompts reached only 72.8%, a gap of about 25 percentage points. In addition, a Reinforcement Learning (RL)-based supervisor was developed as a proof of concept and achieved 100% findability on the questions it accepted. Its performance was comparable to that of a simple rule-based verification method that checks whether the answer appears in the book. The findings suggest that the RL framework could be useful for more complex quality criteria in the future. The study also found that story problems are approximately 20% harder than factual questions, and that GPT-3.5-turbo outperformed Gemini 2.0-flash in findability (87.5% versus 84.1%), whereas Gemini 2.0-flash was better at matching the requested difficulty levels.
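The rule-based baseline mentioned in the abstract (checking whether the answer appears in the book) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `normalize`, `is_findable`, and `findability_rate`, the whole-token matching rule, and the sample text and question-answer pairs are all assumptions made for illustration.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Casefold, replace punctuation with spaces, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def is_findable(answer: str, source_text: str) -> bool:
    """True when the normalized answer occurs as whole tokens in the source.

    Assumption: "findable" means a verbatim (normalized) match; the paper
    may define the metric differently.
    """
    needle = normalize(answer)
    if not needle:
        return False
    # Pad with spaces so that partial-word hits are not counted.
    return f" {needle} " in f" {normalize(source_text)} "


def findability_rate(items, source_text: str) -> float:
    """Percentage of (question, answer) pairs whose answer is findable."""
    if not items:
        return 0.0
    hits = sum(is_findable(answer, source_text) for _, answer in items)
    return 100.0 * hits / len(items)


# Hypothetical textbook excerpt and generated question-answer pairs.
SOURCE = (
    "Ibu kota Indonesia adalah Jakarta. "
    "Air mendidih pada suhu 100 derajat Celsius."
)
ITEMS = [
    ("Apa ibu kota Indonesia?", "Jakarta"),
    ("Pada suhu berapa air mendidih?", "100 derajat Celsius"),
    ("Siapa presiden pertama Indonesia?", "Soekarno"),
]

print(f"{findability_rate(ITEMS, SOURCE):.1f}%")  # prints 66.7%
```

A filter built on `is_findable` accepts only questions whose answers can be located in the source, which is why findability measured on the accepted subset is 100% by construction. Exact matching of this kind would miss paraphrased answers, so it is best read as a lower bound on answerability.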

Keywords:

automatic question generation, large language models, prompt engineering, educational technology, quality control
