
Department of Finance: AI-assisted Grading

Evaluation

Having implemented these features based on our research findings and regulatory requirements, we conducted a systematic evaluation to assess their effectiveness in real examination scenarios. Our evaluation strategy focused on two key aspects: pilot implementations across different disciplines and experimental validation of specific features. This two-step approach allowed us to validate both the practical utility of our platform and its success in addressing the challenges identified in our initial research, particularly regarding workload reduction and grading consistency.

Pilot Implementation

During the pilot implementations, our AI-assisted grading platform was used by lecturers who graded their real exams on it (three groups of graders and exams) or who tested the platform after their exam had already been graded on another platform (two test cases). For each pilot, we first assisted the lecturers with the initial preparations for AI-assisted grading (e.g., reviewing existing grading rubrics for AI compatibility or generating grading rubrics with AI).

After each exam, the responses were processed using AI (either GPT-4o or Claude 3.5 Sonnet) with a standardized prompting approach before human grading began. This initial AI pass enabled features such as content highlighting and answer ordering to support the human grading process on the platform. The exams were then graded by the pilot participants, and follow-up interviews allowed us to assess both the achievement of our goals (efficiency and fairness) and the general usability of the platform.
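
The exact prompts and pipeline used in the pilots are not reproduced here. As a rough illustration, such a pre-processing pass could look like the minimal sketch below, which assumes the OpenAI Python SDK and hypothetical rubric and answer strings; the actual platform's prompts, data handling, and post-processing differ.

```python
# Minimal sketch of an AI pre-processing pass for content highlighting
# (hypothetical names and prompt; not the prompts used in the pilots).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are assisting a human grader. Given a grading rubric and a student's "
    "answer, quote the exact passages of the answer that are relevant to each "
    "rubric criterion. Do not assign any points."
)

def preprocess_response(rubric: str, answer: str, model: str = "gpt-4o") -> str:
    """Ask the model to mark rubric-relevant passages; points are assigned by a human."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,  # keep the pre-processing step as deterministic as possible
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Rubric:\n{rubric}\n\nStudent answer:\n{answer}"},
        ],
    )
    return completion.choices[0].message.content

# Example call for a single (made-up) rubric criterion and answer.
print(preprocess_response(
    "Mentions that diversification reduces unsystematic risk.",
    "Spreading investments across assets lowers firm-specific risk.",
))
```

Answer ordering can be handled in a similar spirit, for example by asking the model for rubric-coverage scores (or by using embeddings) and sorting the responses before they are shown to the grader; all points remain assigned by the human grader.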

Our pilots encompassed four distinct types of open-ended responses, each presenting unique challenges for AI assistance:

  • Short Answers: Responses consisting of a few sentences, testing concise expression and key concept understanding
  • Long Essays: Extended responses up to 900 words, requiring complex argument evaluation
  • Legal Case Analysis: Structured essays following specific analytical frameworks
  • Math-based Questions: Combined textual explanations with mathematical formulas

We specifically focused on unstructured free-text responses, as these present the greatest challenges for consistent and efficient grading. While other response types, such as coding questions, can benefit from AI assistance, we excluded these from our study as effective automated solutions already exist for such structured formats.

The ethical considerations were identical for all use cases: while AI (large language models) was used to assist in grading through the features described above, all grading was performed by human graders to ensure fairness and accuracy.

|  | Finance | Political Science | Political Science | Mathematics | Law |
| --- | --- | --- | --- | --- | --- |
| Platform | OLAT | OLAT | OLAT | OLAT | Inspera |
| Assessment Type | Short Answers | Long Essays | Long Essays | Math-based Questions | Legal Case Analysis |
| Number of Students | 200 | 130 | 100 | 1'200 total (150 items graded on our platform) | 70 total (12 re-graded on our platform) |
| Exam Date | Jan. 2024 | May 2024 | May 2024 | June 2024 | June 2024 |
| Unique Aspects | 17 questions, manually created rubrics, 3 graders | 1 question, 10 AI-generated rubrics with partial points | 2 questions, 15-17 AI-generated rubrics with partial points | Dual representation of mathematical formulas (LaTeX and original) | Usage of manually assigned keywords |

Feedback from Pilots

The qualitative analysis through pilot interviews revealed valuable insights into the practical implementation of AI-assisted grading across different academic disciplines. Through detailed feedback from Law, Finance, Political Science, and Mathematics pilots, we evaluated core platform features including rubric-based assessment, content highlighting, answer ordering, and mathematical expression parsing. Each feature demonstrated distinct benefits while also highlighting discipline-specific requirements and areas for improvement.
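
The parsing component used for the Mathematics pilot is not documented in detail here. As an illustration of the "dual representation" idea (the student's original input plus a machine-readable form), the following minimal sketch uses SymPy's LaTeX parser; the function name and fallback behaviour are assumptions, not the platform's implementation.

```python
# Sketch of a dual representation for math-based answers: keep the student's
# original LaTeX and, where parsing succeeds, a normalized form for the grader.
# Assumes SymPy with its LaTeX parser (requires antlr4-python3-runtime).
from sympy.parsing.latex import parse_latex

def dual_representation(raw_latex: str) -> dict:
    """Return the original string plus a simplified form, falling back gracefully."""
    try:
        normalized = str(parse_latex(raw_latex).simplify())
    except Exception:
        normalized = None  # show only the original if the expression cannot be parsed
    return {"original": raw_latex, "normalized": normalized}

print(dual_representation(r"\frac{2x}{4}"))  # {'original': '\\frac{2x}{4}', 'normalized': 'x/2'}
```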

The interviews particularly emphasized the importance of maintaining human oversight, adapting to different grading practices, and ensuring seamless integration with existing e-Assessment platforms. This comprehensive evaluation provides crucial insights for future development and implementation of AI assistance in academic assessment.

Qualitative Analysis

Experimental Evaluation

Through systematic evaluation of multiple AI models, prompting strategies, and assessment types, we measured key performance metrics including grading accuracy and consistency. The experiments compared AI-generated grades with human-graded ground truth, revealing important distinctions between model capabilities and their optimal use cases. While larger models demonstrated superior performance with complex assessments, even smaller models proved valuable when appropriately configured, offering practical insights for implementing AI assistance in educational assessment.
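
The metrics themselves are not spelled out above. As an illustration, agreement between AI-suggested and human-assigned points per rubric item can be computed as in the following sketch, which uses made-up numbers (not pilot data) and assumes scikit-learn for the chance-corrected variant.

```python
# Illustrative agreement metrics between human ground truth and AI suggestions
# (made-up numbers, not the pilot results).
from sklearn.metrics import cohen_kappa_score

human = [2, 1, 0, 2, 1, 2, 0, 1]  # points awarded by the human grader per rubric item
ai    = [2, 1, 1, 2, 1, 2, 0, 2]  # points suggested by the model for the same items

exact_agreement = sum(h == a for h, a in zip(human, ai)) / len(human)
kappa = cohen_kappa_score(human, ai)  # agreement corrected for chance

print(f"exact agreement: {exact_agreement:.0%}, Cohen's kappa: {kappa:.2f}")
```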

Quantitative Analysis

Conclusion

Our evaluation combined pilot implementations with experimental validation to assess the effectiveness of our AI-assisted grading approach. The results provide encouraging insights into the potential benefits of AI assistance in exam grading while maintaining human oversight and judgment.

Key Findings

Pilot feedback revealed varying perceptions of different AI-assisted features. Content highlighting received particularly positive feedback, aligning with our interview findings where participants explicitly stated they “look for keywords when grading.” Four out of five interviewed lecturers stated they would “definitely use” an AI-assisted grading tool, with the fifth participant indicating willingness to use it for large exams.

Our experimental validation focused primarily on the technical aspects of AI-assisted grading, showing agreement rates above 80% with human graders in most test cases. Importantly, we observed that AI tends to be more lenient in grading compared to human evaluators, reinforcing our decision to focus on assistance rather than automation. These findings support our human-centric approach while highlighting the current limitations of AI in assessment.
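
The leniency observation can be made concrete as the mean signed difference between AI-suggested and human-awarded points, as in this small sketch with illustrative numbers (not the pilot results).

```python
# Quantifying leniency as a signed bias (illustrative numbers only).
from statistics import mean

human = [2.0, 1.0, 0.0, 2.0, 1.5, 2.0]  # human-awarded points per item
ai    = [2.0, 1.5, 0.5, 2.0, 1.5, 2.0]  # AI-suggested points for the same items

bias = mean(a - h for a, h in zip(ai, human))
print(f"mean AI - human difference: {bias:+.2f} points")  # positive => AI is more lenient
```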

A notable observation from the pilots was that basic workflow improvements and structured rubric implementation already provided tangible benefits compared to traditional OLAT grading methods. This aligns with our interview findings where three out of five participants emphasized the importance of predefined grading criteria, with one participant noting that “having computers try to grade exams is a waste of time and resources” but that supporting tools for workflow optimization would be valuable.

Achievement of Goals

Our platform addressed several key challenges identified in our interviews, particularly around workflow optimization and grading consistency. Pilot participants reported positive experiences with the rubric features and content highlighting, reflecting interview feedback where participants expressed willingness to “apply certain structural restrictions to their exams to reduce the grading workload.” The human-centric design of our AI assistance features aligned well with interview findings that emphasized the importance of maintaining grader autonomy.

Limitations and Future Directions

Our evaluation revealed various areas for future improvement. The current implementation shows limitations in handling specialized technical vocabulary and complex mathematical proofs. The varying perceived utility of different features suggests room for improvement in making benefits more apparent to users, particularly for subtle workflow enhancements like answer ordering.

Looking forward, our initial results suggest promising directions for future development. Interview participants proposed various valuable features for future implementation, including adaptive learning from grading patterns, enhanced visual feedback during grading, and automatic plot comparison capabilities. These developments will continue to focus on supporting, rather than replacing, human expertise in educational assessment.
