Qualitative Analysis

We conducted follow-up interviews with all graders after their exam and grading process had concluded. The interviews focused on evaluating both specific platform features (rubric-based assessment, content highlighting, answer ordering, and mathematical expression parsing) and general platform usability. Through these interviews, we gathered valuable insights into the effectiveness of AI assistance in real grading scenarios, identified areas for improvement, and collected recommendations for future implementations.

Rubric-Based Assessment

Rubric-based grading emerged as a fundamental feature for ensuring consistent and fair assessment. The Law pilot participant noted that rubrics were “generally helpful for fair grading,” while Finance graders emphasized their contribution to grading consistency. However, implementation experiences revealed important nuances. The Mathematics pilot suggested reviewing and refining rubrics after grading approximately 100 responses, as initial rubrics sometimes needed adjustment based on actual student answer patterns. Political Science graders recommended using smaller, binary rubrics rather than complex ones with partial points to improve clarity and consistency.
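To illustrate the difference between the two rubric styles, the sketch below models rubric items in Python. The field names, point values, and criteria are illustrative assumptions rather than the platform's actual data model; the point is that splitting one partial-point criterion into several binary items turns each grading decision into a simple yes/no judgment:

    from dataclasses import dataclass

    @dataclass
    class RubricItem:
        # Hypothetical structure for illustration; not the platform's actual schema.
        description: str
        points: float        # awarded in full when the criterion is met
        binary: bool = True  # binary items are either met or not met

    # One complex criterion graded with partial points ...
    partial_rubric = [RubricItem("Identifies and applies the relevant concept", 3.0, binary=False)]

    # ... versus the same content split into small binary items, as the Political Science graders recommended.
    binary_rubric = [
        RubricItem("Names the relevant concept", 1.0),
        RubricItem("States its defining conditions", 1.0),
        RubricItem("Applies it to the case at hand", 1.0),
    ]

    def score(items_met: list[bool], rubric: list[RubricItem]) -> float:
        # Sum the points of all criteria the human grader marked as met.
        return sum(item.points for met, item in zip(items_met, rubric) if met)

    print(score([True, True, False], binary_rubric))  # 2.0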

The Law pilot highlighted a particularly significant consideration for legal case analysis: while rubrics provided good structure, graders still needed to reference example solutions frequently, as legal responses typically follow specific structural requirements and argumentation patterns. This suggests that rubrics should complement rather than replace traditional grading resources, especially in domains with strict response formats.

This domain-specific insight helps demonstrate how AI-assisted grading needs to be adapted for different subject areas while maintaining core functionality.

Rubric Highlighting

Content highlighting emerged as a crucial feature with varying effectiveness across different answer types and disciplines. The Law pilot found it “very helpful for long answers” but less useful for shorter responses, particularly when highlighting overlapped. Political Science reported that highlighting “worked surprisingly well,” while Finance noted some inconsistency in highlighting accuracy.

The highlighting functionality operates by analyzing student responses and visually connecting text passages to specific rubrics through color coding. This visual aid helps graders quickly identify relevant content while maintaining their autonomy in determining whether the highlighted content actually fulfills the rubric requirements. However, pilot feedback revealed several areas for improvement:

  • Law pilot suggested highlighting complete sentences rather than just keywords
  • Finance noted inconsistent performance with unstructured responses
  • Need for better handling of overlapping highlights
  • Suggestion to improve keyword recognition (e.g., identifying “Spezialprävention” when searching for “spezialpräventiv”)

The highlighting feature operates independently of any automated grading decisions, serving purely as a cognitive aid. This design choice helps graders focus their attention while preventing the introduction of AI bias into the grading process.
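A minimal sketch of this mechanism is shown below. It assumes each rubric item comes with hand-picked keyword stems (an assumption for illustration; the platform's internal matching is not described here) and returns whole-sentence spans, which also reflects two of the suggestions above: highlighting complete sentences and catching morphological variants such as "Spezialprävention" for "spezialpräventiv".

    import re

    # Hypothetical rubric-to-stem mapping; rubric names, stems, and the matching rule are illustrative.
    RUBRIC_STEMS = {
        "R1: discusses Spezialprävention": ["spezialprävent"],  # stem covers the noun and the adjective
        "R2: names the statutory basis": ["stgb"],
    }

    def highlight(answer: str, rubric_stems: dict[str, list[str]]) -> list[tuple[str, int, int]]:
        # Return (rubric_id, start, end) spans over whole sentences, so the grader
        # sees the surrounding context rather than an isolated keyword.
        spans = []
        for m in re.finditer(r"[^.!?]+[.!?]?", answer):  # naive sentence splitter
            sentence = m.group().lower()
            for rubric_id, stems in rubric_stems.items():
                if any(stem in sentence for stem in stems):
                    spans.append((rubric_id, m.start(), m.end()))
        return spans

    answer = "Die Strafe wirkt spezialpräventiv. Grundlage ist das StGB."
    for rubric_id, start, end in highlight(answer, RUBRIC_STEMS):
        print(rubric_id, "->", answer[start:end].strip())

In the interface, each rubric would map to its own highlight color; the grader still decides whether a highlighted sentence actually satisfies the rubric.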

Answer Ordering

The similarity-based ordering of responses proved valuable for reducing cognitive load and improving grading consistency. The Law pilot found it “an excellent idea and noticeable,” particularly appreciating how it simplified the review of similar responses compared to their usual process in Inspera, where they had to manually navigate back to find similar answers. Finance graders “loved that similar exercises were grouped,” especially when grading questions that had multiple correct solution approaches.

The Mathematics pilot confirmed the ordering was helpful, though they initially questioned whether answers were presented in their original submission order. This illustrates how unobtrusive the feature is: graders often did not notice the ordering was AI-assisted until explicitly told, suggesting a natural integration into their workflow. The feature was particularly appreciated in scenarios where multiple correct approaches were possible. For example, in the Finance pilot, graders noted its usefulness when “there are multiple ways to get full points”, such as cases where students could argue either for moral hazard or adverse selection to receive full credit. One grader proposed adding the ability to name groups of similar answers to facilitate later review of specific answer patterns.
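The platform's actual similarity model is not detailed here. As a rough illustration of the idea, the sketch below orders answers with a greedy nearest-neighbour walk over TF-IDF cosine similarities, so near-identical responses end up adjacent in the grading queue; the answer texts are invented examples based on the moral hazard / adverse selection case mentioned above.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def similarity_order(answers: list[str]) -> list[int]:
        # Greedy nearest-neighbour ordering: start with the first answer and always
        # continue with the most similar answer that has not been graded yet.
        sim = cosine_similarity(TfidfVectorizer().fit_transform(answers))
        np.fill_diagonal(sim, -1.0)  # an answer is never its own neighbour
        order, current = [0], 0
        remaining = set(range(1, len(answers)))
        while remaining:
            current = max(remaining, key=lambda j: sim[current, j])
            order.append(current)
            remaining.remove(current)
        return order

    answers = [
        "Moral hazard: the insured party changes its behaviour after signing the contract.",
        "Adverse selection means that mainly high-risk customers buy the policy.",
        "The behaviour of the insured changes once covered, which is moral hazard.",
    ]
    print(similarity_order(answers))  # e.g. [0, 2, 1]: the two moral-hazard answers are adjacent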

Mathematical Expression Parsing

Mathematical expression parsing and display proved particularly valuable in the Mathematics pilot, where formula standardization significantly enhanced grading efficiency. The Mathematics lecturer reported that “formula display helped extremely” with grading, especially when the displayed formulas matched expected solutions. The standardized LaTeX rendering allowed for quick verification of correct answers, while making it easier to identify incorrect expressions that required closer inspection.

However, the pilot also revealed areas for improvement. The lecturer suggested enhancing the parsing functionality to better recognize equivalent expressions in different forms, noting that “one would need different rubrics” to account for various correct ways of expressing the same mathematical solution. This feedback highlights the importance of flexible formula parsing that can identify mathematically equivalent expressions regardless of their specific representation.
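One way to recognize mathematically equivalent expressions, regardless of how the student wrote them, is to check whether the difference between the student expression and the reference simplifies to zero. The sketch below uses SymPy for this; it assumes the answer has already been parsed into SymPy-readable syntax, which is itself the part of the pipeline the lecturer's feedback concerns, so this illustrates the equivalence check rather than the platform's parser.

    import sympy as sp

    def equivalent(student: str, reference: str) -> bool:
        # Treat two expressions as equal when their difference simplifies to zero,
        # so algebraically different but equivalent forms earn the same rubric points.
        return sp.simplify(sp.sympify(student) - sp.sympify(reference)) == 0

    reference = "(x**2 - 1)/(x - 1)"
    for student in ["x + 1", "(x - 1)*(x + 1)/(x - 1)", "x - 1"]:
        print(student, "->", equivalent(student, reference), "| rendered:", sp.latex(sp.sympify(student)))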

Grading Review and Outlier Detection

The grading review functionality received mixed feedback across different pilot implementations, though all participants recognized its potential value for quality assurance. The Political Science pilot actively used the outlier detection capabilities and found value in reviewing answers with high outlier scores. The Law pilot, while noting that the review functionality didn't perfectly align with their traditional review process focusing on borderline cases, acknowledged the potential benefits of systematic outlier detection. The Finance pilot appreciated the ability to flag and filter responses for review, suggesting this could streamline their quality assurance process.

Several suggestions for improvement emerged from the pilots:
  • Better filtering options for flagged responses
  • Enhanced visibility of review candidates
  • Integration of grade curve visualization
  • More flexible navigation between flagged items
  • Improved access to similar responses for comparison (“grouping”)

This feedback indicates that while the review functionality offers promising features for quality assurance, it needs to be better aligned with established review practices and made more accessible through improved user interface options. The focus should be on supporting existing quality assurance processes while maintaining the benefits of systematic outlier detection.
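The exact outlier metric used by the platform is not described here. A simple illustration of the underlying idea is to compare the points awarded to each answer with the points awarded to its most similar answers and flag large deviations for a second look; the sketch below does this with TF-IDF similarity, and the function name and the choice of k are assumptions made for illustration.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def outlier_scores(answers: list[str], points: list[float], k: int = 3) -> np.ndarray:
        # For each answer, measure how far its awarded points deviate from the mean
        # points of its k most similar answers; large values are candidates for review.
        sim = cosine_similarity(TfidfVectorizer().fit_transform(answers))
        np.fill_diagonal(sim, -1.0)  # exclude the answer itself
        pts = np.asarray(points, dtype=float)
        scores = np.empty(len(answers))
        for i in range(len(answers)):
            neighbours = np.argsort(sim[i])[::-1][:k]  # indices of the k most similar answers
            scores[i] = abs(pts[i] - pts[neighbours].mean())
        return scores

Answers could then be sorted by this score in the review view, complementing rather than replacing the graders' established focus on borderline cases.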

Platform Integration

A critical limitation emerged during our pilot phase regarding platform integration. Two pilot cases, Law and Mathematics, highlighted the importance of seamless integration with existing e-Assessment platforms. The Law pilot ultimately could not use our platform for their actual grading process because their faculty requires student feedback to be provided directly within their examination system, and that system (Inspera) did not support importing detailed grading decisions. Similarly, while the Mathematics pilot showed interest in using our platform for a full exam, the lack of integration capabilities with OLAT prevented them from doing so, as they wanted the grading review with students to take place in OLAT. These experiences underscore that while AI assistance offers valuable benefits, the ability to integrate with established e-Assessment platforms is crucial for practical adoption in real examination scenarios.