Department of Finance: AI-assisted Grading

Quantitative Analysis

After completing all the pilots and collecting human-graded responses for the exams as a ground truth, we conducted extensive experiments to evaluate different AI grading configurations. These experiments focused on measurable quantities such as the consistency of AI models when grading, the quality of the proposed answer ordering, and the correlation between AI-based grading and the human-graded responses. Through this experimental evaluation, we gathered practical lessons about which prompts, parameters, and AI models work well for AI assistance, for example which types of prompts suit which models.

Experimental Setup

The experiments covered the following AI configurations:

  • Multiple AI models (GPT-4o, GPT-4o-mini, Claude 3.5 Sonnet, Claude 3.5 Haiku, Llama 3.1 70B)
  • Various prompting strategies (single comprehensive prompt vs. individual rubric prompts)
  • Different prompt structures (with/without reasoning, with/without explicit scales)

For each configuration, we ran three grading iterations and used the averaged results to ensure consistent evaluation. The AI-generated grading vectors were compared with human-graded ground truth using several metrics:

  • Mean absolute error to measure overall grading accuracy
  • Root mean squared error to identify significant grading discrepancies
  • Accuracy scores for binary (yes/no) rubrics (only in the Finance exam)
  • Correlation between AI and human-assigned total points
  • Context switch cost to evaluate answer ordering effectiveness

The context switch metric is particularly important for our use case, as it quantifies how much mental effort graders need to expend when transitioning between consecutive answers. A lower context switch score indicates that consecutive answers are more similar, reducing cognitive load during grading.
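To make these metrics concrete, the sketch below shows one way to compute them from grading vectors that hold one normalized score per rubric (the 0-to-1 convention our prompts use). The function names, the 0.5 threshold for binary rubrics, the per-rubric point weights, and the L1 reading of the context switch cost are illustrative assumptions rather than the exact implementation of our pipeline.

```python
import numpy as np
from scipy.stats import pearsonr

# Sketch only: `ai` and `human` are arrays of shape (n_answers, n_rubrics),
# each row being a grading vector of normalized scores between 0 and 1.

def mean_absolute_error(ai, human):
    """Overall grading accuracy: average absolute deviation per rubric score."""
    return np.mean(np.abs(ai - human))

def root_mean_squared_error(ai, human):
    """Penalizes large grading discrepancies more strongly than MAE."""
    return np.sqrt(np.mean((ai - human) ** 2))

def binary_accuracy(ai, human, threshold=0.5):
    """Accuracy for yes/no rubrics; the 0.5 threshold is an assumption."""
    return np.mean((ai >= threshold) == (human >= threshold))

def total_points_correlation(ai, human, rubric_points):
    """Correlation between AI- and human-assigned total points per answer,
    assuming each rubric carries a fixed number of points."""
    return pearsonr(ai @ rubric_points, human @ rubric_points)[0]

def context_switch_cost(ordered_vectors):
    """Average difference between consecutive grading vectors in the proposed
    grading order, read here as the mean L1 distance between neighbours."""
    return np.abs(np.diff(ordered_vectors, axis=0)).sum(axis=1).mean()
```

Under this reading, any ordering that places similar answers next to each other drives the context switch cost down, which is exactly the property that reduces cognitive load for the grader.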

This systematic evaluation helps us understand the practical applicability of AI assistance in exam grading, while maintaining focus on supporting rather than replacing human graders.

Prompting Strategies

Our evaluation tested several prompting strategies, each building upon a baseline approach with increasing complexity (a structural sketch of these variants follows the list):

  • Basic Prompting
    The baseline prompt provided a simple structure where the AI model received all rubrics and returned normalized scores between 0 and 1 for each rubric. Each rubric consisted of one or more sentences describing an independent aspect of the answer to be evaluated.
  • Reasoning Enhancement
    Building on the baseline, we tested two reasoning-based variants:
    • Baseline with Reasoning: Required explicit justification before scoring
    • Baseline with Scale and Reasoning: Added a generic ten-level achievement scale (0.0 to 1.0 in 0.1 steps) with verbal descriptions
  • Rubric-Specific Scales
    A more specialized variant incorporated custom achievement scales for each rubric, requiring the AI to consider rubric-specific grading criteria while providing reasoning.
  • Per-Rubric Evaluation
    We also tested an alternative approach that evaluated one rubric at a time:
    • Basic: Evaluated a single rubric with previous rubrics and associated scores as context
    • With Reasoning: Added explicit reasoning requirement
    • With Scale and Reasoning: Included generic achievement scale and reasoning
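The snippet below is a rough structural sketch of these variants rather than a reproduction of our actual prompts: a comprehensive prompt presenting all rubrics at once versus a per-rubric prompt that carries previously evaluated rubrics and their scores as context, with reasoning and the generic achievement scale as optional additions. All wording, including the scale labels, is illustrative.

```python
# Generic achievement scale (0.0 to 1.0 in 0.1 steps); verbal labels are illustrative.
GENERIC_SCALE = "\n".join(
    f"{level / 10:.1f}: {label}"
    for level, label in enumerate([
        "not addressed", "barely addressed", "minimal attempt", "some relevant content",
        "partially correct", "about half correct", "mostly on track", "largely correct",
        "correct with minor gaps", "nearly complete", "fully correct",
    ])
)

def comprehensive_prompt(question, answer, rubrics, reasoning=False, scale=False):
    """Baseline: all rubrics in a single prompt, one normalized score (0..1) per rubric."""
    parts = [f"Question:\n{question}", f"Student answer:\n{answer}",
             "Rubrics:\n" + "\n".join(f"{i + 1}. {r}" for i, r in enumerate(rubrics))]
    if scale:
        parts.append("Achievement scale:\n" + GENERIC_SCALE)
    if reasoning:
        parts.append("For each rubric, give a short justification before the score.")
    parts.append("Return one score between 0 and 1 for each rubric.")
    return "\n\n".join(parts)

def per_rubric_prompt(question, answer, rubric, previous, reasoning=False, scale=False):
    """Per-rubric variant: one rubric at a time, with earlier rubrics and scores as context."""
    parts = [f"Question:\n{question}", f"Student answer:\n{answer}"]
    if previous:  # list of (rubric_text, score) pairs graded so far
        parts.append("Previously evaluated rubrics:\n" +
                     "\n".join(f"- {r}: {s:.1f}" for r, s in previous))
    parts.append("Rubric to evaluate now:\n" + rubric)
    if scale:
        parts.append("Achievement scale:\n" + GENERIC_SCALE)
    if reasoning:
        parts.append("Justify your decision briefly, then give the score.")
    parts.append("Return a single score between 0 and 1.")
    return "\n\n".join(parts)
```

Combining these two builders with the `reasoning` and `scale` flags mirrors most of the variants above; the rubric-specific scale variant would simply swap GENERIC_SCALE for a custom scale per rubric.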

Large vs. Small LLMs

Our analysis revealed important distinctions between more and less capable language models when applied to exam grading tasks. Larger models like GPT-4o and Claude 3.5 Sonnet demonstrated the ability to handle complex prompts with multiple rubrics simultaneously, while maintaining consistency across grading decisions. In contrast, smaller models like GPT-4o-mini and Llama 3.1 70B struggled with comprehensive prompts, often failing to output the correct number of rubrics or providing inconsistent reasoning. This capability gap influenced our approach to prompt engineering and model selection, leading to different optimal strategies depending on model size.

This distinction is particularly important as it affects both the cost and complexity of implementing AI-assisted grading. While larger models like Claude 3.5 Sonnet can process comprehensive prompts more efficiently, smaller models like GPT-4o-mini can still provide valuable assistance when tasks are appropriately simplified, such as evaluating one rubric at a time.

Grading Consistency

Our analysis revealed significant differences in grading consistency across different AI models. Grading consistency refers to how reliably a model assigns the same scores when evaluating the same answer multiple times - an important factor for maintaining fairness in assessment. To measure consistency, we ran each grading task three times and analyzed the variance between runs.
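In code, this boils down to something like the following sketch (array layout and names are ours, not the actual pipeline): grade the same exam in several independent runs, measure the per-rubric variance across those runs, and average the runs into the grading vectors used for the accuracy metrics.

```python
import numpy as np

# Sketch only: `runs` has shape (n_runs, n_answers, n_rubrics), e.g. three
# independent grading runs of the same exam with the same model and prompt.

def run_variance(runs):
    """Average per-rubric variance across repeated runs; lower values mean the
    model scores the same answer more consistently."""
    return np.var(runs, axis=0).mean()

def averaged_grades(runs):
    """Grading vectors averaged over the runs; averaging mainly helps models
    whose single-run scores fluctuate, as noted below."""
    return runs.mean(axis=0)
```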

The Anthropic models (Claude 3.5 Sonnet and Claude 3.5 Haiku) demonstrated the highest consistency, particularly when evaluating scalar rubrics that required more nuanced scoring decisions. This was especially evident in the Political Science exams, where small grading differences could result in large point variations. In contrast, GPT-4o showed higher variance in its predictions for Political Science exercises, while Llama 3.1 70B demonstrated increased variance in the Finance exam.

Multiple grading runs improved consistency for GPT-4o, GPT-4o-mini, and Llama 3.1 70B, but provided minimal benefits for the already consistent Anthropic models. This suggests that while all models can provide useful grading assistance, the choice of model significantly affects grading reliability, particularly for complex assessment tasks with scalar rubrics. The Anthropic models' consistency advantage was most pronounced in exercises whose scalar rubrics were worth more points, where small scoring differences translate into large point variations.

Model Performance

To evaluate model performance systematically, we computed several metrics comparing AI-generated grading decisions with human-graded ground truth. Each model configuration was run three times to ensure reliable results, with metrics calculated both per rubric and per exercise. Our evaluation framework included mean absolute error (MAE) and root mean squared error (RMSE) to measure grading accuracy, variance between runs to assess consistency, accuracy scores for binary rubrics, correlation between AI- and human-assigned total points, and context switch cost measuring the average difference between consecutive grading vectors.
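As a hedged sketch of that per-rubric and per-exercise breakdown (column names and the point weighting are illustrative assumptions), per-rubric metrics compare individual rubric decisions across answers, while per-exercise metrics compare the total points each answer receives:

```python
import pandas as pd

# Sketch only: `df` holds one row per (exercise, answer, rubric) with the averaged
# AI score, the human score, and the points attached to the rubric.

def per_rubric_mae(df: pd.DataFrame) -> pd.Series:
    """Mean absolute error per rubric across all graded answers."""
    return (df["ai_score"] - df["human_score"]).abs().groupby(df["rubric_id"]).mean()

def per_exercise_correlation(df: pd.DataFrame) -> pd.Series:
    """Correlation between AI- and human-assigned total points within each exercise."""
    totals = (
        df.assign(ai_points=df["ai_score"] * df["max_points"],
                  human_points=df["human_score"] * df["max_points"])
          .groupby(["exercise_id", "answer_id"])[["ai_points", "human_points"]]
          .sum()
    )
    return totals.groupby("exercise_id").apply(
        lambda g: g["ai_points"].corr(g["human_points"])
    )
```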

Claude 3.5 Sonnet (claude-3-5-sonnet-20241022)

Claude 3.5 Sonnet consistently demonstrated superior performance across all metrics, particularly excelling in scalar rubrics and complex assessments. The model showed exceptional consistency without requiring multiple grading runs, maintained strong correlation with human grading patterns, and achieved the best context switch scores. It proved especially effective at providing balanced scoring across the full range of student performance, avoiding the overly optimistic scoring tendencies seen in other models.

Claude 3.5 Haiku (claude-3-5-haiku-20241022)

Claude 3.5 Haiku performed as the second-best model overall, showing similar strengths to Sonnet but with slightly lower performance metrics. Like Sonnet, it demonstrated high consistency without requiring multiple runs and performed particularly well on scalar rubrics. The model maintained reliable performance across different assessment types and showed good correlation with human grading patterns.

GPT-4o (gpt-4o-2024-08-06)

GPT-4o showed mixed performance across different tasks. While it excelled at binary rubrics and achieved good context switch scores (ranking second in this metric), it struggled more with scalar rubrics and complex assessments. The model benefited from multiple grading runs to improve consistency and showed a tendency toward optimistic scoring, particularly for lower-performing responses.

GPT-4o-mini (gpt-4o-mini-2024-07-18)

GPT-4o-mini performed well when tasks were appropriately structured, particularly with per-rubric prompting and binary rubrics. However, it struggled with comprehensive prompts and maintaining consistency across multiple rubrics. The model required multiple grading runs to improve reliability and showed higher variance in predictions, especially for complex scalar rubrics.

Llama 3.1 70B Instruct Turbo (TogetherAI)

Llama 3.1 70B consistently ranked lowest among the tested models, though it still provided usable results for basic grading assistance. The model showed higher variance in predictions, particularly in the Finance exam, and required multiple grading runs to improve consistency. It performed adequately on binary rubrics but struggled more with complex scalar assessments.

Conclusion

The analysis revealed clear differences in performance across AI models, with distinct patterns emerging for different types of assessment tasks. While Claude 3.5 Sonnet consistently demonstrated superior performance across all metrics, followed by Claude 3.5 Haiku, GPT-4o, GPT-4o-mini, and Llama 3.1 70B, all models provided acceptable results for basic grading assistance. Binary rubrics showed similar performance levels across models, while larger performance gaps emerged with scalar rubrics and complex assessments. This suggests that while model choice impacts grading reliability, even simpler models can provide valuable assistance when tasks are appropriately structured.