---
category: academic
type: academic
person: Yanxin Lu
date: 2014-12
source: master_defense_2014.pptx
---

# Master's Thesis Defense: Improving Peer Evaluation Quality in MOOCs

Yanxin Lu, December 2014. 40 slides.

## Slide 2: Title

Improving Peer Evaluation Quality in MOOCs — Yanxin Lu, December 2014

## Slides 3–4: Summary

- Motivations and Problems
- Experiment
- Statistical Analysis
- Results
- Conclusion

## Slide 5: What is a MOOC?

## Slide 6: Intro to Interactive Programming in Python

- Coursera course, 120,000 enrolled, 7,500 completed

## Slides 7–8: Example Assignments

- Stopwatch
- Memory game

## Slide 9: Grading Rubric for Stopwatch

- 1 pt: Program successfully opens a frame with the stopwatch stopped
- 2 pts: Program correctly draws the number of successful stops (at a whole second) versus total stops

## Slide 10: Peer Grading

- Example scores: 1, 9, 9, 9, 10 → Score = 9

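The slides don't name the aggregation rule, but the example above (1, 9, 9, 9, 10 collapsing to 9) is consistent with taking the median of the peer scores, which discounts a single outlier grade:

```python
from statistics import median

# The lone low grade of 1 does not drag the result down,
# which is the point of a median-based aggregation.
peer_scores = [1, 9, 9, 9, 10]
final_score = median(peer_scores)
print(final_score)  # 9
```

A mean over the same scores would give 7.6, so the choice of a robust statistic matters when one grader is careless.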
## Slide 11: Quality is Highly Variable

- Lack of effort
- Small bugs require more effort

## Slide 12: Solution

A web application where students can:

- Look at other peer evaluations
- Grade other peer evaluations

## Slide 13: Findings

- Grading evaluations has the strongest effect
- Merely knowing that one's own peer evaluation will be examined does not have a significant effect
- Strong effect on peer evaluation quality simply because students know they are being studied

## Slide 15: Experiment Summary

- Sign up → Stopwatch → Memory

## Slide 16: Sign up

- Web consent form, three groups, prize
- Nothing about specific study goals or what was being measured
- 3,015 students

## Slide 17: Three Groups

- G1: Full treatment, grading + viewing
- G2: Only viewing
- G3: Control group
- Size ratio G1:G2:G3 = 8:1:1

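An 8:1:1 split over 3,015 participants can be sketched as weighted random assignment. The group names match the slides; the seed and function name are mine, and the slides do not say how the randomization was actually implemented:

```python
import random

def assign_groups(students, seed=0):
    """Randomly assign each student to G1, G2, or G3 with 8:1:1 weights."""
    rng = random.Random(seed)
    groups = {"G1": [], "G2": [], "G3": []}
    for s in students:
        g = rng.choices(["G1", "G2", "G3"], weights=[8, 1, 1])[0]
        groups[g].append(s)
    return groups

groups = assign_groups(range(3015))
# Roughly 80% of students land in G1, ~10% each in G2 and G3.
```

Weighting the full-treatment group heavily maximizes data on the intervention while keeping small comparison groups.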
## Slides 18–24: Experiment Phases

- Submission Phase: Submit programs before the deadline
- Evaluation Phase: 1 self evaluation + 5 peer evaluations per rubric item (score + optional comment)
- Grading Evaluation Phase (G1): Web app; each evaluation × rubric item rated Good/Neutral/Bad
- Viewing Phase (G1, G2): See the number of good/neutral/bad ratings and their own evaluation

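The viewing phase amounts to tallying, per evaluation, the Good/Neutral/Bad ratings it received in the grading phase. A minimal sketch, assuming a record shape of my own invention (the slides only describe the ratings, not the data model):

```python
from collections import Counter

# Hypothetical records: (evaluation_id, rubric_item, rating).
ratings = [
    ("eval-17", 1, "good"), ("eval-17", 2, "neutral"),
    ("eval-17", 1, "good"), ("eval-42", 1, "bad"),
]

def rating_summary(records, evaluation_id):
    """Counts an evaluation's author would see in the viewing phase."""
    return Counter(r for eid, _item, r in records if eid == evaluation_id)

summary = rating_summary(ratings, "eval-17")
print(summary)  # Counter({'good': 2, 'neutral': 1})
```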
## Slide 25: Statistics

- Most evaluations are graded three times

## Slide 27: Goal

- Determine whether G1 produces better evaluations than G2, G3, or both
- Measuring quality: correctness of scores and comment length
- Reject a set of null hypotheses

## Slide 28: Bootstrapping

- Simulation-based method using resampling with replacement
- Statistically significant: p-value <= 0.05

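A minimal sketch of the kind of bootstrap test the slides describe: resample with replacement from the pooled data (the null hypothesis that the two groups are alike) and ask how often a difference at least as large as the observed one appears. The function name and the pooled-resampling variant are my assumptions; the thesis's exact test statistics are not given on these slides:

```python
import random

def bootstrap_p_value(sample_a, sample_b, n_resamples=10_000, seed=0):
    """One-sided bootstrap test of mean(a) - mean(b) > 0.

    Resamples with replacement from the pooled data and counts how
    often the resampled difference meets or exceeds the observed one.
    """
    rng = random.Random(seed)
    mean = lambda xs: sum(xs) / len(xs)
    observed = mean(sample_a) - mean(sample_b)
    pooled = list(sample_a) + list(sample_b)
    extreme = 0
    for _ in range(n_resamples):
        a = rng.choices(pooled, k=len(sample_a))
        b = rng.choices(pooled, k=len(sample_b))
        if mean(a) - mean(b) >= observed:
            extreme += 1
    return extreme / n_resamples
```

Applied to, say, comment lengths in G1 versus G3, a returned p-value at or below 0.05 would count as significant under the slide's criterion.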
## Slide 30: Terms

- Good programs: correct (machine grader verified)
- Bad programs: incorrect
- Bad job: incorrect grade OR no comment
- Really bad job: incorrect grade AND no comment

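The "bad job" / "really bad job" definitions above translate directly into a small classifier. Argument names are mine; per the slides, the reference score comes from a machine grader:

```python
def classify(assigned_score, true_score, comment):
    """Label one peer evaluation using the thesis's terms."""
    wrong_score = assigned_score != true_score
    no_comment = not comment.strip()
    return {
        "bad_job": wrong_score or no_comment,        # either failure counts
        "really_bad_job": wrong_score and no_comment, # both failures together
    }

classify(assigned_score=7, true_score=9, comment="")
# Wrong score AND no comment: a "really bad job" (and hence also a "bad job").
```

Note the asymmetry: a correct score with no comment is still a "bad job", but only the combination of a wrong score and no comment is a "really bad job".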
## Slides 31–38: Results

Hypothesis tests on comment length, "bad job" fraction, and "really bad job" fraction across groups on good and bad programs.

## Slide 39: Findings

- Grading evaluations has the strongest positive effect
- Merely knowing that one's own peer evaluation will be examined does not have a significant effect
- Strong Hawthorne effect: improvement simply from knowing they are being studied

## Slide 40: Conclusion

- A web application for assessing peer evaluations
- The study has a positive effect on the quality of peer evaluations
- Implications beyond peer evaluations