---
category: academic
type: academic
person: Yanxin Lu
date: 2014-12
source: master_defense_2014.pptx
---

# Master's Thesis Defense: Improving Peer Evaluation Quality in MOOCs

Yanxin Lu, December 2014. 40 slides.

## Slide 2: Title

Improving Peer Evaluation Quality in MOOCs, Yanxin Lu, December 2014

## Slides 3–4: Outline

- Motivation and Problems
- Experiment
- Statistical Analysis
- Results
- Conclusion

## Slide 5: What is a MOOC?

## Slide 6: An Introduction to Interactive Programming in Python

- Coursera course: 120,000 enrolled, 7,500 completed

## Slides 7–8: Example Assignments

- Stopwatch
- Memory game

## Slide 9: Grading Rubric for Stopwatch

- 1 pt: Program successfully opens a frame with the stopwatch stopped
- 2 pts: Program correctly draws the number of successful stops at a whole second versus the total number of stops

## Slide 10: Peer Grading

- Example scores: 1, 9, 9, 9, 10 → Score = 9

## Slide 11: Quality Is Highly Variable

- Some evaluators put in little effort
- Programs with small bugs require more effort to grade accurately

## Slide 12: Solution

A web application where students can:

- Look at other students' peer evaluations
- Grade other students' peer evaluations

## Slide 13: Findings (Preview)

- Grading other evaluations has the strongest positive effect on evaluation quality
- The knowledge that one's own peer evaluation will be examined does not
- Quality also improves simply because students know they are being studied

## Slide 15: Experiment Summary

- Sign up → Stopwatch → Memory

## Slide 16: Sign-up

- Web consent form, three groups, prize
- Nothing revealed about the specific study goals or what was being measured
- 3,015 students

## Slide 17: Three Groups

- G1: Full treatment (grading + viewing)
- G2: Viewing only
- G3: Control group
- Size ratio G1:G2:G3 = 8:1:1

## Slides 18–24: Experiment Phases

- Submission Phase: Submit programs before the deadline
- Evaluation Phase: 1 self evaluation + 5 peer evaluations; per rubric item, a score plus an optional comment
- Grading Evaluation Phase (G1 only): In the web app, rate each evaluation × rubric item as Good/Neutral/Bad
- Viewing Phase (G1 and G2): See the number of good/neutral/bad ratings received on their own evaluations

## Slide 25: Statistics

- Most evaluations were graded three times

## Slide 27: Goal

- Test whether G1 produces better evaluations than G2, G3, or both
- Measuring quality: correctness of scores, comment length
- Reject a set of null hypotheses

## Slide 28: Bootstrapping

- Simulation-based method using resampling with replacement
- Statistically significant: p-value ≤ 0.05

## Slide 30: Terms

- Good programs: correct (verified by a machine grader)
- Bad programs: incorrect
- Bad job: incorrect grade OR no comment
- Really bad job: incorrect grade AND no comment

## Slides 31–38: Results

Hypothesis tests on comment length, "bad job" fraction, and "really bad job" fraction across groups, on both good and bad programs.

## Slide 39: Findings

- Grading evaluations has the strongest positive effect
- The knowledge that one's own peer evaluation will be examined does not
- Strong Hawthorne effect: quality improves simply from knowing one is being studied

## Slide 40: Conclusion

- A web application for assessing peer evaluations
- Participation in the study has a positive effect on the quality of peer evaluations
- Implications extend beyond peer evaluations
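
Slide 28 describes bootstrapping only at a high level. A minimal sketch of such a resampling test is below; it compares mean comment length between two groups under the null hypothesis that group membership has no effect. The function name, sample data, and resample count are illustrative assumptions, not taken from the thesis.

```python
import random

def bootstrap_p_value(group_a, group_b, n_resamples=10_000, seed=0):
    """Two-sided bootstrap test for a difference in means.

    Pools both samples (the null hypothesis: no group effect),
    resamples each group with replacement from the pool, and counts
    how often the resampled difference in means is at least as
    extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_resamples):
        sample_a = [rng.choice(pooled) for _ in group_a]
        sample_b = [rng.choice(pooled) for _ in group_b]
        diff = abs(sum(sample_a) / len(sample_a) - sum(sample_b) / len(sample_b))
        if diff >= observed:
            extreme += 1
    return extreme / n_resamples

# Hypothetical comment lengths (in characters) for two groups
g1 = [120, 95, 140, 80, 110, 130, 150, 100]
g3 = [40, 60, 35, 70, 55, 45, 50, 65]
p = bootstrap_p_value(g1, g3)
print(p < 0.05)  # True means: reject the null at the 5% level
```

With clearly separated samples like these, the resampled difference rarely reaches the observed one, so the p-value falls well below the 0.05 threshold the slides use for significance.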