obsidian-yanxin/documents/academic/presentations/comp600_fall2015.md

---
category: academic
type: academic
person: Yanxin Lu
date: 2015-08
source: comp600_fall2015.pptx
---

# COMP 600 Fall 2015: Data Driven Program Completion

Yanxin Lu, Swarat Chaudhuri, Christopher Jermaine, Vijayaraghavan Murali. Presented by Yanxin Lu. 26 slides.

## Slide 2: Title
Data Driven Program Completion

## Slide 3: Programming is difficult

## Slide 4: Program Synthesis
- Automatically generating programs
- Specifications: logical formulas, unit tests, natural language
- Hard problem!

## Slide 5: Related work
- Deductive and solver-aided synthesis (IEEE Trans. Software Eng. 18(8), PLDI 2014)
- Constraint-based synthesis: syntax-guided synthesis (FMCAD 2013), Sketching (ASPLOS 2006), Template (STTT 15(5-6))
- Inductive synthesis: input-output examples (POPL 2011, PLDI 2015)

## Slide 6: Big data
- GitHub, SourceForge, Google Code, StackOverflow

## Slide 7: Summary
- Data-driven program completion, demo, corpus and Pliny database, synthesis algorithm, initial experiment and future work, program repair

## Slide 8: Program Completion
- A subset of C

## Slide 9: Demo

## Slides 10–11: Architecture
- Synthesis ↔ PDB (Pliny Database)
- Incomplete program → query → top-k similar programs → completed program

## Slide 12: PDB
- Thousands of programs with features
- Similarity metrics, fast top-k query
- 1–2 orders of magnitude faster than NoSQL database systems (Chris Jermaine)
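
The slides do not say which similarity metrics PDB uses. As a rough illustration of a top-k query over program features, the sketch below assumes Jaccard similarity over feature sets; the function names (`jaccard`, `top_k`) and the three-program corpus are hypothetical, and a linear scan like this is not how PDB achieves its speed.

```python
import heapq

def jaccard(a, b):
    """Jaccard similarity of two feature sets; 1.0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def top_k(query_features, corpus, k):
    """Return the k (score, name) pairs most similar to the query, best first."""
    scored = ((jaccard(query_features, feats), name) for name, feats in corpus.items())
    return heapq.nlargest(k, scored)

# Hypothetical corpus: program name -> set of (type, operator) coupling features.
corpus = {
    "binary_search": {("int", "c:/"), ("int", "c:+"), ("int*", "c:+")},
    "sum_array": {("int", "c:+"), ("int*", "c:+")},
    "negate": {("int", "c:unary-")},
}

query = {("int", "c:/"), ("int*", "c:+")}
result = top_k(query, corpus, 2)
print([name for _, name in result])
```

In this toy corpus the query shares two of three features with `binary_search` and one of three with `sum_array`, so those two are returned in that order.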

## Slide 13: Corpus
- More than 100,000 projects from GitHub, SourceForge, Google Code
- C, C++, Java
- Preprocessing: 50 GB of source code, 480+ projects, C

## Slide 14: Feature Extraction
- Lightweight program analysis capturing characteristics
- Abstract Structural Skeleton: (seq (loop (seq (cond ()))))
- Coupling: ('int', 'c:unary-'), ('int', 'c:/'), ('int*', 'c:+'), etc.
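
One plausible reading of the abstract structural skeleton is that it keeps only control-flow nesting and discards ordinary statements. The sketch below works under that assumption on a toy AST of `(kind, children)` tuples; the node kinds and the `skeleton` function are hypothetical and are not taken from the actual feature extractor.

```python
# Node kinds that survive in the skeleton; everything else is dropped.
CONTROL = {"seq", "loop", "cond"}

def skeleton(node):
    """Render the control-flow nesting of a (kind, children) tree as an S-expression."""
    kind, children = node
    kept = [skeleton(child) for child in children if child[0] in CONTROL]
    body = " ".join(kept) if kept else "()"
    return f"({kind} {body})"

# Toy program: a statement, then a loop whose body holds a statement and a conditional.
program = ("seq", [
    ("stmt", []),
    ("loop", [
        ("seq", [
            ("stmt", []),
            ("cond", [("stmt", [])]),
        ]),
    ]),
])

sk = skeleton(program)
print(sk)  # (seq (loop (seq (cond ()))))
```

The toy program was chosen so that its output matches the skeleton shown on the slide.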

## Slides 16–19: Synthesis Algorithm
- Finding similar programs from PDB
- Filling in the holes via search
- Variable renaming for undefined variables
- Unit testing to filter incorrect programs
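
The four steps above can be sketched as a small enumerate-and-test loop. This is a toy under stated assumptions, not the presented algorithm: Python stands in for the subset of C, the hole is written `??`, the donor expressions stand in for code taken from similar programs in PDB, and `rename_candidates` and `complete` are hypothetical names.

```python
import itertools
import re

IDENT = r"[A-Za-z_]\w*"

def rename_candidates(expr, scope):
    """Yield variants of expr with every name outside scope mapped to a scope variable."""
    unknown = sorted(set(re.findall(IDENT, expr)) - set(scope))
    for choice in itertools.product(scope, repeat=len(unknown)):
        mapping = dict(zip(unknown, choice))
        yield re.sub(IDENT, lambda m: mapping.get(m.group(0), m.group(0)), expr)

def complete(template, donor_exprs, scope, tests):
    """Fill the ?? hole with the first renamed donor expression that passes all tests."""
    for expr in donor_exprs:
        for candidate in rename_candidates(expr, scope):
            source = template.replace("??", candidate)
            env = {}
            try:
                exec(source, env)  # defines f in env
                if all(env["f"](*args) == expected for args, expected in tests):
                    return source
            except Exception:
                continue  # a crashing candidate is simply discarded
    return None

template = "def f(lo, hi):\n    return ??\n"
donor_exprs = ["a - b", "(l + r) // 2"]   # expressions from hypothetical similar programs
tests = [((0, 10), 5), ((2, 4), 3)]       # unit tests as (arguments, expected result)

completed = complete(template, donor_exprs, ["lo", "hi"], tests)
print(completed)
```

Every renaming of `a - b` fails the tests, so the search moves on and accepts `(lo + hi) // 2`. The number of renamings grows exponentially with the number of undefined variables, which is why pruning matters.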

## Slides 20–22: Heuristics
- Types: ignore expressions with incompatible types
- Context: ignore expressions with no common parents
- Huge search space reduction
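
A minimal sketch of the two pruning rules, assuming each candidate expression carries its type and the set of AST node kinds it appeared under. The `Expr` record and `prune` function are hypothetical, and type compatibility is simplified here to exact equality.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Expr:
    text: str
    ctype: str            # type of the expression, e.g. "int" or "int*"
    parents: frozenset    # kinds of AST nodes the expression appeared under

def prune(candidates, hole_type, hole_parents):
    """Keep candidates whose type matches the hole and that share a parent kind with it."""
    return [
        c for c in candidates
        if c.ctype == hole_type and c.parents & hole_parents
    ]

candidates = [
    Expr("(lo + hi) / 2", "int", frozenset({"loop"})),
    Expr("xs + 1", "int*", frozenset({"loop"})),   # rejected: incompatible type
    Expr("-x", "int", frozenset({"cond"})),        # rejected: no common parent
]

kept = prune(candidates, "int", {"loop", "seq"})
print([c.text for c in kept])
```

Both checks run before any candidate is substituted or tested, so rejected expressions never reach the more expensive unit-testing step.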

## Slide 23: Initial experiment and future work
- Binary search: less than 10 seconds
- Future work: more benchmark problems, performance improvements

## Slides 24–25: Program Repair
- Use PDB to find most similar correct program
- Bug localization → program completion problem
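
The slides do not describe how bugs are localized. The sketch below assumes the crudest possible approach: treat each line of the buggy program in turn as a hole and try to fill it with a line from the most similar correct program, keeping the first patch that passes the unit tests. Python again stands in for C, and `run` and `repair` are hypothetical names.

```python
def run(lines, tests):
    """Build f(lo, hi) from the body lines and check it against the unit tests."""
    source = "def f(lo, hi):\n" + "".join(f"    {line}\n" for line in lines)
    env = {}
    try:
        exec(source, env)
        return all(env["f"](*args) == expected for args, expected in tests)
    except Exception:
        return False  # programs that crash count as failing

def repair(buggy, donor, tests):
    """Return (line index, patched body) for the first single-line patch that passes."""
    for i in range(len(buggy)):          # candidate bug location, treated as a hole
        for replacement in donor:        # candidate fill from the similar program
            patched = buggy[:i] + [replacement] + buggy[i + 1:]
            if run(patched, tests):
                return i, patched
    return None

buggy = ["mid = (lo - hi) // 2", "return mid"]
donor = ["mid = (lo + hi) // 2", "return mid"]   # most similar correct program (assumed)
tests = [((0, 10), 5), ((2, 4), 3)]

fix = repair(buggy, donor, tests)
print(fix)
```

Once a location is fixed as the hole, what remains is exactly the completion problem from the earlier slides, which is the reduction the bullet describes.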

## Slide 26: Conclusion
- Program completion + program repair using big data + programming languages