- PhD defense slides (defense.key, Nov 2018) → phd_defense/
- Master's defense on MOOC peer evaluation (Dec 2014)
- ENGI 600 data-driven program repair (Apr 2015)
- COMP 600 data-driven program completion (Fall 2015, Spring 2016)
- COMP 600 Program Splicing presentation + feedback + response (Spring 2018)
- Program Splicing slides in .key and .pdf formats (Spring 2018)

Each file has a .md transcription with academic frontmatter. Skipped www2015.pdf (duplicate of existing www15.zip) and syncthing conflict copy.
| category | type | person | date | source |
|---|---|---|---|---|
| academic | academic | Yanxin Lu | 2018-04 | splicing_comp600_slides_2018.pdf |
Program Splicing — COMP 600 Slides (PDF)
Yanxin Lu, Swarat Chaudhuri, Christopher Jermaine, David Melski. Presented by Yanxin Lu. 31 slides.
PDF export of COMP 600 presentation on Program Splicing. This is an earlier version of the presentation (title slide says "Presented by Yanxin Lu"). See also splicing_comp600_2018.pdf for a slightly revised version with subtitle "Data-driven Program Synthesis".
Slide 2: Title
Program Splicing — Yanxin Lu, Swarat Chaudhuri, Christopher Jermaine, David Melski. Presented by Yanxin Lu.
Slide 3: Copying and Pasting
- Problem: developers search online, copy code, and adapt it, which is time-consuming and introduces bugs
Slide 4: Program Synthesis
- Automatically generating programs
- Specification: logic formulas, unit tests, natural language
- Correctness
Slide 5: Problem
Can we use program synthesis to help the process of copying and pasting?
Slide 6: Related work
- Sketch (FMCAD 2013) — cannot synthesize statements, does not use a code database
- Code Transplantation (ISSTA 2015) — not efficient, does not search for relevant code snippets
Slide 7: Program Splicing
- Use a large corpus of over 3.5 million programs
- Automate the process of copying and pasting
- Ensure correctness
Slide 8: Summary
- Architecture (corpus and Pliny database, synthesis algorithm)
- Experiment
- Conclusion
Slides 9–10: Architecture
- User provides draft program → Synthesis queries PDB → Top-k relevant programs → Completed program
Slide 11: PDB
- 3.5 million Java programs with features from GitHub, SourceForge
- Natural language terms: "read": 0.10976, "matrix": 0.65858, ...
- Similarity metrics, fast top-k query (1-2 orders of magnitude faster than NoSQL)
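A top-k query over weighted natural-language terms can be sketched as below. This is a minimal illustration, not the PDB implementation: the slides do not say which similarity metric PDB uses, so cosine similarity is an assumption, and the file names, the `cosine` and `top_k` helpers, and every weight other than the two quoted on the slide are hypothetical.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, corpus, k):
    """Return the names of the k corpus programs most similar to the query."""
    return heapq.nlargest(k, corpus, key=lambda name: cosine(query, corpus[name]))


# Hypothetical corpus: program name -> natural-language term weights.
corpus = {
    "MatrixReader.java": {"read": 0.10976, "matrix": 0.65858},
    "HttpServer.java": {"socket": 0.7, "read": 0.2},
    "Sorter.java": {"sort": 0.9},
}

# Hypothetical features extracted from the user's draft program.
query = {"matrix": 1.0, "read": 0.5}

result = top_k(query, corpus, 2)
print(result)
```

A linear scan like this would not scale to 3.5 million programs; the speed claim on the slide implies some form of indexing that is not described here.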
Slide 13: Relevant programs
- Draft program with holes + COMMENT/REQ specification → PDB returns similar programs
Slides 14–16: Filling in the holes
- Enumerative search: try candidate expressions from relevant programs
- Progressive selection of code fragments
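The enumeration step can be sketched as follows, assuming a draft with named holes and a list of candidate fragments per hole. The draft, the hole names, the candidate fragments, and the `enumerate_completions` helper are all hypothetical, and the sketch uses Python rather than Java for brevity. It enumerates every combination exhaustively and does not model the progressive selection mentioned above.

```python
import itertools

# Hypothetical draft with two named holes (one expression, one statement).
DRAFT = """\
def count_primes(n):
    sieve = [True] * (n + 1)
    count = 0
    for i in range(2, n + 1):
        if {HOLE_EXPR}:
            count += 1
            {HOLE_STMT}
    return count
"""

# Hypothetical fragments, as if extracted from the top-k relevant programs.
CANDIDATES = {
    "HOLE_EXPR": ["i % 2 == 0", "sieve[i]"],
    "HOLE_STMT": ["pass", "for j in range(i * i, n + 1, i): sieve[j] = False"],
}


def enumerate_completions(draft, candidates):
    """Yield one completed program per combination of candidate fragments."""
    names = list(candidates)
    for combo in itertools.product(*(candidates[name] for name in names)):
        yield draft.format(**dict(zip(names, combo)))


completions = list(enumerate_completions(DRAFT, CANDIDATES))
print(len(completions))
```

The number of completions is the product of the candidate counts per hole, which is why limiting candidates to fragments from relevant programs matters.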
Slide 17: Variable Renaming
- Resolve undefined variables by mapping from relevant program's variables
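One way to picture the renaming step is to enumerate one-to-one mappings from a fragment's undefined variables to the variables in scope in the draft. The sketch below assumes exactly that; the slides do not specify how mappings are chosen or ranked, and `rename`, `candidate_renamings`, and the example fragment are hypothetical.

```python
import itertools
import re


def rename(fragment, mapping):
    """Rename whole-word identifiers in one pass, so swaps like a<->b are safe."""
    if not mapping:
        return fragment
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda match: mapping[match.group(1)], fragment)


def candidate_renamings(fragment, undefined, in_scope):
    """Yield the fragment under every injective mapping undefined -> in_scope."""
    for targets in itertools.permutations(in_scope, len(undefined)):
        yield rename(fragment, dict(zip(undefined, targets)))


# Hypothetical fragment copied from a relevant program; `lo` and `hi`
# are not defined in the draft, which declares `left` and `right`.
fragment = "lo + (hi - lo) // 2"
renamed = list(candidate_renamings(fragment, ["lo", "hi"], ["left", "right"]))
print(renamed)
```

Each renamed fragment is only a candidate; a mapping that type-checks can still be wrong, so the results feed into the testing step.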
Slide 18: Testing
- Filter out incorrect programs using unit tests
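The filtering step can be sketched as running each candidate program against the user's unit tests and keeping only those that pass every one. The candidates, the tests, and the `passes_tests` helper below are hypothetical, and the sketch is in Python rather than Java. A candidate that raises an exception counts as failing; timeouts for non-terminating candidates are not handled here.

```python
def passes_tests(source, func_name, tests):
    """Return True if the program defines func_name and passes every test."""
    namespace = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, missing names, and runtime errors all count as failure.
        return False


# Hypothetical candidate programs produced by the search.
CANDIDATES = [
    "def count_primes(n):\n    return n // 2\n",
    "def count_primes(n):\n    return limit\n",
    (
        "def count_primes(n):\n"
        "    sieve = [True] * (n + 1)\n"
        "    count = 0\n"
        "    for i in range(2, n + 1):\n"
        "        if sieve[i]:\n"
        "            count += 1\n"
        "            for j in range(i * i, n + 1, i):\n"
        "                sieve[j] = False\n"
        "    return count\n"
    ),
]

# Hypothetical unit tests: (arguments, expected result).
TESTS = [((10,), 4), ((2,), 1), ((20,), 8)]

survivors = [src for src in CANDIDATES if passes_tests(src, "count_primes", TESTS)]
print(len(survivors))
```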
Slides 19–20: Benchmark
| Benchmark | Synthesis time (s) | LOC | Variables | Holes (expr-stmt) | Tests | μScalpel |
|---|---|---|---|---|---|---|
| Sieve Prime | 4.6 | 12-17 | 2 | 2-1 | 3 | 162.1 |
| Collision Detection | 4.2 | 10-15 | 2 | 2-1 | 4 | N/A |
| Collecting Files | 3.0 | 13-25 | 2 | 1-1 | 2 | timeout |
| Binary Search | 15.4 | 12-20 | 5 | 1-1 | 3 | timeout |
| HTTP Server | 41.1 | 24-45 | 6 | 1-2 | 2 | N/A |
| Prim's Distance Update | 61.1 | 53-58 | 11 | 1-1 | 4 | timeout |
| Quick Sort | 77.2 | 11-18 | 6 | 1-1 | 1 | timeout |
| CSV | 88.4 | 13-23 | 4 | 1-2 | 2 | timeout |
| Matrix Multiplication | 108.9 | 13-15 | 8 | 1-1 | 1 | timeout |
| Floyd Warshall | 110.4 | 9-12 | 7 | 1-1 | 7 | timeout |
| HTML Parsing | 140.4 | 20-34 | 5 | 1-2 | 2 | N/A |
| LCS | 161.5 | 29-36 | 10 | 0-1 | 1 | timeout |
Synthesis algorithm is efficient. No need to write many tests.
Slides 22–26: User study
- 12 graduate students and 6 professionals
- Web-based programming environment
- 4 programming problems (2 with splicing, 2 without)
- Internet search encouraged
- Results: splicing reduced time for algorithmic tasks (sieve, files)
- Sieve: deceptively simple; it appeared easy but was not
- Files/CSV: no standard solutions — splicing helps most
- HTML: good documentation was available, and tests were hard to write
Slide 27: Conclusion
- Data-driven program synthesis using large code corpus
- Enumerative search
- User study: good for tasks without standard solutions