---
category: academic
type: academic
person: Yanxin Lu
date: 2018-04
source: splicing_comp600_slides_2018.pdf
---

# Program Splicing: COMP 600 Slides (PDF)

Yanxin Lu, Swarat Chaudhuri, Christopher Jermaine, David Melski. Presented by Yanxin Lu. 31 slides.

PDF export of the COMP 600 presentation on Program Splicing. This is an earlier version of the presentation (the title slide says "Presented by Yanxin Lu"). See also splicing_comp600_2018.pdf for a slightly revised version with the subtitle "Data-driven Program Synthesis".

## Slide 2: Title

Program Splicing. Yanxin Lu, Swarat Chaudhuri, Christopher Jermaine, David Melski. Presented by Yanxin Lu.

## Slide 3: Copying and Pasting

- Problem: developers search online, copy code, and adapt it, which is time-consuming and introduces bugs

## Slide 4: Program Synthesis

- Automatically generating programs
- Specification: logic formula, unit testing, natural language
- Correctness

## Slide 5: Problem

Can we use program synthesis to help the process of copying and pasting?

## Slide 6: Related work

- Sketch (FMCAD 2013): cannot synthesize statements, does not use a code database
- Code Transplantation (ISSTA 2015): not efficient, does not search for relevant code snippets

## Slide 7: Program Splicing

- Use a large corpus of over 3.5 million programs
- Automate the process of copying and pasting
- Ensure correctness

## Slide 8: Summary

- Architecture (corpus and Pliny database, synthesis algorithm)
- Experiment
- Conclusion

## Slides 9–10: Architecture

- User provides draft program → Synthesis queries PDB → Top-k relevant programs → Completed program

## Slide 11: PDB

- 3.5 million Java programs with features, collected from GitHub and SourceForge
- Natural language terms: "read": 0.10976, "matrix": 0.65858, ...
- Similarity metrics, fast top-k query (1-2 orders of magnitude faster than NoSQL)

## Slide 13: Relevant programs

- Draft program with holes + COMMENT/REQ specification → PDB returns similar programs

## Slides 14–16: Filling in the holes

- Enumerative search: try candidate expressions from relevant programs
- Progressive selection of code fragments

## Slide 17: Variable Renaming

- Resolve undefined variables by mapping them from the relevant program's variables

## Slide 18: Testing

- Filter out incorrect programs using unit tests

## Slides 19–20: Benchmark

| Benchmark | Synthesis Time (s) | LOC | Var | Holes (expr-stmt) | Test | uScalpel |
|---|---|---|---|---|---|---|
| Sieve Prime | 4.6 | 12-17 | 2 | 2-1 | 3 | 162.1 |
| Collision Detection | 4.2 | 10-15 | 2 | 2-1 | 4 | N/A |
| Collecting Files | 3.0 | 13-25 | 2 | 1-1 | 2 | timeout |
| Binary Search | 15.4 | 12-20 | 5 | 1-1 | 3 | timeout |
| HTTP Server | 41.1 | 24-45 | 6 | 1-2 | 2 | N/A |
| Prim's Distance Update | 61.1 | 53-58 | 11 | 1-1 | 4 | timeout |
| Quick Sort | 77.2 | 11-18 | 6 | 1-1 | 1 | timeout |
| CSV | 88.4 | 13-23 | 4 | 1-2 | 2 | timeout |
| Matrix Multiplication | 108.9 | 13-15 | 8 | 1-1 | 1 | timeout |
| Floyd Warshall | 110.4 | 9-12 | 7 | 1-1 | 7 | timeout |
| HTML Parsing | 140.4 | 20-34 | 5 | 1-2 | 2 | N/A |
| LCS | 161.5 | 29-36 | 10 | 0-1 | 1 | timeout |

The synthesis algorithm is efficient. No need to write many tests.

## Slides 22–26: User study

- 12 graduate students and 6 professionals
- Web-based programming environment
- 4 programming problems (2 with splicing, 2 without)
- Internet search encouraged
- Results: splicing reduced time for algorithmic tasks (sieve, files)
- Sieve: deceptively simple; it appeared easy but was not
- Files/CSV: no standard solutions, so splicing helps most
- HTML: good documentation was available, and tests were hard to write

## Slide 27: Conclusion

- Data-driven program synthesis using a large code corpus
- Enumerative search
- User study: good for tasks without standard solutions
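
The hole-filling steps from Slides 14–18 (enumerative search over candidate expressions, variable renaming, and test-based filtering) can be illustrated with a small sketch. This is not the authors' implementation: the real system targets Java and draws candidates from PDB, whereas this toy version uses Python, a hard-coded candidate list, and a single expression hole. All names here (`HOLE`, `splice`, `free_names`, `rename`, `is_multiple`) are hypothetical and chosen only for illustration.

```python
import ast
import builtins
import itertools

HOLE = "__HOLE__"  # hypothetical marker for the hole in the draft program


def free_names(expr):
    """Return the variable names used in an expression, sorted."""
    tree = ast.parse(expr, mode="eval")
    return sorted({n.id for n in ast.walk(tree) if isinstance(n, ast.Name)})


def rename(expr, mapping):
    """Rewrite variable names in expr according to mapping (needs Python 3.9+)."""
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and node.id in mapping:
            node.id = mapping[node.id]
    return ast.unparse(tree)


def splice(draft, func_name, scope_vars, candidates, tests):
    """Try every candidate under every renaming; return the first
    completed program that passes all unit tests, or None."""
    for expr in candidates:
        # Names that are neither draft variables nor builtins are "undefined"
        # in the draft and must be mapped onto the draft's own variables.
        foreign = [v for v in free_names(expr)
                   if v not in scope_vars and not hasattr(builtins, v)]
        for target in itertools.product(scope_vars, repeat=len(foreign)):
            filled = rename(expr, dict(zip(foreign, target)))
            program = draft.replace(HOLE, filled)
            env = {}
            try:
                exec(program, env)
                if all(env[func_name](*args) == want for args, want in tests):
                    return program
            except Exception:
                continue  # a crashing candidate is filtered out like a failing one
    return None


# Draft program with one expression hole, plus unit tests as the specification.
draft = "def is_multiple(n, k):\n    return __HOLE__\n"
# Stand-ins for expressions extracted from relevant programs (note the foreign names a, b).
candidates = ["a + b", "a % b == 0"]
tests = [((6, 3), True), ((7, 3), False), ((3, 6), False)]

result = splice(draft, "is_multiple", ["n", "k"], candidates, tests)
print(result)
```

The search is exhaustive over candidates and renamings, so its cost grows quickly with the number of variables. The sketch does not model the progressive selection of code fragments mentioned on the slides.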