diff --git a/documents/academic/phd_defense/defense_slides.key b/documents/academic/phd_defense/defense_slides.key
new file mode 100644
index 0000000..95f1ea4
Binary files /dev/null and b/documents/academic/phd_defense/defense_slides.key differ
diff --git a/documents/academic/phd_defense/defense_slides.md b/documents/academic/phd_defense/defense_slides.md
new file mode 100644
index 0000000..632df0f
--- /dev/null
+++ b/documents/academic/phd_defense/defense_slides.md
@@ -0,0 +1,17 @@
+---
+category: academic
+type: academic
+person: Yanxin Lu
+date: 2018-11
+source: defense_slides.key
+---
+
+# PhD Thesis Defense Slides
+
+Keynote presentation for Yanxin Lu's PhD thesis defense at Rice University, November 2018.
+
+Topic: Program Splicing — Data-driven Program Synthesis
+
+The defense covers the same material as the PhD thesis: using a large corpus of programs (3.5 million from GitHub and SourceForge) to automatically synthesize code by splicing together relevant code fragments. The system uses the Pliny database (PDB) for efficient top-k retrieval of similar programs, enumerative search to fill in program holes, variable renaming to resolve undefined variables, and unit testing to filter out incorrect candidates. Benchmarks demonstrate efficient synthesis times (3–161 seconds) across problems like sieve prime, binary search, CSV parsing, matrix multiplication, and LCS. A user study with 12 graduate students and 6 professionals showed program splicing significantly reduced programming time, especially for algorithmic tasks and tasks without standard solutions.
+
+Note: The preview image shows only the title slide (blank/white). The full Keynote file contains the complete presentation.
diff --git a/documents/academic/presentations/codecomplete_spring2016.md b/documents/academic/presentations/codecomplete_spring2016.md
new file mode 100644
index 0000000..25776ab
--- /dev/null
+++ b/documents/academic/presentations/codecomplete_spring2016.md
@@ -0,0 +1,74 @@
+---
+category: academic
+type: academic
+person: Yanxin Lu
+date: 2016-01
+source: codecomplete_spring2016.pptx
+---
+
+# COMP 600 Spring 2016: Data Driven Program Completion
+
+Yanxin Lu, Swarat Chaudhuri, Christopher Jermaine, Drew Dehaas, Vineeth Kashyap, and David Melski. Presented by Yanxin Lu. 29 slides.
+
+## Slide 2: Title
+Data Driven Program Completion
+
+## Slide 3–4: Programming is difficult
+- Longest Common Subsequence example
+
+## Slide 5: Program Synthesis
+- Automatically generating programs
+- Specification: logic formula, unit testing, natural language
+
+## Slide 6: Related work
+- Deductive and solver-aided synthesis
+- Constraint-based synthesis: syntax-guided synthesis, Sketching, Template
+- Inductive synthesis: input-output examples
+
+## Slide 7: Big data
+- GitHub, SourceForge, Google Code, StackOverflow
+
+## Slide 8–9: Summary
+- Data-driven program completion, corpus and Pliny database, synthesis algorithm, initial experiment and future work
+
+## Slide 10–11: Program completion
+- Sketch + programs in DB + test cases
+- LCS example: LCS("123", "123") = "123", LCS("123", "234") = "23"
+
+## Slide 12–13: Workflow
+- Synthesis ↔ PDB
+- Incomplete program → query → programs → completed program
+
+## Slide 14: PDB
+- Thousands of programs with features, similarity metrics
+- Fast top-k query: 1-2 orders of magnitude faster than no-SQL systems
+
+## Slide 15: Corpus
+- 100,000+ projects, C/C++/Java
+- 50GB source code, 480+ C projects
+
+## Slide 16: Feature Extraction
+- Names: X, s, n, j, Y, index, lcs
+- TF/IDF: "charact": 0.158, "reduc": 0.158, "result": 0.316, "lc": 0.791, "index": 0.316
+
+## Slides 18–21: Synthesis Algorithm
+- Search PDB for similar programs
+- Fill holes via enumerative search
+- Merge undefined variables
+- Test to filter incorrect programs
+
+## Slides 22–24: Heuristics
+- Types: ignore incompatible types
+- Context: ignore expressions with no common parents
+- Huge search space reduction
+
+## Slides 25–26: Initial experiment and future work
+- LCS: less than 10 seconds
+- Future work: more benchmarks, closure, search PDB using types
+
+## Slides 27–28: Program repair
+- Use PDB to find most similar correct program
+- Bug localization → holes → completion
+
+## Slide 29: Conclusion
+- Program Completion: no more copy and paste, focus on important tasks
diff --git a/documents/academic/presentations/codecomplete_spring2016.pptx b/documents/academic/presentations/codecomplete_spring2016.pptx
new file mode 100644
index 0000000..3fb2e04
Binary files /dev/null and b/documents/academic/presentations/codecomplete_spring2016.pptx differ
diff --git a/documents/academic/presentations/comp600_fall2015.md b/documents/academic/presentations/comp600_fall2015.md
new file mode 100644
index 0000000..435b691
--- /dev/null
+++ b/documents/academic/presentations/comp600_fall2015.md
@@ -0,0 +1,78 @@
+---
+category: academic
+type: academic
+person: Yanxin Lu
+date: 2015-08
+source: comp600_fall2015.pptx
+---
+
+# COMP 600 Fall 2015: Data Driven Program Completion
+
+Yanxin Lu, Swarat Chaudhuri, Christopher Jermaine, Vijayaraghavan Murali. Presented by Yanxin Lu. 26 slides.
+
+## Slide 2: Title
+Data Driven Program Completion
+
+## Slide 3: Programming is difficult
+
+## Slide 4: Program Synthesis
+- Automatically generating programs
+- Specification: logic formula, unit testing, natural language
+- Hard problem!
+
+## Slide 5: Related work
+- Deductive and solver-aided synthesis (IEEE Trans. Software Eng. 18(8), PLDI 2014)
+- Constraint-based synthesis: syntax-guided synthesis (FMCAD 2013), Sketching (ASPLOS 2006), Template (STTT 15(5-6))
+- Inductive synthesis: input-output examples (POPL 2011, PLDI 2015)
+
+## Slide 6: Big data
+- GitHub, SourceForge, Google Code, StackOverflow
+
+## Slide 7: Summary
+- Data-driven program completion, demo, corpus and Pliny database, synthesis algorithm, initial experiment and future work, program repair
+
+## Slide 8: Program Completion
+- A subset of C
+
+## Slide 9: Demo
+
+## Slide 10–11: Architecture
+- Synthesis ↔ PDB (Pliny Database)
+- Incomplete program → query → top-k similar programs → completed program
+
+## Slide 12: PDB
+- Thousands of programs with features
+- Similarity metrics, fast top-k query
+- 1-2 orders of magnitude faster than no-SQL database systems (Chris Jermaine)
+
+## Slide 13: Corpus
+- More than 100,000 projects from GitHub, SourceForge, Google Code
+- C, C++, Java
+- Preprocessing: 50GB source code, 480+ projects, C
+
+## Slide 14: Feature Extraction
+- Lightweight program analysis capturing characteristics
+- Abstract Structural Skeleton: (seq (loop (seq (cond ()))))
+- Coupling: ('int', 'c:unary-'), ('int', 'c:/'), ('int*', 'c:+'), etc.
+
+## Slides 16–19: Synthesis Algorithm
+- Finding similar programs from PDB
+- Filling in the holes via search
+- Variable renaming for undefined variables
+- Unit testing to filter incorrect programs
+
+## Slides 20–22: Heuristics
+- Types: ignore expressions with incompatible types
+- Context: ignore expressions with no common parents
+- Huge search space reduction
+
+## Slide 23: Initial experiment and future work
+- Binary search: less than 10 seconds
+- Future work: more benchmark problems, performance increase
+
+## Slides 24–25: Program Repair
+- Use PDB to find most similar correct program
+- Bug localization → program completion problem
+
+## Slide 26: Conclusion
+- Program completion + program repair using big data + programming languages
diff --git a/documents/academic/presentations/comp600_fall2015.pptx b/documents/academic/presentations/comp600_fall2015.pptx
new file mode 100644
index 0000000..2bce853
Binary files /dev/null and b/documents/academic/presentations/comp600_fall2015.pptx differ
diff --git a/documents/academic/presentations/comp600_feedback_2018.md b/documents/academic/presentations/comp600_feedback_2018.md
new file mode 100644
index 0000000..58d75b4
--- /dev/null
+++ b/documents/academic/presentations/comp600_feedback_2018.md
@@ -0,0 +1,65 @@
+---
+category: academic
+type: academic
+person: Yanxin Lu
+date: 2018-03
+source: comp600_feedback_2018.pages
+---
+
+# COMP 600 Presentation Feedback
+
+Peer and instructor feedback on Yanxin Lu's COMP 600 presentation (Program Splicing), March 2018.
+
+Good pace for the motivation part which explained the problem really well.
+Stance is not good. Kept moving. Kept looking back to the screen.
+Not really smooth in the beginning of the talk.
+Related work is a little bit long which takes a lot of time.
+Energy is not enough when introducing program splicing.
+Good pace for the demo. But I was moving all the time. Need to stand still.
+
+Kept moving in the architecture slide, and kept looking back.
+Good gesture for demonstrating the KNN search.
+PDB went a little bit fast. Need more details.
+
+Enumerative search could be faster and don't need to show the process of enumeration.
+Kept moving all the time in the benchmark problems.
+Not very smooth in the benchmark problems.
+
+Too much text in the user study slide.
+
+Not smooth in the user study result slide, especially in the sieve problem.
+
+Stance is not good through out the talk.
+Went a little bit fast at the end of the talk because of time.
+
+**Best feature**
+1. PDB and related work could be shorter.
+2. The motivation, example makes the problem easy to understand.
+3. Mention the limit of the related work.
+4. Good demo
+
+1. Confident
+2. Good voice control and eye contact
+
+1. Tables were well explained.
+
+1. Good handling questions.
+
+**Message/organizations**
+1. Need more technical details.
+2. Source code license has to be covered
+3. Demo was not very useful while a few slides can do the work.
+4. Programming problems might not reflect the real-world improvements.
+5. Explain more on the experiments.
+6. Need statistically significance and power for the hypothesis test. Too few samples.
+7. Not clear what the contribution is
+8. Not mention limitations of the work.
+9. The example in the demo might not be a good one.
+
+**Delivery**
+1. Keep his stance
+2. Not showing enough passion
+3. Louder
+
+**Visuals**
+1. Architecture flowchart could be made better
diff --git a/documents/academic/presentations/comp600_feedback_2018.pages b/documents/academic/presentations/comp600_feedback_2018.pages
new file mode 100644
index 0000000..4408104
Binary files /dev/null and b/documents/academic/presentations/comp600_feedback_2018.pages differ
diff --git a/documents/academic/presentations/comp600_response_2018.md b/documents/academic/presentations/comp600_response_2018.md
new file mode 100644
index 0000000..ae02959
--- /dev/null
+++ b/documents/academic/presentations/comp600_response_2018.md
@@ -0,0 +1,21 @@
+---
+category: academic
+type: academic
+person: Yanxin Lu
+date: 2018-04
+source: comp600_response_2018.pages
+---
+
+# COMP 600 Response
+
+Yanxin Lu
+COMP 600 Response
+Monday, April 2, 2018
+
+The best moment in this presentation was from the motivation section to the demo section. I talked slowly and paused at important points. This created a sense of emphasis on some very important point I wanted to make. In addition, by talking slowly, the audience was able to understand the motivation and the work I have been doing, even for the people that do not have any background knowledge. The demo also helped people understand what the tool does and it also drew a lot of attention.
+
+After I viewed the video and peer's reviews, I was surprised that my stance look very awkward and I did not realize this at all during the presentation. I kept making unnecessary moves and looked not very serious. Another thing that surprises me is that people complains about me not providing enough technical details. I thought that the technical details were given enough, but that did not seem to be the case.
+
+One of my greatest strength is the ability to motivate the talk and explain a complex problem very clearly. I used a very easy example throughout the talk and the audience was able to understand the talk through the example easily. Another strength is the ability to explain the data. I highlighted some important points when explaining the data, because most of the time it is hard to understand what data implies without any guidance. The third strength is the delivery. I showed confidence by talking slowly and making good eye contact.
+
+The thing I need to improve is stance. Moving too much not only looks awkward, but it also creates an impression of not being serious and lack of authority. Standing still also makes me look more confident. The second area I need to improve is handling questions and the ability to control the situation. Sometimes I had hard time understanding people and could have done better on controlling the situation when people are having discussions among themselves. To actually improve in those areas, I will attend talks, focus on how good presenters do in those areas and try to learn from them.
diff --git a/documents/academic/presentations/comp600_response_2018.pages b/documents/academic/presentations/comp600_response_2018.pages
new file mode 100644
index 0000000..824fc7c
Binary files /dev/null and b/documents/academic/presentations/comp600_response_2018.pages differ
diff --git a/documents/academic/presentations/engi600_2015.md b/documents/academic/presentations/engi600_2015.md
new file mode 100644
index 0000000..c4c4803
--- /dev/null
+++ b/documents/academic/presentations/engi600_2015.md
@@ -0,0 +1,55 @@
+---
+category: academic
+type: academic
+person: Yanxin Lu
+date: 2015-04
+source: engi600_2015.pptx
+---
+
+# ENGI 600: Data-Driven Program Repair
+
+Yanxin Lu, Swarat Chaudhuri, Christopher Jermaine, Joe Warren, and Scott Rixner. 12 slides.
+
+## Slide 1: Title
+Data-Driven Program Repair
+
+## Slide 2: Debugging is difficult
+
+## Slide 3: Related work
+- Talus, tutoring system (Murray, 1986) — reference program, program analysis
+- Mutation (Debroy and Wong, 2010) — predefined rules for mutating programs
+
+## Slide 4: Data-driven program repair
+- Code database for evaluatePoly
+- Incorrect Program → Correct Program
+
+## Slide 5: EvaluatePoly
+- A program which evaluates polynomials
+- Poly: a list of coefficients
+- X: the x value in the polynomial
+
+## Slide 6: Similar and correct implementations
+- Distance between programs
+- Incorrect Program → Code Database → Correct Programs → Template Generation
+
+## Slide 7: Template Generation
+- Find differences and replace them with holes
+- Ignore variable names
+
+## Slide 8: Filling in the Holes
+- Search for ways to replace holes
+- Variable Renaming
+
+## Slide 9: Variable Renaming
+- Rename variables in the good program
+
+## Slide 10: Unit Testing
+- Filter all incorrect fixes using unit testing
+- If multiple correct fixes, choose the most similar one
+
+## Slide 11: Experiment
+
+## Slide 12: Conclusion
+- Data-driven program repair
+- Effective in fixing small incorrect programs
+- Computer science education — same mistakes
diff --git a/documents/academic/presentations/engi600_2015.pptx b/documents/academic/presentations/engi600_2015.pptx
new file mode 100644
index 0000000..6f7ef78
Binary files /dev/null and b/documents/academic/presentations/engi600_2015.pptx differ
diff --git a/documents/academic/presentations/master_defense_2014.md b/documents/academic/presentations/master_defense_2014.md
new file mode 100644
index 0000000..0d2eaee
--- /dev/null
+++ b/documents/academic/presentations/master_defense_2014.md
@@ -0,0 +1,102 @@
+---
+category: academic
+type: academic
+person: Yanxin Lu
+date: 2014-12
+source: master_defense_2014.pptx
+---
+
+# Master's Thesis Defense: Improving Peer Evaluation Quality in MOOCs
+
+Yanxin Lu, December 2014. 40 slides.
+
+## Slide 2: Title
+Improving Peer Evaluation Quality in MOOCs — Yanxin Lu, December 2014
+
+## Slide 3–4: Summary
+- Motivations and Problems
+- Experiment
+- Statistical Analysis
+- Results
+- Conclusion
+
+## Slide 5: What is MOOC?
+
+## Slide 6: Intro to Interactive Programming in Python
+- Coursera course, 120,000 enrolled, 7,500 completed
+
+## Slide 7–8: Example Assignments
+- Stopwatch
+- Memory game
+
+## Slide 9: Grading Rubric for Stopwatch
+- 1 pt: Program successfully opens a frame with the stopwatch stopped
+- 2 pts: Program correctly draws number of successful stops at whole second vs total stops
+
+## Slide 10: Peer Grading
+- Example scores: 1, 9, 9, 9, 10 → Score = 9
+
+## Slide 11: Quality is Highly Variable
+- Lack of effort
+- Small bugs require more effort
+
+## Slide 12: Solution
+A web application where students can:
+- Look at other peer evaluations
+- Grade other peer evaluations
+
+## Slide 13: Findings
+- Grading evaluation has the strongest effect
+- The knowledge that one's own peer evaluation will be examined does not
+- Strong effect on peer evaluation quality simply because students know they are being studied
+
+## Slide 15: Experiment Summary
+- Sign up → Stopwatch → Memory
+
+## Slide 16: Sign up
+- Web consent form, three groups, prize
+- Nothing about specific study goals or what was being measured
+- 3,015 students
+
+## Slide 17: Three Groups
+- G1: Full treatment, grading + viewing
+- G2: Only viewing
+- G3: Control group
+- Size ratio G1:G2:G3 = 8:1:1
+
+## Slides 18–24: Experiment Phases
+- Submission Phase: Submit programs before deadline
+- Evaluation Phase: 1 self evaluation + 5 peer evaluations per rubric item (score + optional comment)
+- Grading Evaluation Phase (G1): Web app, per evaluation × rubric item → Good/Neutral/Bad
+- Viewing Phase (G1, G2): See number of good/neutral/bad ratings and their own evaluation
+
+## Slide 25: Statistics
+- Most evaluations are graded three times
+
+## Slide 27: Goal
+- Whether G1 does better grading compared to G2, G3 or both
+- Measuring quality: correct scores, comment length
+- Reject a set of null hypotheses
+
+## Slide 28: Bootstrapping
+- Simulation-based method using resampling with replacement
+- Statistically significant: p-value <= 0.05
+
+## Slide 30: Terms
+- Good programs: correct (machine grader verified)
+- Bad programs: incorrect
+- Bad job: incorrect grade OR no comment
+- Really bad job: incorrect grade AND no comment
+
+## Slides 31–38: Results
+Hypothesis tests on comment length, "bad job" fraction, and "really bad job" fraction across groups on good and bad programs.
+
+## Slide 39: Findings
+- Grading evaluation has the strongest positive effect
+- The knowledge that one's own peer evaluation will be examined does not
+- Strong Hawthorne effect: improvement simply from knowing they are being studied
+
+## Slide 40: Conclusion
+- A web application for peer evaluation assessment
+- Study has positive effect on quality of peer evaluations
+- Implications beyond peer evaluations
diff --git a/documents/academic/presentations/master_defense_2014.pptx b/documents/academic/presentations/master_defense_2014.pptx
new file mode 100644
index 0000000..06fbe34
Binary files /dev/null and b/documents/academic/presentations/master_defense_2014.pptx differ
diff --git a/documents/academic/presentations/splicing_comp600_2018.key b/documents/academic/presentations/splicing_comp600_2018.key
new file mode 100644
index 0000000..5a96bbd
Binary files /dev/null and b/documents/academic/presentations/splicing_comp600_2018.key differ
diff --git a/documents/academic/presentations/splicing_comp600_2018.md b/documents/academic/presentations/splicing_comp600_2018.md
new file mode 100644
index 0000000..ef76fa3
--- /dev/null
+++ b/documents/academic/presentations/splicing_comp600_2018.md
@@ -0,0 +1,31 @@
+---
+category: academic
+type: academic
+person: Yanxin Lu
+date: 2018-05
+source: splicing_comp600_2018.key
+---
+
+# Program Splicing — COMP 600 Spring 2018 (Keynote)
+
+Yanxin Lu, Swarat Chaudhuri, Christopher Jermaine, David Melski. Keynote presentation, Spring 2018.
+
+Source Keynote file for the Program Splicing COMP 600 presentation. The PDF export is available as splicing_comp600_2018.pdf.
+
+The presentation covers the same content as splicing_comp600_slides_2018.pdf but is a slightly revised version with subtitle "Data-driven Program Synthesis" on the title slide:
+
+1. Copying and Pasting problem — time consuming and introduces bugs
+2. Program Synthesis — automatically generating programs from specifications
+3. Problem — can we use program synthesis to improve copy-paste?
+4. Related work — Sketching (PLDI 2005), Code Transplantation (ISSTA 2015)
+5. Program Splicing approach — automate copying/pasting using 3.5M program corpus, ensure correctness
+6. Architecture — draft program → Synthesis ↔ PDB → completed program
+7. PDB — 3.5M Java programs, natural language features, similarity metrics, KNN search, fast top-k query
+8. Relevant programs — query PDB with draft program to find similar implementations
+9. Filling holes — enumerative search over candidate expressions from relevant programs
+10. Variable renaming — resolve undefined variables
+11. Testing — filter incorrect candidates via unit tests
+12. Heuristics — type and context-based pruning for search space reduction
+13. Benchmark — 12 programs, synthesis times 3–161 seconds, efficient algorithm
+14. User study — 18 participants (12 grad students + 6 professionals), 4 problems, splicing most helpful for algorithmic tasks and tasks without standard solutions
+15. Conclusion — data-driven synthesis with large corpus, enumerative search, efficient algorithm, fast code reuse
diff --git a/documents/academic/presentations/splicing_comp600_2018.pdf b/documents/academic/presentations/splicing_comp600_2018.pdf
new file mode 100644
index 0000000..db5cf3a
Binary files /dev/null and b/documents/academic/presentations/splicing_comp600_2018.pdf differ
diff --git a/documents/academic/presentations/splicing_comp600_2018_pdf.md b/documents/academic/presentations/splicing_comp600_2018_pdf.md
new file mode 100644
index 0000000..668e3e8
--- /dev/null
+++ b/documents/academic/presentations/splicing_comp600_2018_pdf.md
@@ -0,0 +1,64 @@
+---
+category: academic
+type: academic
+person: Yanxin Lu
+date: 2018-05
+source: splicing_comp600_2018.pdf
+---
+
+# Program Splicing — COMP 600 Spring 2018 (PDF Export)
+
+Yanxin Lu, Swarat Chaudhuri, Christopher Jermaine, David Melski. 31 slides.
+
+PDF export of the Keynote presentation splicing_comp600_2018.key. Title: "Program Splicing: Data-driven Program Synthesis".
+
+This is a revised version of the earlier splicing_comp600_slides_2018.pdf. Key differences:
+- Title slide has subtitle "Data-driven Program Synthesis" (vs just "Presented by Yanxin Lu")
+- Adds "Efficient relevant code retrieval" and "KNN search" to PDB slide
+- Adds "Programming time" to user study setup
+- User study result slides titled differently: "Deceptively simple", "No standard solutions", "Good documentations and tests were hard to write"
+- Conclusion adds "Efficient algorithm", "Fast code reuse", "Easy to test", "Future work: synthesis algorithm improvement"
+
+## Slide 2: Title
+Program Splicing: Data-driven Program Synthesis
+
+## Slides 3–7: Motivation and Approach
+- Copying and pasting is time consuming and introduces bugs
+- Program synthesis: automatically generate programs from specifications
+- Problem: can we use program synthesis to improve copying and pasting?
+- Related work: Sketching (PLDI 2005), Code Transplantation (ISSTA 2015)
+- Program Splicing: automate process, large corpus (3.5M programs), ensure correctness
+
+## Slide 8: Demo
+- How does a programmer use program splicing?
+
+## Slides 9–12: Architecture
+- User → draft program → Synthesis ↔ PDB → completed program
+- PDB: efficient relevant code retrieval, 3.5M Java programs, NL features, similarity metrics, KNN search, fast top-k query
+
+## Slides 13–18: Synthesis Algorithm
+- Find relevant programs from PDB
+- Fill holes via enumerative search
+- Variable renaming for undefined variables
+- Testing to filter incorrect programs
+
+## Slides 19–20: Benchmark
+Same benchmark table as the earlier version. Efficient synthesis algorithm highlighted.
+
+## Slide 21: No need to write many tests
+
+## Slides 22–26: User study
+- 18 participants, 4 problems, programming time measured
+- Sieve: deceptively simple
+- Files/CSV: no standard solutions — splicing most helpful
+- HTML: good documentation and tests were hard to write
+
+## Slide 27: Conclusion
+- Program Splicing: large code corpus, enumerative search, efficient algorithm
+- Fast code reuse: no standard solutions, easy to test
+- Future work: synthesis algorithm improvement
+
+## Slides 29–31: Appendix (Heuristics)
+- Type-based pruning: ignore incompatible types
+- Context-based pruning: ignore expressions with no common parents
+- Huge search space reduction
diff --git a/documents/academic/presentations/splicing_comp600_slides_2018.md b/documents/academic/presentations/splicing_comp600_slides_2018.md
new file mode 100644
index 0000000..5277c18
--- /dev/null
+++ b/documents/academic/presentations/splicing_comp600_slides_2018.md
@@ -0,0 +1,95 @@
+---
+category: academic
+type: academic
+person: Yanxin Lu
+date: 2018-04
+source: splicing_comp600_slides_2018.pdf
+---
+
+# Program Splicing — COMP 600 Slides (PDF)
+
+Yanxin Lu, Swarat Chaudhuri, Christopher Jermaine, David Melski. Presented by Yanxin Lu. 31 slides.
+
+PDF export of COMP 600 presentation on Program Splicing. This is an earlier version of the presentation (title slide says "Presented by Yanxin Lu"). See also splicing_comp600_2018.pdf for a slightly revised version with subtitle "Data-driven Program Synthesis".
+
+## Slide 2: Title
+Program Splicing — Yanxin Lu, Swarat Chaudhuri, Christopher Jermaine, David Melski. Presented by Yanxin Lu.
+
+## Slide 3: Copying and Pasting
+- Problem: developers search online, copy code, adapt it — time consuming and bugs introduced
+
+## Slide 4: Program Synthesis
+- Automatically generating programs
+- Specification: logic formula, unit testing, natural language
+- Correctness
+
+## Slide 5: Problem
+Can we use program synthesis to help the process of copying and pasting?
+
+## Slide 6: Related work
+- Sketch (FMCAD 2013) — cannot synthesize statements, does not use a code database
+- Code Transplantation (ISSTA 2015) — not efficient, does not search for relevant code snippets
+
+## Slide 7: Program Splicing
+- Use a large corpus of over 3.5 million programs
+- Automate the process of copying and pasting
+- Ensure correctness
+
+## Slide 8: Summary
+- Architecture (corpus and Pliny database, synthesis algorithm)
+- Experiment
+- Conclusion
+
+## Slide 9–10: Architecture
+- User provides draft program → Synthesis queries PDB → Top-k relevant programs → Completed program
+
+## Slide 11: PDB
+- 3.5 million Java programs with features from GitHub, SourceForge
+- Natural language terms: "read": 0.10976, "matrix": 0.65858, ...
+- Similarity metrics, fast top-k query (1-2 orders of magnitude faster than no-SQL)
+
+## Slide 13: Relevant programs
+- Draft program with holes + COMMENT/REQ specification → PDB returns similar programs
+
+## Slides 14–16: Filling in the holes
+- Enumerative search: try candidate expressions from relevant programs
+- Progressive selection of code fragments
+
+## Slide 17: Variable Renaming
+- Resolve undefined variables by mapping from relevant program's variables
+
+## Slide 18: Testing
+- Filter out incorrect programs using unit tests
+
+## Slide 19–20: Benchmark
+| Benchmark | Synthesis Time (s) | LOC | Var | Holes (expr-stmt) | Test | uScalpel |
+|---|---|---|---|---|---|---|
+| Sieve Prime | 4.6 | 12-17 | 2 | 2-1 | 3 | 162.1 |
+| Collision Detection | 4.2 | 10-15 | 2 | 2-1 | 4 | N/A |
+| Collecting Files | 3.0 | 13-25 | 2 | 1-1 | 2 | timeout |
+| Binary Search | 15.4 | 12-20 | 5 | 1-1 | 3 | timeout |
+| HTTP Server | 41.1 | 24-45 | 6 | 1-2 | 2 | N/A |
+| Prim's Distance Update | 61.1 | 53-58 | 11 | 1-1 | 4 | timeout |
+| Quick Sort | 77.2 | 11-18 | 6 | 1-1 | 1 | timeout |
+| CSV | 88.4 | 13-23 | 4 | 1-2 | 2 | timeout |
+| Matrix Multiplication | 108.9 | 13-15 | 8 | 1-1 | 1 | timeout |
+| Floyd Warshall | 110.4 | 9-12 | 7 | 1-1 | 7 | timeout |
+| HTML Parsing | 140.4 | 20-34 | 5 | 1-2 | 2 | N/A |
+| LCS | 161.5 | 29-36 | 10 | 0-1 | 1 | timeout |
+
+Synthesis algorithm is efficient. No need to write many tests.
+
+## Slides 22–26: User study
+- 12 graduate students and 6 professionals
+- Web-based programming environment
+- 4 programming problems (2 with splicing, 2 without)
+- Internet search encouraged
+- Results: splicing reduced time for algorithmic tasks (sieve, files)
+- Sieve: appears simple but was not (deceptively simple)
+- Files/CSV: no standard solutions — splicing helps most
+- HTML: good documentation and tests were hard to write
+
+## Slide 27: Conclusion
+- Data-driven program synthesis using large code corpus
+- Enumerative search
+- User study: good for tasks without standard solutions
diff --git a/documents/academic/presentations/splicing_comp600_slides_2018.pdf b/documents/academic/presentations/splicing_comp600_slides_2018.pdf
new file mode 100644
index 0000000..9e227ee
Binary files /dev/null and b/documents/academic/presentations/splicing_comp600_slides_2018.pdf differ