Categories now match folder names (15 canonical values). Types normalized to 25 canonical values per VAULT_MAP.md spec. Context-aware mapping: W-2s→tax-form, lease files→lease, vet records→vet, etc.
| type | category | person | date | source |
|---|---|---|---|---|
| academic | academic | Yanxin Lu | 2019 | thesis_final.pdf |
RICE UNIVERSITY
Corpus-Driven Systems for Program Synthesis and Refactoring
by
Yanxin Lu
A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree
Doctor of Philosophy
Approved, Thesis Committee:
Swarat Chaudhuri, Chair, Associate Professor of Computer Science [Signature]
Christopher Jermaine, Professor of Computer Science [Signature]
Ankit B. Patel, Assistant Professor of Electrical and Computer Engineering [Signature]
Houston, Texas
April, 2019
Abstract
Software development is a difficult task. Programmers need to work with many small components in large software projects, which typically contain many thousands of lines of code. To make software development manageable, developers and researchers have deployed various programming systems and tools. These include tools that facilitate refactoring existing source code and even generate programs automatically. One problem with traditional program synthesis tools is that they cannot generate practical results when given large specifications, due to the high complexity of the underlying problem. Furthermore, existing refactoring systems can only refactor individual components separately and fail to instantiate complete programs. To overcome these problems, we can learn useful patterns and idioms from large code corpora using machine learning techniques. Researchers have used "big code" to develop novel and practical programming tools such as Bayou [1] and JSNice [2]. In this thesis, we present two data-driven programming systems for software reuse and refactoring.
We first introduce program splicing, a programming methodology that aims to automate the workflow of copying, pasting, and modifying code available online. Here, the programmer starts by writing a "draft" that mixes unfinished code, natural language comments, and correctness requirements. A program synthesizer that interacts with a large, searchable database of program snippets is used to automatically complete the draft into a program that meets the requirements. Our evaluation uses the system in a suite of everyday programming tasks and includes a comparison with a state-of-the-art competing approach as well as a user study. The results point to the broad scope and scalability of program splicing and indicate that the approach can significantly boost programmer productivity.
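As a rough illustration of the draft, search, and complete workflow described above, the following toy sketch may help. It is not the thesis system: the snippet database `SNIPPETS`, the keyword-overlap ranking in `search`, and the `splice` helper are all hypothetical names and simplifications introduced here. The "draft" is the `read_matrix` function with a missing row parser, the natural-language comment drives the search, and the correctness requirement is a small test that each candidate must pass.

```python
from typing import Callable, List, Optional

# Hypothetical snippet database: each entry has keyword tags and a candidate body.
SNIPPETS: List[dict] = [
    {"tags": {"csv", "split"}, "code": lambda line: line.split(",")},
    {"tags": {"csv", "matrix", "parse", "int"},
     "code": lambda line: [int(x) for x in line.split(",")]},
    {"tags": {"whitespace", "int"},
     "code": lambda line: [int(x) for x in line.split()]},
]


def search(comment: str) -> List[Callable]:
    """Rank snippets by keyword overlap with the comment (an assumed heuristic)."""
    words = set(comment.lower().split())
    scored = [(len(s["tags"] & words), i, s["code"]) for i, s in enumerate(SNIPPETS)]
    ranked = sorted(scored, key=lambda t: (-t[0], t[1]))
    return [code for score, _, code in ranked if score > 0]


def splice(comment: str, requirement: Callable[[Callable], bool]) -> Optional[Callable]:
    """Return the first retrieved snippet that satisfies the correctness requirement."""
    for candidate in search(comment):
        try:
            if requirement(candidate):
                return candidate
        except Exception:
            continue  # a candidate that crashes on the requirement is rejected
    return None


def read_matrix(text: str, parse_row: Callable) -> list:
    """The draft: complete except for the row parser, which is the hole."""
    return [parse_row(line) for line in text.strip().splitlines()]


# Correctness requirement supplied by the programmer alongside the draft.
requirement = lambda parse_row: read_matrix("1,2\n3,4", parse_row) == [[1, 2], [3, 4]]

parse_row = splice("parse a csv line into int values", requirement)
matrix = read_matrix("5,6\n7,8", parse_row)
```

In this sketch the requirement does the filtering: the first snippet in the database matches the comment but returns strings, so it would fail the check if the ranking ever tried it.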
Next, we propose an algorithm that automates the process of API refactoring, where the goal is to rewrite an API call sequence into another sequence that only uses the API calls defined in the target library without modifying the functionality. We solve the problem of API refactoring by combining the techniques of API translation and API sequence synthesis. Specifically, we first translate original API calls into a set of new API calls defined in the target library. Then we use an API synthesizer to generate a complete program that uses the translated API calls. We evaluated our algorithm on a diverse set of benchmark problems, and our algorithm can refactor API sequences with high accuracy.
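The two-step structure described above (translate the calls, then synthesize a valid sequence) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the algorithm from the thesis: the `old.*` and `new.*` call names, the `TRANSLATION` lookup table, and the state-based `TARGET_API` signatures are hypothetical, and a brute-force search over orderings stands in for the actual API synthesizer.

```python
from itertools import permutations
from typing import Dict, List, Optional, Tuple

# Hypothetical translation table: source call -> candidate target-library calls.
TRANSLATION: Dict[str, List[str]] = {
    "old.open": ["new.connect"],
    "old.read_all": ["new.fetch"],
    "old.close": ["new.disconnect"],
}

# Hypothetical target-library signatures: call -> (required state, resulting state).
TARGET_API: Dict[str, Tuple[str, str]] = {
    "new.connect": ("closed", "open"),
    "new.fetch": ("open", "open"),
    "new.disconnect": ("open", "closed"),
}


def translate(source_calls: List[str]) -> List[str]:
    """Step 1: map the source calls to an unordered set of target calls."""
    translated = set()
    for call in source_calls:
        translated.update(TRANSLATION.get(call, []))
    return sorted(translated)  # sorted only to make the search deterministic


def synthesize(calls: List[str], start: str = "closed",
               goal: str = "closed") -> Optional[List[str]]:
    """Step 2: find an ordering whose state requirements chain from start to goal."""
    for order in permutations(calls):
        state = start
        for call in order:
            required, produced = TARGET_API[call]
            if state != required:
                break  # this ordering violates a call's precondition
            state = produced
        else:
            if state == goal:
                return list(order)
    return None


def refactor(source_calls: List[str]) -> Optional[List[str]]:
    """Translate, then synthesize a complete target-library call sequence."""
    return synthesize(translate(source_calls))


result = refactor(["old.open", "old.read_all", "old.close"])
```

Separating the steps this way means the translation only has to identify which target calls are relevant, while the synthesizer is responsible for ordering them into a sequence the target library accepts.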
Although the evaluation results for the techniques presented in this thesis are quite optimistic, we believe that there is room for improvement by using a more sophisticated language model and a more advanced search algorithm for program splicing. To improve our API refactoring method, one can train statistical models on existing pairs of API call sequences. Beyond these potential improvements, many problems related to "big code" remain, and the potential of data-driven methods to help programming is enormous.
Contents
- Abstract
- List of Illustrations
- List of Tables
1 Introduction
- 1.1 Program reuse via splicing
- 1.2 API refactoring using natural language and API synthesizer
- 1.3 Summary
2 Program Splicing
- 2.1 Introduction
- 2.2 Motivating Examples
  - 2.2.1 Reading a Matrix from a CSV File
  - 2.2.2 Face Detection using OpenCV
- 2.3 Problem formulation
- 2.4 Method
  - 2.4.1 Searching for programs
  - 2.4.2 Program completion
- 2.5 Evaluation
  - 2.5.1 Benchmarks
  - 2.5.2 Experiments
- 2.6 Summary
3 API Refactoring
- 3.1 Introduction
- 3.2 Motivating Examples
- 3.3 Problem Definition
- 3.4 Method
  - 3.4.1 API Translation
  - 3.4.2 API Call Sequence Synthesis
- 3.5 Evaluation
  - 3.5.1 Benchmarks
  - 3.5.2 Experiments
  - 3.5.3 Limitations
- 3.6 Summary
4 Related Work
- 4.1 Program Synthesis and Reuse
- 4.2 Data-driven Program Synthesis
- 4.3 Code Search
- 4.4 API Refactoring and Translation
5 Conclusion and Future Work
Bibliography
Note: This is the final signed version of the thesis (1.6MB). The full thesis contains 95+ pages of technical content including figures, tables, algorithms, code examples, experimental results, and bibliography. The complete content is preserved in the PDF.