---
type: academic
category: academic
person: Yanxin Lu
date: 2018
source: writing.tex
---

# Thesis Introduction - LaTeX Source (writing.tex)

This is the LaTeX source file for the thesis introduction chapter. The compiled PDF version is available as `lu_writing.pdf`.

```latex
\chapter{Introduction}
\label{ch:intro}

With advances in technologies such as artificial intelligence and the expansion of high-tech companies, computer programming has become an important skill, and the demand for programmers has grown dramatically in the past few years. Overall productivity has been boosted significantly by the increasing number of programmers, but we have not yet witnessed a comparable boost in individual programming productivity. The most important reason is that programming is a difficult task. It requires programmers to deal with extremely low-level details in complex software projects, and it is almost inevitable that programmers make small mistakes. People tend to assume that a piece of untested software does not function properly. To address this problem, software engineering techniques and formal-methods-based techniques have been proposed to facilitate programming. These techniques include various software engineering methodologies, design patterns, sophisticated testing methods, program repair algorithms, model checking algorithms and program synthesis methods. Some techniques, such as software engineering methodologies, design patterns and unit testing, have proven practical and useful in boosting programming productivity, and the industry has been adopting them for more than a decade. The main reason for their popularity and longevity is that these techniques are quite easy for average programmers to execute. However, one dominant problem with these software engineering approaches is that they are not rigorous enough. If the specification of a method is not followed strictly, its benefits tend to be diminished.
More advanced methods with additional rules have been proposed, but their specifications tend to be vague, which makes them difficult to execute. Some researchers have therefore turned their attention to applying formal methods to tackle the difficulties in programming. Methods such as model checking and program synthesis are much more rigorous than traditional software engineering techniques, and their benefits are guaranteed once all preconditions are met. However, the impact of these formal-methods techniques has been much smaller than the influence of software engineering techniques. The reason is that a formal-methods-based approach is likely to fail on large inputs: it may not terminate or produce any useful result because of its large search space. These large search spaces are inevitable, since formal-methods techniques typically deal with problems that are extremely complex in theory. However, people have been trying to make formal-methods approaches practical by introducing additional hints~\cite{Srivastava2012} or by restricting the problem domain~\cite{Gulwani2011spreadsheet, Gulwani2011, Gulwani2010}. With the advent of ``big data'', researchers started to pay attention to problems that were previously considered difficult or impossible, and this has led to significant advances in machine learning. Similarly, as open source repositories such as \verb|Google Code|, \verb|Github| and \verb|SourceForge| have come online, making thousands of software projects and their source code available, researchers in the programming language community have also started using ``big code'' to tackle problems once considered intractable. With the help of ``big code'', many new techniques that use formal methods and aim to facilitate programming have been proposed.
These techniques include program property prediction~\cite{mishne12, Raychev2015}, API sequence prediction~\cite{Raychev2014, murali2017neural, murali2017bayesian} and small program generation~\cite{balog2016deepcoder}. Researchers have shown that using data can indeed make the synthesis problem feasible~\cite{balog2016deepcoder}, and practical tools that help human developers have started to appear and to be used in practice~\cite{Raychev2015, murali2017neural}. Two major types of algorithms are used in the current literature on applying formal methods to software engineering. The first type is based on combinatorial search. Combinatorial search plays an important role in model checking and traditional program synthesis~\cite{Manna1992, rajeev2013, lezama06, Long2015, Douskos2015, Pnueli1989, Alur2015, Feser2015, Gulwani2010}. The main idea is to first define a goal and the steps for reaching it; the computer then searches for a solution. Typically, heuristics are defined to reduce the search space and speed up the search. The advantages of search-based methods include: (1) they are relatively easy to implement and can be used to solve problems for which no efficient solutions exist; (2) they can sometimes discover results that are hard for humans to conceive, because computers can explore a large search space far more quickly than humans; and (3) they can solve problems that require precision, which is typically needed when analyzing computer programs. As SAT and SMT solvers have matured, researchers have been able to use these fast solvers to gain significant performance boosts. The biggest drawback of search-based methods is their high algorithmic complexity.
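The combinatorial nature of these methods can be made concrete with a minimal enumerative synthesizer, sketched below in Java-like pseudocode (a hypothetical illustration, not any system from the literature; the types \verb|Expr|, \verb|Var|, \verb|Const|, \verb|BinExpr| and \verb|Op| are invented for exposition). It builds ever-larger expressions from a pool of components and tests each candidate against input--output examples:

\begin{verbatim}
// A hypothetical bottom-up enumerative search (sketch).
// Goal: find an expression e with e(1)=3 and e(2)=5, e.g. 2*x + 1.
Expr synthesize() {
    List<Expr> pool = new ArrayList<>(
        List.of(Var.X, Const.of(1), Const.of(2)));
    for (int depth = 0; depth < MAX_DEPTH; depth++) {
        List<Expr> next = new ArrayList<>();
        for (Expr a : pool)                        // combine every pair of
            for (Expr b : pool)                    // known subexpressions
                for (Op op : List.of(Op.ADD, Op.MUL))
                    next.add(new BinExpr(op, a, b));
        pool.addAll(next);                         // quadratic growth per level
        for (Expr e : pool)
            if (e.eval(1) == 3 && e.eval(2) == 5)  // goal test on the examples
                return e;
    }
    return null;                                   // budget exhausted
}
\end{verbatim}

Even in this toy setting, the candidate pool grows quadratically at every level, which already hints at why unguided search does not scale.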
The search space grows rapidly as the input size increases, and this is the main reason why most traditional model checking and program synthesis algorithms cannot deal with large programs~\cite{Gulwani2010}. Another drawback worth mentioning is that search-based methods tend to be quite fragile: they typically require the inputs at every step to be extremely precise, or the algorithm will not perform as expected. The second type of algorithms is based on learning. The idea is to let a machine improve its performance on a task using data; in the process, learning-based methods capture idioms that are essential to solving the problem and that are typically hard for humans to express or discover. Large amounts of data did not become available online until around 2012; after that, researchers started applying learning-based methods to programming systems~\cite{mishne12, Raychev2015, Raychev2014, murali2017neural, murali2017bayesian, balog2016deepcoder}. The biggest advantage brought by ``big data'' or ``big code'' is that it allows researchers to use machine learning techniques to find idioms that reduce the search space significantly. Examples include relationships between variable names and their semantic information, and API call sequence idioms. These idioms cannot be obtained without analyzing a large amount of data. Another advantage over search-based methods is robustness, because machine learning algorithms train on large amounts of data in which small amounts of noise are suppressed. Even though data-driven programming systems have been quite impactful, learning-based methods are not as accessible as search-based methods because they require data.
For learning-based algorithms to perform well in practice, a large amount of data is typically required, which in turn demands time and computational resources that might not be available to everyone. In this thesis, we propose two additional corpus-driven systems that aim to automate the processes of software reuse and software refactoring. In the current literature, the problems of software reuse and refactoring have both been considered, but no system fully automates them, and some state-of-the-art tools~\cite{Barr2015, balaban2005refactoring} still require humans to provide additional hints. By using a large code corpus, we claim that our systems can fully automate the processes of software reuse and refactoring without human intervention, accomplish these tasks efficiently, and help human developers by boosting their programming productivity.

\section{Program reuse via splicing}

We first introduce {\em program splicing}, a programming system that helps human developers by automating the process of software reuse. The most popular reuse workflow nowadays consists of copying, pasting, and modifying code available online, and the reason for its dominance is that it is relatively easy to execute with the help of internet search. However, this process inherits the drawbacks of programming itself: it requires extreme precision and care from programmers, just as normal programming does. When a software reuse task takes place in a large and complicated software system, the cost of making mistakes and spending enormous amounts of time on repairs might exceed the benefit, let alone the fact that programmers sometimes do not even try to fully understand the code they bring in from the internet so long as it appears to work in their specific software environment. This can pose a threat to their future software development progress.
Existing techniques that inspired our method fall into two areas: search-based program synthesis and data-driven methods. The problem of program synthesis has been studied for decades, and researchers have long applied search-based methods to tackle it~\cite{Pnueli1989, lezama06, Srivastava2012, Alur2015, Feser2015, yaghmazadeh2016}. The main benefit with respect to this work is that search-based methods can produce results that require precision. This is crucial when we aim to generate code snippets that need to interact with pre-written software pieces; examples include matching variables that are semantically similar or equivalent. However, search-based methods do not scale well to large inputs, which lead to large search spaces due to the complexity of the problem, and this is the main reason why one of the competing systems, $\mu$Scalpel, is not as efficient as our splicing method. To alleviate the scalability problem, researchers have shown that using ``big data'' can be quite effective~\cite{Raychev2015, Raychev2014, raychev2016, balog2016deepcoder, hindle2012naturalness}. Even though our splicing method does not use any statistical model, we still reduce our search space significantly and achieve high efficiency by relying on natural language to search a big code corpus~\cite{kashyap17}. One novelty of this work is that we combine ideas from search-based and data-driven methods. To use our programming system for program reuse, a programmer starts by writing a ``draft'' that mixes unfinished code, natural language comments, and correctness requirements. A program synthesizer that interacts with a large, searchable database of program snippets automatically completes the draft into a program that meets the requirements. The synthesis process happens in two stages.
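As a concrete illustration (a hypothetical example constructed for exposition, not taken from our evaluation), a draft for reading a file into a string might look like the following, with a natural language comment marking the unfinished part:

\begin{verbatim}
// Hypothetical draft: the hole is expressed in natural language.
String readAll(String path) throws IOException {
    StringBuilder sb = new StringBuilder();
    /* TODO: open `path` with a buffered reader, append every
       line to `sb`, and close the reader when done */
    return sb.toString();
}
\end{verbatim}

The synthesizer's task is to replace the hole with statements drawn from relevant corpus snippets while reusing the in-scope variables \verb|path| and \verb|sb|.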
First, the synthesizer identifies a small number of programs in the database~\cite{zou2018plinycompute} that are relevant to the synthesis task. Next, it uses an enumerative search to systematically fill the draft with expressions and statements from these relevant programs. The resulting program is returned to the programmer, who can modify it and possibly invoke additional rounds of synthesis. We present an implementation of program splicing, called \system, for the Java programming language. \system uses a corpus of over 3.5 million procedures from an open-source software repository. Our evaluation applies the system to a suite of everyday programming tasks and includes a comparison with a state-of-the-art competing approach~\cite{Barr2015} as well as a user study. The results point to the broad scope and scalability of program splicing and indicate that the approach can significantly boost programmer productivity.

\section{API refactoring using natural language and an API synthesizer}

Software refactoring involves restructuring existing source code without modifying its functionality. It is an important, almost daily routine that programmers perform to keep their software projects clean and organized: constructing better abstractions, deleting duplicated code, breaking a big functionality into small, universally applicable pieces, and so on. Maintenance is crucial because a software system can easily deteriorate and become obsolete and useless if it is not maintained properly and regularly, especially given how rapidly the external libraries and underlying systems it depends on evolve nowadays. After several decades of software development, most professional programmers have realized the importance of software refactoring, and it is now used heavily and regularly in the software industry.
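As a simple illustration of the kind of behavior-preserving transformation involved (a hypothetical example, not drawn from our benchmarks), duplicated logic can be factored into a shared abstraction:

\begin{verbatim}
// Before: the same validation logic appears twice.
if (name != null && !name.trim().isEmpty())   users.add(name);
if (alias != null && !alias.trim().isEmpty()) users.add(alias);

// After: the duplicate is extracted into a reusable helper;
// the observable behavior is unchanged.
static boolean isPresent(String s) {
    return s != null && !s.trim().isEmpty();
}
if (isPresent(name))  users.add(name);
if (isPresent(alias)) users.add(alias);
\end{verbatim}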
Similar to software reuse, software refactoring inherits the drawbacks of programming. It again demands extreme accuracy, and programmers tend to make mistakes when they deal with large and complex software systems, which typically involve keeping track of tens or even hundreds of variables and function components. In this thesis, we focus on refactoring Application Programming Interface (API) call sequences. An API consists of all the definitions and usages of the resources a software system makes available for external use, and almost all software systems nowadays are built using APIs from other systems. API refactoring mainly consists of changing an API call sequence defined in one library into a sequence defined in another library. Its benefits are largely the same as those of general software refactoring, but it has two specific benefits as well. The first is that it allows programmers to reuse obsolete programs by adapting them to the current programming environment. The second is that it can enhance the performance of an existing program by refactoring it to use more advanced libraries and platforms, which typically perform better. The main difficulty of API refactoring lies in discovering semantically equivalent API calls between two libraries and in instantiating the new API calls with the environment's variables so that the resulting call sequence preserves the functionality of the original. One of the earliest works~\cite{balaban2005refactoring} that aims to help with API refactoring requires human intervention: the user must formally specify the mapping between the API calls of the two libraries, and the system refactors only \emph{individual} API calls rather than sequences.
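For instance, migrating a file read from the \verb|java.io| API to the \verb|java.nio.file| API involves mappings like the following hypothetical one, in which several cooperating calls in the source library collapse into a single call in the destination library:

\begin{verbatim}
// Original sequence using java.io (three cooperating calls):
BufferedReader r = new BufferedReader(new FileReader(path));
String first = r.readLine();
r.close();

// Refactored to java.nio.file: the three calls become one.
// Such many-to-one mappings are exactly what per-call
// specification tools cannot express.
String first = Files.readAllLines(Paths.get(path)).get(0);
\end{verbatim}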
Subsequent research in the area of API refactoring has been limited to the problem of API mapping, or API translation, whose goal is to discover pairs of API calls that are semantically equivalent. Two types of methods have been developed to solve this problem. The first aligns two API call sequences using a statistical model and extracts translations from the alignment results~\cite{gokhale2013inferring}. This alignment method can find not only one-to-one but also one-to-many API translations; the downside is that it requires a large number of API call sequences to train the underlying statistical model. The second relies on natural language features such as Javadoc to find semantically equivalent API calls~\cite{pandita2015discovering, nguyen2016mapping, zhong2009inferring}. Since Javadoc describes the behavior of API calls, correct translations can be found by computing the similarity between the Javadoc texts of two API calls, which can easily be done with a standard \verb|Word2Vec| model that captures semantic similarities between words. The main drawback of using natural language features as the glue is that it is difficult to discover one-to-many API translations. In this thesis, we propose a new algorithm that automates the process of API refactoring by combining the natural language technique~\cite{pandita2015discovering} with a state-of-the-art API call sequence synthesizer called \verb|Bayou|~\cite{murali2017neural}. The input to our algorithm is an API call sequence and the name of the destination library, and the output is a semantically equivalent sequence that uses only API calls defined in the destination library. We solve the problem in two steps.
We first translate the input API call sequence into a set of stand-alone API calls defined in the destination library, using natural language features as the main driver~\cite{pandita2015discovering, nguyen2016mapping}. Then we feed the stand-alone API calls into an API sequence synthesizer called \emph{Bayou}~\cite{murali2017neural}, which in turn synthesizes a complete sequence of API calls. We have designed a series of benchmark problems to evaluate the accuracy of our API refactoring algorithm, where accuracy is defined as the percentage of correctly generated API calls. The results show that our algorithm refactors API call sequences accurately, provided that the two libraries involved have similar coding practices and the input sequence is not rare in the training data.
```