---
type: academic
category: academic
person: Yanxin Lu
date: 2018
source: writing.tex
---

# Thesis Introduction - LaTeX Source (writing.tex)

This is the LaTeX source file for the thesis introduction chapter. The compiled PDF version is available as `lu_writing.pdf`.

```latex
\chapter{Introduction}
\label{ch:intro}

With advances in technologies such as artificial intelligence and the expansion of high-tech companies, computer programming has become an important skill, and the demand for programmers has grown dramatically in the past few years. Overall productivity has been boosted significantly by the increasing number of programmers, but we have not yet witnessed a comparable boost in individual programming productivity. The most important reason is that programming is a difficult task. It requires programmers to deal with extremely low-level details in complex software projects, and it is almost inevitable that programmers make small mistakes. People tend to assume that a piece of untested software does not function properly. To address this problem, software engineering techniques and formal-methods-based techniques have been proposed to facilitate programming. These techniques include various software engineering methodologies, design patterns, sophisticated testing methods, program repair algorithms, model checking algorithms and program synthesis methods. Some techniques, such as software engineering methodologies, design patterns and unit testing, have proven practical and useful in boosting programming productivity, and the industry has been adopting them for more than a decade. The main reason for their popularity and longevity is that these techniques are quite easy for average programmers to execute. However, one dominant problem with these software engineering approaches is that they are not rigorous enough. If the specification of a method is not followed strictly, its benefits tend to be diminished.
More advanced methods with additional rules have been proposed, but their specifications tend to be vague, which makes them difficult to execute. Some researchers have therefore turned their attention to applying formal methods to tackle the difficulties in programming. Methods such as model checking and program synthesis are much more rigorous than traditional software engineering techniques, and their benefits are guaranteed once all preconditions are met. However, the impact of these formal-methods techniques has been much smaller than the influence of software engineering techniques. The reason is that a formal-methods-based approach is likely to fail on large inputs: it may not terminate or produce any useful result because of its large search space. These large search spaces are inevitable, since formal-methods techniques typically deal with problems that are extremely complex in theory. However, people have been trying to make formal-methods approaches practical by introducing additional hints~\cite{Srivastava2012} or by restricting the problem domain~\cite{Gulwani2011spreadsheet, Gulwani2011, Gulwani2010}. With the advent of ``big data'', researchers started to pay attention to problems that were previously considered difficult or impossible, and this has led to significant advances in machine learning. Similarly, as open source repositories such as \verb|Google Code|, \verb|Github| and \verb|SourceForge| have come online, making thousands of software projects and their source code available, researchers in the programming language community have also started using ``big code'' to tackle problems once considered intractable. With the help of ``big code'', many new techniques that use formal methods and aim to facilitate programming have been proposed.
These techniques include program property prediction~\cite{mishne12, Raychev2015}, API sequence prediction~\cite{Raychev2014, murali2017neural, murali2017bayesian} and small program generation~\cite{balog2016deepcoder}. Researchers have shown that using data can indeed make the synthesis problem feasible~\cite{balog2016deepcoder}, and practical tools that help human developers have started to appear and to be used in practice~\cite{Raychev2015, murali2017neural}. Two major types of algorithms are used in the current literature on applying formal methods to software engineering. The first type is based on combinatorial search. Combinatorial search plays an important role in model checking and traditional program synthesis~\cite{Manna1992, rajeev2013, lezama06, Long2015, Douskos2015, Pnueli1989, Alur2015, Feser2015, Gulwani2010}. The main idea is to first define a goal and the steps for reaching it; the computer then searches for a solution. Typically, heuristics are defined to reduce the search space and speed up the search. The advantages of search-based methods include: (1) they are relatively easy to implement and can be used to solve problems for which no efficient solutions exist; (2) they can sometimes discover results that are hard for humans to conceive, because computers can explore a large search space far more quickly than humans; and (3) they can solve problems that require precision, which is typically needed when analyzing computer programs. As SAT and SMT solvers have matured, researchers have been able to use these fast solvers to gain significant performance boosts. The biggest drawback of search-based methods is their high algorithmic complexity.
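The combinatorial nature of these methods can be made concrete with a minimal enumerative synthesizer, sketched below in Java-like pseudocode (a hypothetical illustration, not any system from the literature; the types \verb|Expr|, \verb|Var|, \verb|Const|, \verb|BinExpr| and \verb|Op| are invented for exposition). It builds ever-larger expressions from a pool of components and tests each candidate against input--output examples:

\begin{verbatim}
// A hypothetical bottom-up enumerative search (sketch).
// Goal: find an expression e with e(1)=3 and e(2)=5, e.g. 2*x + 1.
Expr synthesize() {
    List<Expr> pool = new ArrayList<>(
        List.of(Var.X, Const.of(1), Const.of(2)));
    for (int depth = 0; depth < MAX_DEPTH; depth++) {
        List<Expr> next = new ArrayList<>();
        for (Expr a : pool)                        // combine every pair of
            for (Expr b : pool)                    // known subexpressions
                for (Op op : List.of(Op.ADD, Op.MUL))
                    next.add(new BinExpr(op, a, b));
        pool.addAll(next);                         // quadratic growth per level
        for (Expr e : pool)
            if (e.eval(1) == 3 && e.eval(2) == 5)  // goal test on the examples
                return e;
    }
    return null;                                   // budget exhausted
}
\end{verbatim}

Even in this toy setting, the candidate pool grows quadratically at every level, which already hints at why unguided search does not scale.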
The search space grows rapidly as the input size increases, and this is the main reason why most traditional model checking and program synthesis algorithms cannot deal with large programs~\cite{Gulwani2010}. Another drawback worth mentioning is that search-based methods tend to be quite fragile: they typically require the inputs at every step to be extremely precise, or the algorithm will not perform as expected. The second type of algorithms is based on learning. The idea is to let a machine improve its performance on a task using data; in the process, learning-based methods capture idioms that are essential to solving the problem and that are typically hard for humans to express or discover. Large amounts of data did not become available online until around 2012; after that, researchers started applying learning-based methods to programming systems~\cite{mishne12, Raychev2015, Raychev2014, murali2017neural, murali2017bayesian, balog2016deepcoder}. The biggest advantage brought by ``big data'' or ``big code'' is that it allows researchers to use machine learning techniques to find idioms that reduce the search space significantly. Examples include relationships between variable names and their semantic information, and API call sequence idioms. These idioms cannot be obtained without analyzing a large amount of data. Another advantage over search-based methods is robustness, because machine learning algorithms train on large amounts of data in which small amounts of noise are suppressed. Even though data-driven programming systems have been quite impactful, learning-based methods are not as accessible as search-based methods because they require data.
For learning-based algorithms to perform well in practice, a large amount of data is typically required, which in turn demands time and computational resources that might not be available to everyone. In this thesis, we propose two additional corpus-driven systems that aim to automate the processes of software reuse and software refactoring. In the current literature, the problems of software reuse and refactoring have both been considered, but no system fully automates them, and some state-of-the-art tools~\cite{Barr2015, balaban2005refactoring} still require humans to provide additional hints. By using a large code corpus, we claim that our systems can fully automate the processes of software reuse and refactoring without human intervention, accomplish these tasks efficiently, and help human developers by boosting their programming productivity.

\section{Program reuse via splicing}

We first introduce {\em program splicing}, a programming system that helps human developers by automating the process of software reuse. The most popular reuse workflow nowadays consists of copying, pasting, and modifying code available online, and the reason for its dominance is that it is relatively easy to execute with the help of internet search. However, this process inherits the drawbacks of programming itself: it requires extreme precision and care from programmers, just as normal programming does. When a software reuse task takes place in a large and complicated software system, the cost of making mistakes and spending enormous amounts of time on repairs might exceed the benefit, let alone the fact that programmers sometimes do not even try to fully understand the code they bring in from the internet so long as it appears to work in their specific software environment. This can pose a threat to their future software development progress.
Existing techniques that inspired our method fall into two areas: search-based program synthesis and data-driven methods. The problem of program synthesis has been studied for decades, and researchers have long applied search-based methods to tackle it~\cite{Pnueli1989, lezama06, Srivastava2012, Alur2015, Feser2015, yaghmazadeh2016}. The main benefit with respect to this work is that search-based methods can produce results that require precision. This is crucial when we aim to generate code snippets that need to interact with pre-written software pieces; examples include matching variables that are semantically similar or equivalent. However, search-based methods do not scale well to large inputs, which lead to large search spaces due to the complexity of the problem, and this is the main reason why one of the competing systems, $\mu$Scalpel, is not as efficient as our splicing method. To alleviate the scalability problem, researchers have shown that using ``big data'' can be quite effective~\cite{Raychev2015, Raychev2014, raychev2016, balog2016deepcoder, hindle2012naturalness}. Even though our splicing method does not use any statistical model, we still reduce our search space significantly and achieve high efficiency by relying on natural language to search a big code corpus~\cite{kashyap17}. One novelty of this work is that we combine ideas from search-based and data-driven methods. To use our programming system for program reuse, a programmer starts by writing a ``draft'' that mixes unfinished code, natural language comments, and correctness requirements. A program synthesizer that interacts with a large, searchable database of program snippets automatically completes the draft into a program that meets the requirements. The synthesis process happens in two stages.
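As a concrete illustration (a hypothetical example constructed for exposition, not taken from our evaluation), a draft for reading a file into a string might look like the following, with a natural language comment marking the unfinished part:

\begin{verbatim}
// Hypothetical draft: the hole is expressed in natural language.
String readAll(String path) throws IOException {
    StringBuilder sb = new StringBuilder();
    /* TODO: open `path` with a buffered reader, append every
       line to `sb`, and close the reader when done */
    return sb.toString();
}
\end{verbatim}

The synthesizer's task is to replace the hole with statements drawn from relevant corpus snippets while reusing the in-scope variables \verb|path| and \verb|sb|.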
First, the synthesizer identifies a small number of programs in the database~\cite{zou2018plinycompute} that are relevant to the synthesis task. Next, it uses an enumerative search to systematically fill the draft with expressions and statements from these relevant programs. The resulting program is returned to the programmer, who can modify it and possibly invoke additional rounds of synthesis. We present an implementation of program splicing, called \system, for the Java programming language. \system uses a corpus of over 3.5 million procedures from an open-source software repository. Our evaluation applies the system to a suite of everyday programming tasks and includes a comparison with a state-of-the-art competing approach~\cite{Barr2015} as well as a user study. The results point to the broad scope and scalability of program splicing and indicate that the approach can significantly boost programmer productivity.

\section{API refactoring using natural language and an API synthesizer}

Software refactoring involves restructuring existing source code without modifying its functionality. It is an important, almost daily routine that programmers perform to keep their software projects clean and organized: constructing better abstractions, deleting duplicated code, breaking a big functionality into small, universally applicable pieces, and so on. Maintenance is crucial because a software system can easily deteriorate and become obsolete and useless if it is not maintained properly and regularly, especially given how rapidly the external libraries and underlying systems it depends on evolve nowadays. After several decades of software development, most professional programmers have realized the importance of software refactoring, and it is now used heavily and regularly in the software industry.
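As a simple illustration of the kind of behavior-preserving transformation involved (a hypothetical example, not drawn from our benchmarks), duplicated logic can be factored into a shared abstraction:

\begin{verbatim}
// Before: the same validation logic appears twice.
if (name != null && !name.trim().isEmpty())   users.add(name);
if (alias != null && !alias.trim().isEmpty()) users.add(alias);

// After: the duplicate is extracted into a reusable helper;
// the observable behavior is unchanged.
static boolean isPresent(String s) {
    return s != null && !s.trim().isEmpty();
}
if (isPresent(name))  users.add(name);
if (isPresent(alias)) users.add(alias);
\end{verbatim}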
Similar to software reuse, software refactoring inherits the drawbacks of programming. It again demands extreme accuracy, and programmers tend to make mistakes when they deal with large and complex software systems, which typically involve keeping track of tens or even hundreds of variables and function components. In this thesis, we focus on refactoring Application Programming Interface (API) call sequences. An API consists of all the definitions and usages of the resources a software system makes available for external use, and almost all software systems nowadays are built using APIs from other systems. API refactoring mainly consists of changing an API call sequence defined in one library into a sequence defined in another library. Its benefits are largely the same as those of general software refactoring, but it has two specific benefits as well. The first is that it allows programmers to reuse obsolete programs by adapting them to the current programming environment. The second is that it can enhance the performance of an existing program by refactoring it to use more advanced libraries and platforms, which typically perform better. The main difficulty of API refactoring lies in discovering semantically equivalent API calls between two libraries and in instantiating the new API calls with the environment's variables so that the resulting call sequence preserves the functionality of the original. One of the earliest works~\cite{balaban2005refactoring} that aims to help with API refactoring requires human intervention: the user must formally specify the mapping between the API calls of the two libraries, and the system refactors only \emph{individual} API calls rather than sequences.
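For instance, migrating a file read from the \verb|java.io| API to the \verb|java.nio.file| API involves mappings like the following hypothetical one, in which several cooperating calls in the source library collapse into a single call in the destination library:

\begin{verbatim}
// Original sequence using java.io (three cooperating calls):
BufferedReader r = new BufferedReader(new FileReader(path));
String first = r.readLine();
r.close();

// Refactored to java.nio.file: the three calls become one.
// Such many-to-one mappings are exactly what per-call
// specification tools cannot express.
String first = Files.readAllLines(Paths.get(path)).get(0);
\end{verbatim}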
Subsequent research in the area of API refactoring has been limited to the problem of API mapping, or API translation, whose goal is to discover pairs of API calls that are semantically equivalent. Two types of methods have been developed to solve this problem. The first aligns two API call sequences using a statistical model and extracts translations from the alignment results~\cite{gokhale2013inferring}. This alignment method can find not only one-to-one but also one-to-many API translations; the downside is that it requires a large number of API call sequences to train the underlying statistical model. The second relies on natural language features such as Javadoc to find semantically equivalent API calls~\cite{pandita2015discovering, nguyen2016mapping, zhong2009inferring}. Since Javadoc describes the behavior of API calls, correct translations can be found by computing the similarity between the Javadoc texts of two API calls, which can easily be done with a standard \verb|Word2Vec| model that captures semantic similarities between words. The main drawback of using natural language features as the glue is that it is difficult to discover one-to-many API translations. In this thesis, we propose a new algorithm that automates the process of API refactoring by combining the natural language technique~\cite{pandita2015discovering} with a state-of-the-art API call sequence synthesizer called \verb|Bayou|~\cite{murali2017neural}. The input to our algorithm is an API call sequence and the name of the destination library, and the output is a semantically equivalent sequence that uses only API calls defined in the destination library. We solve the problem in two steps.
We first translate the input API call sequence into a set of stand-alone API calls defined in the destination library, using natural language features as the main driver~\cite{pandita2015discovering, nguyen2016mapping}. Then we feed the stand-alone API calls into an API sequence synthesizer called \emph{Bayou}~\cite{murali2017neural}, which in turn synthesizes a complete sequence of API calls. We have designed a series of benchmark problems to evaluate the accuracy of our API refactoring algorithm, where accuracy is defined as the percentage of correctly generated API calls. The results show that our algorithm refactors API call sequences accurately, provided that the two libraries involved have similar coding practices and the input sequence is not rare in the training data.
```