---
type: academic
category: academic
person: Yanxin Lu
date: 2018
source: writing.tex
---

# Thesis Introduction - LaTeX Source (writing.tex)

This is the LaTeX source file for the thesis introduction chapter. The compiled PDF version is available as `lu_writing.pdf`.

```latex
\chapter{Introduction}
\label{ch:intro}

With advances in technologies such as artificial intelligence and the
expansion of high-tech companies, computer programming has become an
important skill, and the demand for programmers has grown dramatically
in the past few years. Overall productivity has been boosted
significantly by the growing number of programmers, but we have not
yet witnessed a comparable boost in individual programming
productivity.

The most important reason is that programming is a difficult task. It
requires programmers to deal with extremely low-level details in
complex software projects, and it is almost inevitable that
programmers make small mistakes. People tend to assume that a piece of
untested software does not function properly. To deal with this
problem, software engineering techniques and formal-method-based
techniques have been proposed to facilitate programming. These
techniques include various software engineering methodologies, design
patterns, sophisticated testing methods, program repair algorithms,
model checking algorithms and program synthesis methods. Some
techniques, such as software engineering methodologies, design
patterns and unit testing, have proved practical and useful in
boosting programming productivity, and the industry has been adopting
them for more than a decade. The main reason for their popularity and
longevity is that these techniques are quite easy for average
programmers to execute. However, one dominant problem with these
software engineering approaches is that they are not rigorous
enough. If the specification of a method is not followed strictly, its
benefits tend to be diminished. More advanced methods with more rules
have been proposed, but their specifications tend to be vague at
times, which results in execution difficulties.

Some researchers have turned their attention to applying formal
methods to tackle the difficulties in programming. Methods such as
model checking and program synthesis are much more rigorous than
traditional software engineering techniques, and their performance and
benefits are guaranteed once all preconditions are met. However, the
impact of these formal-method techniques is much smaller than the
influence of software engineering techniques, because a
formal-method-based approach is likely to fail on large inputs: it
will not terminate and produce any useful result due to its large
search space. These large search spaces are inevitable, since
formal-method techniques typically deal with problems that are
extremely complex in theory. However, people have been trying to make
formal-method approaches practical by introducing additional
hints~\cite{Srivastava2012} or by restricting the problem
domain~\cite{Gulwani2011spreadsheet, Gulwani2011, Gulwani2010}.

With the advent of ``big data'', researchers started to pay attention
to problems that were considered difficult or impossible, and this has
led to significant advances in the area of machine
learning. Similarly, as more and more open-source repositories such as
\verb|Google Code|, \verb|Github| and \verb|SourceForge| have come
online, where thousands of software projects and their source code
become available, researchers from the programming language community
have also started to use ``big code'' to tackle problems that were
considered difficult. With the help of ``big code'', many new
techniques that use formal methods and aim to facilitate programming
have been proposed. These techniques include program property
prediction~\cite{mishne12, Raychev2015}, API sequence
prediction~\cite{Raychev2014, murali2017neural, murali2017bayesian}
and small program generation~\cite{balog2016deepcoder}. Researchers
have shown that using data can indeed make the problem of synthesis
feasible~\cite{balog2016deepcoder}, and practical tools that can help
human developers have started to appear and to be used by programmers
in practice~\cite{Raychev2015, murali2017neural}.

Two major types of algorithms are used in the current literature on
applying formal methods to software engineering. The first type is
based on combinatorial search. Combinatorial search plays an important
role in model checking and traditional program
synthesis~\cite{Manna1992, rajeev2013, lezama06, Long2015,
Douskos2015, Pnueli1989, Alur2015, Feser2015, Gulwani2010}. The main
idea is to first define a goal together with the steps for reaching
that goal. Programmers can then let the computer search for a
solution. Typically, heuristics are defined to reduce the search space
and to speed up the search. The advantages of search-based methods
include: (1) they are relatively easy to implement and can be used to
solve problems for which no efficient solutions exist; (2) the
algorithms can sometimes discover results that are hard for humans to
conceive, because computers can explore large search spaces far more
quickly than humans; and (3) search-based methods can solve problems
that require precision, which is typically needed for analyzing
computer programs. As SAT and SMT solvers have become more
sophisticated, people have been able to use these fast solvers to gain
significant performance boosts. The biggest drawback of search-based
methods is their high algorithmic complexity. The search space
explodes as the input size increases, and this is the main reason why
most traditional model checking methods and program synthesis
algorithms cannot deal with large programs~\cite{Gulwani2010}. Another
drawback worth mentioning is that search-based methods tend to be
quite fragile. They typically require the input at every step to be
extremely precise, or the algorithms will not perform as expected.

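To make the flavor of combinatorial search concrete, the following
sketch (illustrative only; \verb|Expr|, \verb|enumerate| and
\verb|Example| are hypothetical names, not part of any system in this
thesis) enumerates candidate expressions by size and returns the first
one consistent with a set of input--output examples:

\begin{verbatim}
// Illustrative sketch of enumerative synthesis.
Expr synthesize(List<Example> examples) {
  for (int size = 1; size <= BOUND; size++) {
    for (Expr e : enumerate(size)) {        // candidate space
      boolean ok = true;
      for (Example ex : examples)           // goal check
        if (!e.eval(ex.in).equals(ex.out)) { ok = false; break; }
      if (ok) return e;                     // goal reached
    }
  }
  return null;                              // bound exhausted
}
\end{verbatim}

Heuristics would prune branches of this enumeration that provably
cannot satisfy the examples, which is how practical systems keep the
search tractable.
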
The second type of algorithms is based on learning. The idea of
learning is to let a machine improve its performance on a task using
data, and in the process learning-based methods are able to capture
idioms that are essential to solving the problem. These idioms are
typically hard for humans to express or discover. Large amounts of
data were not available online until around 2012; after that,
researchers started applying learning-based methods to programming
systems~\cite{mishne12, Raychev2015, Raychev2014, murali2017neural,
murali2017bayesian, balog2016deepcoder}. The biggest advantage brought
by ``big data'' or ``big code'' is that it allows researchers to use
machine learning techniques to find idioms that reduce the search
space significantly. Examples include relationships between variable
names and their semantic information, and API call sequence
idioms. These idioms cannot be made available without analyzing a
large amount of data. Another advantage over search-based methods is
robustness, because machine learning algorithms tend to use large
amounts of data in which small noise is suppressed. Even though
data-driven programming systems are quite impactful, learning-based
methods are not as accessible as search-based methods because they
tend to require data. For learning-based algorithms to perform well in
practice, a large amount of data is typically required, which also
leads to a large consumption of time and computational resources that
might not be available to everyone.

In this thesis, we propose two additional corpus-driven systems that
aim to automate the processes of software reuse and software
refactoring. In the current literature, the problems of software reuse
and refactoring have both been considered, but no system can fully
automate them, and some state-of-the-art tools~\cite{Barr2015,
balaban2005refactoring} still require humans to provide additional
hints. By using a large code corpus, we claim that our systems can
fully automate the processes of software reuse and refactoring without
human intervention, and that they can accomplish these tasks
efficiently and help human developers by boosting their programming
productivity.

\section{Program reuse via splicing}
We first introduce {\em program splicing}, a programming system that
helps human developers by automating the process of software
reuse. The most popular workflow nowadays consists of copying,
pasting, and modifying code available online, and the reason for its
dominance is that it is relatively easy to execute with the help of
internet search. However, this process inherits the drawbacks of
programming: like ordinary programming, it requires extreme precision
and carefulness. When a software reuse task takes place in a large and
complicated software system, the cost of making mistakes and spending
enormous time on repair might exceed the benefit, let alone the fact
that programmers sometimes do not even try to fully understand the
code they bring in from the internet as long as it appears to work in
their specific software environment. This can pose a threat to their
future software development progress.

Existing techniques that inspire our method can be divided into two
areas: search-based program synthesis techniques and data-driven
methods. The problem of program synthesis has been studied for
decades, and researchers have been applying search-based methods to
tackle it for just as long~\cite{Pnueli1989, lezama06, Srivastava2012,
Alur2015, Feser2015, yaghmazadeh2016}. The main benefit with respect
to this work comes from the fact that search-based methods can produce
results that require precision. This is crucial when we aim to
generate code snippets that need to interact with pre-written software
pieces; examples include matching variables that are semantically
similar or equivalent. However, search-based methods do not scale well
to large inputs, which lead to large search spaces due to the
complexity of the problem; this is the main reason why one of the
competing systems, $\mu$Scalpel, is not as efficient as our splicing
method. To alleviate the scalability problem, people have shown that
using ``big data'' can be quite effective~\cite{Raychev2015,
Raychev2014, raychev2016, balog2016deepcoder,
hindle2012naturalness}. Even though our splicing method does not use
any statistical method, we still reduce our search space significantly
and achieve high efficiency by using natural language to search a big
code corpus~\cite{kashyap17}.

One novelty of this work is that we combine ideas from search-based
methods and data-driven methods. To use our programming system for
program reuse, a programmer starts by writing a ``draft'' that mixes
unfinished code, natural language comments, and correctness
requirements. A program synthesizer that interacts with a large,
searchable database of program snippets is used to automatically
complete the draft into a program that meets the requirements. The
synthesis process happens in two stages. First, the synthesizer
identifies a small number of programs in the
database~\cite{zou2018plinycompute} that are relevant to the synthesis
task. Next, it uses an enumerative search to systematically fill the
draft with expressions and statements from these relevant
programs. The resulting program is returned to the programmer, who can
modify it and possibly invoke additional rounds of synthesis.

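As a purely hypothetical illustration of such a draft (the method
name, the hole marker and the comment below are invented for
exposition, not drawn from our evaluation), a programmer might write:

\begin{verbatim}
// A draft: unfinished code, a natural language comment,
// and a hole for the synthesizer to fill.
void printLines(String path) throws IOException {
    /* read all lines from the file at `path`
       and print each one */
    ??   // hole
}
\end{verbatim}

The synthesizer would then retrieve relevant snippets from the corpus
(for example, loops over \verb|BufferedReader|) and fill the hole so
that the completed method meets the stated requirements.
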
We present an implementation of program splicing, called \system, for
the Java programming language. \system uses a corpus of over 3.5
million procedures from an open-source software repository. Our
evaluation applies the system to a suite of everyday programming
tasks, and includes a comparison with a state-of-the-art competing
approach~\cite{Barr2015} as well as a user study. The results point to
the broad scope and scalability of program splicing and indicate that
the approach can significantly boost programmer productivity.

\section{API refactoring using natural language and an API synthesizer}
Software refactoring typically involves restructuring existing source
code without modifying its functionality. It is important, and almost
a daily routine that programmers perform to keep their software
projects clean and organized: constructing better abstractions,
deleting duplicated code, breaking big pieces of functionality into
small, universally applicable pieces, and so on. Software system
maintenance is extremely crucial, because a software system can easily
deteriorate and become obsolete and useless if maintenance is not done
properly and regularly, especially nowadays when the external
libraries it uses and the other underlying software systems it depends
on evolve rapidly. After several decades of software development, most
professional programmers have realized the importance of software
refactoring, and it has been used heavily and regularly in the
software industry. Similar to software reuse, software refactoring
also inherits the drawbacks of programming. It again requires extreme
accuracy from programmers, and programmers tend to make mistakes when
they deal with large and complex software systems, which typically
involve keeping track of tens or even hundreds of variables and
function components.

In this thesis, we focus on refactoring Application Programming
Interface (API) call sequences. An API consists of all the definitions
and usages of the resources that a software system makes available for
external use, and nowadays almost all software systems are built using
various APIs from other software systems. The process of API
refactoring mainly consists of changing an API call sequence defined
in one library into another sequence defined in another library. The
benefits of performing API refactoring match those of general software
refactoring, but API refactoring also has specific benefits. First, it
allows programmers to reuse obsolete programs by adapting them to the
existing programming environment. Second, it can enhance the
performance of existing programs by refactoring them to use advanced
libraries and platforms, which typically have better performance.

The main difficulty of API refactoring lies in discovering
semantically equivalent API calls between two libraries and in
instantiating the new API calls with the environment's variables so
that the resulting API call sequence does not alter the functionality
of the original one. One of the earliest
works~\cite{balaban2005refactoring} that aims to help with API
refactoring requires human intervention: the user of the system needs
to formally specify the mapping between the API calls of the two
libraries, and the system only focuses on refactoring
\emph{individual} API calls rather than sequences. Subsequent research
in the area has been limited to the problem of API mapping, or API
translation, where the goal is to discover two API calls that are
semantically equivalent. Two types of methods have been developed to
solve this problem. The first aligns two API call sequences using a
statistical model, and the translations can be extracted from the
alignment results~\cite{gokhale2013inferring}. This alignment method
allows people to find not only one-to-one but also one-to-many API
translations; the downside is that it requires a large number of API
call sequences to train the underlying statistical model. The other
method relies on natural language features such as Javadoc to find
semantically equivalent API calls~\cite{pandita2015discovering,
nguyen2016mapping, zhong2009inferring}. Since Javadoc contains
descriptions of the nature of API calls, correct translations can be
found by calculating the similarities between the Javadoc texts of two
API calls, which can easily be done with a standard \verb|Word2Vec|
model that calculates semantic similarities between words. The main
drawback of using natural language features as the glue is that it is
difficult to discover one-to-many API translations.

In this thesis, we propose a new algorithm that automates the process
of API refactoring by combining the natural language
technique~\cite{pandita2015discovering} with a state-of-the-art API
call sequence synthesizer called
\verb|Bayou|~\cite{murali2017neural}. The input to our algorithm
consists of an API call sequence and the name of the destination
library, and our algorithm produces a semantically equivalent sequence
that uses only the API calls defined in the destination library. We
solve the problem in two steps. We first translate the input API call
sequence into a set of stand-alone API calls defined in the
destination library, using natural language features as the main
driver~\cite{pandita2015discovering, nguyen2016mapping}. Then we feed
the stand-alone API calls into the API sequence synthesizer
\emph{Bayou}~\cite{murali2017neural}, which in turn synthesizes a
complete sequence of API calls. We have designed a series of benchmark
problems to evaluate the accuracy of our API refactoring algorithm,
where accuracy is defined as the percentage of correctly generated API
calls. The results show that our algorithm is able to refactor API
call sequences accurately, provided that the two libraries involved
have similar coding practices and the input sequence is not rare in
the training data.
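
As a concrete but hypothetical illustration (the libraries and calls
below are chosen for exposition and are not drawn from our
benchmarks), refactoring a file-reading sequence from \verb|java.io|
to \verb|java.nio| might look as follows:

\begin{verbatim}
// Input sequence (java.io):
FileReader fr = new FileReader(path);
BufferedReader br = new BufferedReader(fr);
String line = br.readLine();
br.close();

// Output sequence (java.nio.file):
List<String> lines = Files.readAllLines(Paths.get(path));
String line = lines.get(0);
\end{verbatim}

Here the first step would map each \verb|java.io| call to candidate
stand-alone \verb|java.nio| calls using natural language features, and
the second step would let Bayou assemble those calls into a
well-formed sequence.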
```