\chapter{Introduction}
\label{ch:intro}

With advances in technologies such as artificial intelligence and the
expansion of high-tech companies, computer programming has become an
essential skill, and the demand for programmers has grown dramatically
in the past few years. Overall productivity has risen significantly
thanks to the growing number of programmers, but we have yet to
witness a comparable boost in individual programming productivity.

The most important reason is that programming is a difficult task. It
requires programmers to deal with extremely low-level details in
complex software projects, and small mistakes are almost inevitable;
indeed, people tend to assume that a piece of untested software does
not function properly. To address this problem, both software
engineering techniques and formal-method-based techniques have been
proposed to facilitate programming, including various software
engineering methodologies, design patterns, sophisticated testing
methods, program repair algorithms, model checking algorithms, and
program synthesis methods. Some of these, such as software engineering
methodologies, design patterns, and unit testing, have proven
practical and useful in boosting programming productivity, and the
industry has been adopting them for more than a decade. The main
reason for their popularity and longevity is that they are quite easy
for average programmers to apply. However, one dominant problem with
these software engineering approaches is that they are not rigorous
enough: if the specification of a method is not followed strictly, its
benefits tend to be diminished. More advanced methods with more rules
have been proposed, but their specifications are sometimes vague,
which makes them difficult to apply.

Some researchers have instead turned to formal methods to tackle the
difficulties of programming. Methods such as model checking and
program synthesis are much more rigorous than traditional software
engineering techniques, and their benefits are guaranteed once
everything works as intended. However, the impact of these
formal-method techniques has been much smaller than the influence of
software engineering techniques, because a formal-method-based
approach is likely to fail on large inputs: it may not terminate or
produce any useful result due to its large search space. These large
search spaces are inevitable, since formal-method techniques typically
attack problems that are extremely complex in theory. Nevertheless,
people have been trying to make formal-method approaches practical by
introducing additional hints~\cite{Srivastava2012} or by restricting
the problem domain~\cite{Gulwani2011spreadsheet, Gulwani2011,
Gulwani2010}.

With the advent of ``big data'', researchers started to pay attention
to problems that were previously considered difficult or impossible,
which led to significant advances in machine learning. Similarly, as
more and more open-source repositories such as \verb|Google Code|,
\verb|Github| and \verb|SourceForge| have come online, making
thousands of software projects and their source code available,
researchers from the programming language community have also started
using ``big code'' to tackle problems once considered
intractable. With the help of ``big code'', many new
formal-method-based techniques that aim to facilitate programming have
been proposed, including program property prediction~\cite{mishne12,
Raychev2015}, API sequence prediction~\cite{Raychev2014,
murali2017neural, murali2017bayesian}, and small program
generation~\cite{balog2016deepcoder}. Researchers have shown that
using data can indeed make program synthesis
feasible~\cite{balog2016deepcoder}, and practical tools that help
human developers have started to appear and to be used in
practice~\cite{Raychev2015, murali2017neural}.

Two major types of algorithms are used in the current literature on
applying formal methods to software engineering. The first is based on
combinatorial search, which plays an important role in model checking
and traditional program synthesis~\cite{Manna1992, rajeev2013,
lezama06, Long2015, Douskos2015, Pnueli1989, Alur2015, Feser2015,
Gulwani2010}. The main idea is to first define a goal and the steps
for reaching it; programmers can then let the computer search for a
solution. Typically, heuristics are defined to reduce the search space
and speed up the search. The advantages of search-based methods are
that (1) they are relatively easy to implement and can be applied to
problems for which no efficient algorithms exist, (2) they can
sometimes discover solutions that are hard for humans to think of,
because computers can explore a large search space far more quickly
than humans, and (3) they can solve problems that require precision,
which is typically needed when analyzing computer programs. As SAT and
SMT solvers have matured, people have been able to use these fast
solvers to gain significant performance boosts. The biggest drawback
of search-based methods is their high algorithmic complexity: the
search space explodes as the input size grows, which is the main
reason most traditional model checking and program synthesis
algorithms cannot handle large programs~\cite{Gulwani2010}. Another
drawback worth mentioning is that search-based methods tend to be
quite fragile; they typically require the input at every step to be
extremely precise, or the algorithm will not perform as expected.

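To make the idea of enumerative combinatorial search concrete, the
following sketch synthesizes a small arithmetic expression over a
grammar of variables, constants, and binary operators until it matches
a set of input/output examples. It is a minimal illustration of the
search paradigm, not the algorithm of any system cited above; the
grammar and depth bound are assumptions chosen for brevity.

```python
from itertools import product

def enumerate_exprs(depth):
    """Yield (description, function) pairs for expressions up to `depth`."""
    # Depth 0: the input variable and a couple of small constants.
    terminals = [("x", lambda x: x)] + [(str(c), lambda x, c=c: c) for c in (1, 2)]
    if depth == 0:
        yield from terminals
        return
    smaller = list(enumerate_exprs(depth - 1))
    yield from smaller  # every shallower expression is also a candidate
    ops = [("+", lambda a, b: a + b), ("*", lambda a, b: a * b)]
    for (d1, f1), (d2, f2) in product(smaller, repeat=2):
        for name, op in ops:
            yield (f"({d1} {name} {d2})",
                   lambda x, f1=f1, f2=f2, op=op: op(f1(x), f2(x)))

def synthesize(examples, max_depth=2):
    """Return the first expression consistent with every (input, output) pair."""
    for depth in range(max_depth + 1):
        for desc, f in enumerate_exprs(depth):
            if all(f(i) == o for i, o in examples):
                return desc
    return None

# Goal: a function mapping 1 -> 3, 2 -> 5, 3 -> 7 (i.e., 2*x + 1).
print(synthesize([(1, 3), (2, 5), (3, 7)]))
```

The depth bound is the crude "heuristic" here; real synthesizers prune
far more aggressively, and the exponential growth of the candidate set
with depth is exactly the scalability problem discussed above.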
The second type of algorithm is based on learning. The idea is to let
a machine improve its performance on a task using data; in the
process, learning-based methods can capture idioms that are essential
to solving the problem and that are typically hard for humans to
express or discover. Large amounts of data did not become available
online until around 2012, after which researchers started applying
learning-based methods to programming systems~\cite{mishne12,
Raychev2015, Raychev2014, murali2017neural, murali2017bayesian,
balog2016deepcoder}. The biggest advantage of ``big data'' or ``big
code'' is that it allows researchers to use machine learning to find
idioms that reduce the search space significantly. Examples include
relationships between variable names and their semantic information,
and API call sequence idioms; such idioms cannot be discovered without
analyzing a large amount of data. Another advantage over search-based
methods is robustness, because machine learning algorithms operate on
large datasets in which small amounts of noise are suppressed. Even
though data-driven programming systems have been quite impactful,
learning-based methods are not as accessible as search-based methods
because they require data: performing well in practice typically
demands a large dataset, which in turn consumes time and computational
resources that are not available to everyone.

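As a toy illustration of what an "API call sequence idiom" is, the
sketch below counts bigrams of API calls over a corpus of call
sequences and uses the counts to rank likely successors of a given
call. The corpus and call names are made up for illustration; real
systems mine millions of methods and use far richer statistical
models.

```python
from collections import Counter

# A made-up corpus of API call sequences, one list per method body.
corpus = [
    ["open", "read", "close"],
    ["open", "read", "read", "close"],
    ["open", "write", "close"],
]

# Count adjacent pairs of calls: these are the bigram "idioms".
bigrams = Counter(
    (seq[i], seq[i + 1]) for seq in corpus for i in range(len(seq) - 1)
)

def rank_next(call):
    """Rank candidate next calls after `call` by corpus frequency."""
    candidates = {b: n for (a, b), n in bigrams.items() if a == call}
    return sorted(candidates, key=candidates.get, reverse=True)

print(rank_next("open"))  # successors of "open", most frequent first
```

A synthesizer can consult such rankings to explore likely candidates
first, which is one simple way mined idioms shrink the effective
search space.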
In this thesis, we propose two additional corpus-driven systems that
aim to automate the processes of software reuse and software
refactoring. Both problems have been considered in the literature, but
no existing system fully automates them, and some state-of-the-art
tools~\cite{Barr2015, balaban2005refactoring} still require humans to
provide additional hints. By using a large code corpus, our systems
fully automate software reuse and refactoring without human
intervention, accomplish these tasks efficiently, and help human
developers boost their programming productivity.

\section{Program reuse via splicing}

We first introduce {\em program splicing}, a programming system that
helps human developers by automating the process of software
reuse. The most popular reuse workflow today consists of copying,
pasting, and modifying code found online; it dominates because it is
relatively easy to carry out with the help of internet
search. However, this process inherits the drawbacks of programming
itself: like ordinary programming, it demands extreme precision and
care. When a reuse task takes place in a large, complicated software
system, the cost of making mistakes and spending enormous time
repairing them may exceed the benefit. Worse, programmers sometimes do
not even try to fully understand the code they bring in from the
internet, as long as it appears to work in their specific software
environment, which can threaten their future development progress.

Existing techniques that inspire our method fall into two areas:
search-based program synthesis and data-driven methods. The problem of
program synthesis has been studied for decades, and researchers have
long applied search-based methods to it~\cite{Pnueli1989, lezama06,
Srivastava2012, Alur2015, Feser2015, yaghmazadeh2016}. The main
benefit with respect to this work is that search-based methods can
produce results that require precision. This is crucial when we aim to
generate code snippets that must interact with pre-written software,
for example by matching variables that are semantically similar or
equivalent. However, search-based methods do not scale to large
inputs, which lead to large search spaces due to the complexity of the
problem; this is the main reason one of the competing systems,
$\mu$Scalpel, is not as efficient as our splicing method. To alleviate
the scalability problem, researchers have shown that using ``big
data'' can be quite effective~\cite{Raychev2015, Raychev2014,
raychev2016, balog2016deepcoder, hindle2012naturalness}. Even though
our splicing method does not use any statistical model, we still
reduce the search space significantly and achieve high efficiency by
using natural language to search a big code corpus~\cite{kashyap17}.

One novelty of this work is that we combine ideas from search-based
and data-driven methods. To use our programming system for program
reuse, a programmer starts by writing a ``draft'' that mixes
unfinished code, natural language comments, and correctness
requirements. A program synthesizer that interacts with a large,
searchable database of program snippets then automatically completes
the draft into a program that meets the requirements. The synthesis
happens in two stages. First, the synthesizer identifies a small
number of programs in the database~\cite{zou2018plinycompute} that are
relevant to the synthesis task. Next, it uses an enumerative search to
systematically fill the draft with expressions and statements from
these relevant programs. The resulting program is returned to the
programmer, who can modify it and possibly invoke additional rounds of
synthesis.

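The two-stage workflow just described can be sketched in miniature:
stage one retrieves candidate snippets by matching the draft's natural
language comment against a corpus, and stage two tries each candidate
in the draft's hole until the correctness requirement holds. The corpus,
snippet strings, and checking mechanism below are toy assumptions, not
the actual system's database or search engine.

```python
# A toy snippet corpus: natural language description -> candidate bodies.
corpus = {
    "reverse a string": ["s[::-1]", "''.join(reversed(s))"],
    "sum a list": ["sum(xs)"],
    "uppercase a string": ["s.upper()"],
}

def retrieve(query):
    """Stage 1: keyword-match the draft's comment against the corpus."""
    words = set(query.lower().split())
    return [snippet
            for desc, snippets in corpus.items()
            if words & set(desc.split())
            for snippet in snippets]

def splice(query, check):
    """Stage 2: try each retrieved candidate until the requirement holds."""
    for candidate in retrieve(query):
        try:
            if check(candidate):
                return candidate
        except Exception:
            continue  # candidate does not even evaluate; skip it
    return None

# Draft comment: "reverse a string"; requirement: on s = "abc", get "cba".
result = splice("reverse a string",
                lambda body: eval(body, {"s": "abc"}) == "cba")
print(result)
```

The point of the two stages is division of labor: cheap retrieval
narrows millions of snippets to a handful, so the expensive
enumerative check only runs on relevant candidates.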
We present an implementation of program splicing, called \system, for
the Java programming language. \system uses a corpus of over 3.5
million procedures drawn from an open-source software repository. Our
evaluation applies the system to a suite of everyday programming tasks
and includes a comparison with a state-of-the-art competing
approach~\cite{Barr2015} as well as a user study. The results point to
the broad scope and scalability of program splicing and indicate that
the approach can significantly boost programmer productivity.

\section{API refactoring using natural language and API synthesizer}

Software refactoring typically involves restructuring existing source
code without modifying its functionality. It is an important, almost
daily routine that programmers perform to keep their software projects
clean and organized: constructing better abstractions, deleting
duplicate code, breaking large functionality into small, universally
applicable pieces, and so on. Such maintenance is crucial, because a
software system can easily deteriorate and become obsolete if it is
not maintained properly and regularly, especially since the external
libraries and underlying systems it depends on now evolve
rapidly. After several decades of software development, most
professional programmers have realized the importance of refactoring,
and it is used heavily and regularly in industry. Like software reuse,
refactoring inherits the drawbacks of programming: it requires extreme
accuracy, and programmers tend to make mistakes when dealing with
large, complex systems that involve tracking tens or even hundreds of
variables and function components.

In this thesis, we focus on refactoring Application Programming
Interface (API) call sequences. An API consists of all the definitions
and usages of the resources a software system exposes for external
use, and almost all software today is built using APIs from other
systems. API refactoring mainly consists of changing an API call
sequence defined in one library into a sequence defined in another
library. Its benefits are those of general software refactoring, plus
two specific ones. First, it lets programmers reuse obsolete programs
by adapting them to the current programming environment. Second, it
can enhance the performance of an existing program by refactoring it
to use more advanced libraries and platforms, which typically perform
better.

The main difficulty of API refactoring lies in discovering
semantically equivalent API calls between two libraries and in
instantiating the new API calls with the environment's variables so
that the resulting call sequence does not alter the functionality of
the original. One of the earliest works~\cite{balaban2005refactoring}
that aims to help with API refactoring requires human intervention:
the user must formally specify the mapping between the API calls of
the two libraries, and the system focuses only on refactoring
\emph{individual} API calls rather than sequences. Subsequent research
in this area has been limited to the problem of API mapping, or API
translation, whose goal is to discover pairs of semantically
equivalent API calls. Two types of methods have been developed for API
translation. The first aligns two API call sequences using a
statistical model and extracts translations from the alignment
results~\cite{gokhale2013inferring}. This alignment method can find
not only one-to-one but also one-to-many API translations; the
downside is that it requires a large number of API call sequences to
train the underlying statistical model. The other method relies on
natural language features such as Javadoc to find semantically
equivalent API calls~\cite{pandita2015discovering, nguyen2016mapping,
zhong2009inferring}. Since Javadoc describes the behavior of API
calls, correct translations can be found by computing the similarity
between the Javadoc texts of two API calls, which can easily be done
with a standard \verb|Word2Vec| model that captures semantic
similarity between words. The main drawback of using natural language
features as the glue is that it is difficult to discover one-to-many
API translations.

In this thesis, we propose a new algorithm that automates API
refactoring by combining a natural language
technique~\cite{pandita2015discovering} with a state-of-the-art API
call sequence synthesizer called
\verb|Bayou|~\cite{murali2017neural}. The input to our algorithm is an
API call sequence and the name of the destination library; the output
is a semantically equivalent sequence that uses only API calls defined
in the destination library. We solve the problem in two steps. First,
we translate the input API call sequence into a set of stand-alone API
calls defined in the destination library, using natural language
features as the main driver~\cite{pandita2015discovering,
nguyen2016mapping}. Then we feed the stand-alone API calls into the
API sequence synthesizer \verb|Bayou|~\cite{murali2017neural}, which
in turn synthesizes a complete sequence of API calls. We have designed
a series of benchmark problems to evaluate the accuracy of our API
refactoring algorithm, where accuracy is defined as the percentage of
correctly generated API calls. The results show that our algorithm can
refactor API call sequences accurately, provided that the two
libraries have similar coding practices and the input sequence is not
rare in the training data.
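The two-step pipeline can be sketched end to end. Here the translation
table plays the role of the natural-language matching step, and a toy
ordering function stands in for the sequence synthesizer; neither is
\verb|Bayou| or a real mapping model, and every API name is invented
for illustration.

```python
# Step 1 stand-in: per-call translations discovered (in a real system)
# by documentation similarity between the two libraries.
translation = {
    "OldHttp.connect": "NewHttp.open",
    "OldHttp.send": "NewHttp.request",
    "OldHttp.disconnect": "NewHttp.close",
}

# Step 2 stand-in: a "synthesizer" that orders the stand-alone calls
# using idiomatic positions (a real synthesizer generates full code).
idiom_position = {"NewHttp.open": 0, "NewHttp.request": 1, "NewHttp.close": 2}

def refactor(sequence):
    """Translate each call, then synthesize a well-ordered sequence."""
    stand_alone = {translation[call] for call in sequence}   # step 1
    return sorted(stand_alone, key=idiom_position.get)       # step 2

print(refactor(["OldHttp.connect", "OldHttp.send", "OldHttp.disconnect"]))
```

The key design point this sketch preserves is the division of labor:
the translation step only needs to find stand-alone equivalents, and
the synthesizer is responsible for recovering a coherent sequence.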