\chapter{Introduction}
\label{ch:intro}

With advances in technologies such as artificial intelligence and the
expansion of high-tech companies, computer programming has become an
essential skill, and the demand for programmers has grown dramatically
in the past few years. Overall productivity has risen significantly
thanks to the growing number of programmers, but we have yet to
witness a comparable boost in individual programming productivity.

The most important reason is that programming is a difficult task. It
requires programmers to deal with extremely low-level details in
complex software projects, and small mistakes are almost inevitable;
indeed, people tend to assume that a piece of untested software does
not function properly. To address this problem, both software
engineering techniques and formal-method-based techniques have been
proposed to facilitate programming, including various software
engineering methodologies, design patterns, sophisticated testing
methods, program repair algorithms, model checking algorithms, and
program synthesis methods. Some of these, such as software engineering
methodologies, design patterns, and unit testing, have proven
practical and useful in boosting programming productivity, and the
industry has been adopting them for more than a decade. The main
reason for their popularity and longevity is that they are quite easy
for average programmers to apply. However, one dominant problem with
these software engineering approaches is that they are not rigorous
enough: if the specification of a method is not followed strictly, its
benefits tend to be diminished. More advanced methods with more rules
have been proposed, but their specifications are sometimes vague,
which makes them difficult to apply.

Some researchers have instead turned to formal methods to tackle the
difficulties of programming. Methods such as model checking and
program synthesis are much more rigorous than traditional software
engineering techniques, and their benefits are guaranteed once
everything works as intended. However, the impact of these
formal-method techniques has been much smaller than the influence of
software engineering techniques, because a formal-method-based
approach is likely to fail on large inputs: it may not terminate or
produce any useful result due to its large search space. These large
search spaces are inevitable, since formal-method techniques typically
attack problems that are extremely complex in theory. Nevertheless,
people have been trying to make formal-method approaches practical by
introducing additional hints~\cite{Srivastava2012} or by restricting
the problem domain~\cite{Gulwani2011spreadsheet, Gulwani2011,
Gulwani2010}.

With the advent of ``big data'', researchers started to pay attention
to problems that were previously considered difficult or impossible,
which led to significant advances in machine learning. Similarly, as
more and more open-source repositories such as \verb|Google Code|,
\verb|Github| and \verb|SourceForge| have come online, making
thousands of software projects and their source code available,
researchers from the programming language community have also started
using ``big code'' to tackle problems once considered
intractable. With the help of ``big code'', many new
formal-method-based techniques that aim to facilitate programming have
been proposed, including program property prediction~\cite{mishne12,
Raychev2015}, API sequence prediction~\cite{Raychev2014,
murali2017neural, murali2017bayesian}, and small program
generation~\cite{balog2016deepcoder}. Researchers have shown that
using data can indeed make program synthesis
feasible~\cite{balog2016deepcoder}, and practical tools that help
human developers have started to appear and to be used in
practice~\cite{Raychev2015, murali2017neural}.

Two major types of algorithms are used in the current literature on
applying formal methods to software engineering. The first is based on
combinatorial search, which plays an important role in model checking
and traditional program synthesis~\cite{Manna1992, rajeev2013,
lezama06, Long2015, Douskos2015, Pnueli1989, Alur2015, Feser2015,
Gulwani2010}. The main idea is to first define a goal and the steps
for reaching it; programmers can then let the computer search for a
solution. Typically, heuristics are defined to reduce the search space
and speed up the search. The advantages of search-based methods are
that (1) they are relatively easy to implement and can be applied to
problems for which no efficient algorithms exist, (2) they can
sometimes discover solutions that are hard for humans to think of,
because computers can explore a large search space far more quickly
than humans, and (3) they can solve problems that require precision,
which is typically needed when analyzing computer programs. As SAT and
SMT solvers have matured, people have been able to use these fast
solvers to gain significant performance boosts. The biggest drawback
of search-based methods is their high algorithmic complexity: the
search space explodes as the input size grows, which is the main
reason most traditional model checking and program synthesis
algorithms cannot handle large programs~\cite{Gulwani2010}. Another
drawback worth mentioning is that search-based methods tend to be
quite fragile; they typically require the input at every step to be
extremely precise, or the algorithm will not perform as expected.

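To make the idea of enumerative combinatorial search concrete, the
following sketch synthesizes a small arithmetic expression over a
grammar of variables, constants, and binary operators until it matches
a set of input/output examples. It is a minimal illustration of the
search paradigm, not the algorithm of any system cited above; the
grammar and depth bound are assumptions chosen for brevity.

```python
from itertools import product

def enumerate_exprs(depth):
    """Yield (description, function) pairs for expressions up to `depth`."""
    # Depth 0: the input variable and a couple of small constants.
    terminals = [("x", lambda x: x)] + [(str(c), lambda x, c=c: c) for c in (1, 2)]
    if depth == 0:
        yield from terminals
        return
    smaller = list(enumerate_exprs(depth - 1))
    yield from smaller  # every shallower expression is also a candidate
    ops = [("+", lambda a, b: a + b), ("*", lambda a, b: a * b)]
    for (d1, f1), (d2, f2) in product(smaller, repeat=2):
        for name, op in ops:
            yield (f"({d1} {name} {d2})",
                   lambda x, f1=f1, f2=f2, op=op: op(f1(x), f2(x)))

def synthesize(examples, max_depth=2):
    """Return the first expression consistent with every (input, output) pair."""
    for depth in range(max_depth + 1):
        for desc, f in enumerate_exprs(depth):
            if all(f(i) == o for i, o in examples):
                return desc
    return None

# Goal: a function mapping 1 -> 3, 2 -> 5, 3 -> 7 (i.e., 2*x + 1).
print(synthesize([(1, 3), (2, 5), (3, 7)]))
```

The depth bound is the crude "heuristic" here; real synthesizers prune
far more aggressively, and the exponential growth of the candidate set
with depth is exactly the scalability problem discussed above.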
The second type of algorithm is based on learning. The idea is to let
a machine improve its performance on a task using data; in the
process, learning-based methods can capture idioms that are essential
to solving the problem and that are typically hard for humans to
express or discover. Large amounts of data did not become available
online until around 2012, after which researchers started applying
learning-based methods to programming systems~\cite{mishne12,
Raychev2015, Raychev2014, murali2017neural, murali2017bayesian,
balog2016deepcoder}. The biggest advantage of ``big data'' or ``big
code'' is that it allows researchers to use machine learning to find
idioms that reduce the search space significantly. Examples include
relationships between variable names and their semantic information,
and API call sequence idioms; such idioms cannot be discovered without
analyzing a large amount of data. Another advantage over search-based
methods is robustness, because machine learning algorithms operate on
large datasets in which small amounts of noise are suppressed. Even
though data-driven programming systems have been quite impactful,
learning-based methods are not as accessible as search-based methods
because they require data: performing well in practice typically
demands a large dataset, which in turn consumes time and computational
resources that are not available to everyone.

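As a toy illustration of what an "API call sequence idiom" is, the
sketch below counts bigrams of API calls over a corpus of call
sequences and uses the counts to rank likely successors of a given
call. The corpus and call names are made up for illustration; real
systems mine millions of methods and use far richer statistical
models.

```python
from collections import Counter

# A made-up corpus of API call sequences, one list per method body.
corpus = [
    ["open", "read", "close"],
    ["open", "read", "read", "close"],
    ["open", "write", "close"],
]

# Count adjacent pairs of calls: these are the bigram "idioms".
bigrams = Counter(
    (seq[i], seq[i + 1]) for seq in corpus for i in range(len(seq) - 1)
)

def rank_next(call):
    """Rank candidate next calls after `call` by corpus frequency."""
    candidates = {b: n for (a, b), n in bigrams.items() if a == call}
    return sorted(candidates, key=candidates.get, reverse=True)

print(rank_next("open"))  # successors of "open", most frequent first
```

A synthesizer can consult such rankings to explore likely candidates
first, which is one simple way mined idioms shrink the effective
search space.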
In this thesis, we propose two additional corpus-driven systems that
aim to automate the processes of software reuse and software
refactoring. Both problems have been considered in the literature, but
no existing system fully automates them, and some state-of-the-art
tools~\cite{Barr2015, balaban2005refactoring} still require humans to
provide additional hints. By using a large code corpus, our systems
fully automate software reuse and refactoring without human
intervention, accomplish these tasks efficiently, and help human
developers boost their programming productivity.

\section{Program reuse via splicing}

We first introduce {\em program splicing}, a programming system that
helps human developers by automating the process of software
reuse. The most popular reuse workflow today consists of copying,
pasting, and modifying code found online; it dominates because it is
relatively easy to carry out with the help of internet
search. However, this process inherits the drawbacks of programming
itself: like ordinary programming, it demands extreme precision and
care. When a reuse task takes place in a large, complicated software
system, the cost of making mistakes and spending enormous time
repairing them may exceed the benefit. Worse, programmers sometimes do
not even try to fully understand the code they bring in from the
internet, as long as it appears to work in their specific software
environment, which can threaten their future development progress.

Existing techniques that inspire our method fall into two areas:
search-based program synthesis and data-driven methods. The problem of
program synthesis has been studied for decades, and researchers have
long applied search-based methods to it~\cite{Pnueli1989, lezama06,
Srivastava2012, Alur2015, Feser2015, yaghmazadeh2016}. The main
benefit with respect to this work is that search-based methods can
produce results that require precision. This is crucial when we aim to
generate code snippets that must interact with pre-written software,
for example by matching variables that are semantically similar or
equivalent. However, search-based methods do not scale to large
inputs, which lead to large search spaces due to the complexity of the
problem; this is the main reason one of the competing systems,
$\mu$Scalpel, is not as efficient as our splicing method. To alleviate
the scalability problem, researchers have shown that using ``big
data'' can be quite effective~\cite{Raychev2015, Raychev2014,
raychev2016, balog2016deepcoder, hindle2012naturalness}. Even though
our splicing method does not use any statistical model, we still
reduce the search space significantly and achieve high efficiency by
using natural language to search a big code corpus~\cite{kashyap17}.

One novelty of this work is that we combine ideas from search-based
and data-driven methods. To use our programming system for program
reuse, a programmer starts by writing a ``draft'' that mixes
unfinished code, natural language comments, and correctness
requirements. A program synthesizer that interacts with a large,
searchable database of program snippets then automatically completes
the draft into a program that meets the requirements. The synthesis
happens in two stages. First, the synthesizer identifies a small
number of programs in the database~\cite{zou2018plinycompute} that are
relevant to the synthesis task. Next, it uses an enumerative search to
systematically fill the draft with expressions and statements from
these relevant programs. The resulting program is returned to the
programmer, who can modify it and possibly invoke additional rounds of
synthesis.

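The two-stage workflow just described can be sketched in miniature:
stage one retrieves candidate snippets by matching the draft's natural
language comment against a corpus, and stage two tries each candidate
in the draft's hole until the correctness requirement holds. The corpus,
snippet strings, and checking mechanism below are toy assumptions, not
the actual system's database or search engine.

```python
# A toy snippet corpus: natural language description -> candidate bodies.
corpus = {
    "reverse a string": ["s[::-1]", "''.join(reversed(s))"],
    "sum a list": ["sum(xs)"],
    "uppercase a string": ["s.upper()"],
}

def retrieve(query):
    """Stage 1: keyword-match the draft's comment against the corpus."""
    words = set(query.lower().split())
    return [snippet
            for desc, snippets in corpus.items()
            if words & set(desc.split())
            for snippet in snippets]

def splice(query, check):
    """Stage 2: try each retrieved candidate until the requirement holds."""
    for candidate in retrieve(query):
        try:
            if check(candidate):
                return candidate
        except Exception:
            continue  # candidate does not even evaluate; skip it
    return None

# Draft comment: "reverse a string"; requirement: on s = "abc", get "cba".
result = splice("reverse a string",
                lambda body: eval(body, {"s": "abc"}) == "cba")
print(result)
```

The point of the two stages is division of labor: cheap retrieval
narrows millions of snippets to a handful, so the expensive
enumerative check only runs on relevant candidates.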
We present an implementation of program splicing, called \system, for
the Java programming language. \system uses a corpus of over 3.5
million procedures drawn from an open-source software repository. Our
evaluation applies the system to a suite of everyday programming tasks
and includes a comparison with a state-of-the-art competing
approach~\cite{Barr2015} as well as a user study. The results point to
the broad scope and scalability of program splicing and indicate that
the approach can significantly boost programmer productivity.

\section{API refactoring using natural language and API synthesizer}

Software refactoring typically involves restructuring existing source
code without modifying its functionality. It is an important, almost
daily routine that programmers perform to keep their software projects
clean and organized: constructing better abstractions, deleting
duplicate code, breaking large functionality into small, universally
applicable pieces, and so on. Such maintenance is crucial, because a
software system can easily deteriorate and become obsolete if it is
not maintained properly and regularly, especially since the external
libraries and underlying systems it depends on now evolve
rapidly. After several decades of software development, most
professional programmers have realized the importance of refactoring,
and it is used heavily and regularly in industry. Like software reuse,
refactoring inherits the drawbacks of programming: it requires extreme
accuracy, and programmers tend to make mistakes when dealing with
large, complex systems that involve tracking tens or even hundreds of
variables and function components.

In this thesis, we focus on refactoring Application Programming
Interface (API) call sequences. An API consists of all the definitions
and usages of the resources a software system exposes for external
use, and almost all software today is built using APIs from other
systems. API refactoring mainly consists of changing an API call
sequence defined in one library into a sequence defined in another
library. Its benefits are those of general software refactoring, plus
two specific ones. First, it lets programmers reuse obsolete programs
by adapting them to the current programming environment. Second, it
can enhance the performance of an existing program by refactoring it
to use more advanced libraries and platforms, which typically perform
better.

The main difficulty of API refactoring lies in discovering
semantically equivalent API calls between two libraries and in
instantiating the new API calls with the environment's variables so
that the resulting call sequence does not alter the functionality of
the original. One of the earliest works~\cite{balaban2005refactoring}
that aims to help with API refactoring requires human intervention:
the user must formally specify the mapping between the API calls of
the two libraries, and the system focuses only on refactoring
\emph{individual} API calls rather than sequences. Subsequent research
in this area has been limited to the problem of API mapping, or API
translation, whose goal is to discover pairs of semantically
equivalent API calls. Two types of methods have been developed for API
translation. The first aligns two API call sequences using a
statistical model and extracts translations from the alignment
results~\cite{gokhale2013inferring}. This alignment method can find
not only one-to-one but also one-to-many API translations; the
downside is that it requires a large number of API call sequences to
train the underlying statistical model. The other method relies on
natural language features such as Javadoc to find semantically
equivalent API calls~\cite{pandita2015discovering, nguyen2016mapping,
zhong2009inferring}. Since Javadoc describes the behavior of API
calls, correct translations can be found by computing the similarity
between the Javadoc texts of two API calls, which can easily be done
with a standard \verb|Word2Vec| model that captures semantic
similarity between words. The main drawback of using natural language
features as the glue is that it is difficult to discover one-to-many
API translations.

In this thesis, we propose a new algorithm that automates API
refactoring by combining a natural language
technique~\cite{pandita2015discovering} with a state-of-the-art API
call sequence synthesizer called
\verb|Bayou|~\cite{murali2017neural}. The input to our algorithm is an
API call sequence and the name of the destination library; the output
is a semantically equivalent sequence that uses only API calls defined
in the destination library. We solve the problem in two steps. First,
we translate the input API call sequence into a set of stand-alone API
calls defined in the destination library, using natural language
features as the main driver~\cite{pandita2015discovering,
nguyen2016mapping}. Then we feed the stand-alone API calls into the
API sequence synthesizer \verb|Bayou|~\cite{murali2017neural}, which
in turn synthesizes a complete sequence of API calls. We have designed
a series of benchmark problems to evaluate the accuracy of our API
refactoring algorithm, where accuracy is defined as the percentage of
correctly generated API calls. The results show that our algorithm can
refactor API call sequences accurately, provided that the two
libraries have similar coding practices and the input sequence is not
rare in the training data.
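The two-step pipeline can be sketched end to end. Here the translation
table plays the role of the natural-language matching step, and a toy
ordering function stands in for the sequence synthesizer; neither is
\verb|Bayou| or a real mapping model, and every API name is invented
for illustration.

```python
# Step 1 stand-in: per-call translations discovered (in a real system)
# by documentation similarity between the two libraries.
translation = {
    "OldHttp.connect": "NewHttp.open",
    "OldHttp.send": "NewHttp.request",
    "OldHttp.disconnect": "NewHttp.close",
}

# Step 2 stand-in: a "synthesizer" that orders the stand-alone calls
# using idiomatic positions (a real synthesizer generates full code).
idiom_position = {"NewHttp.open": 0, "NewHttp.request": 1, "NewHttp.close": 2}

def refactor(sequence):
    """Translate each call, then synthesize a well-ordered sequence."""
    stand_alone = {translation[call] for call in sequence}   # step 1
    return sorted(stand_alone, key=idiom_position.get)       # step 2

print(refactor(["OldHttp.connect", "OldHttp.send", "OldHttp.disconnect"]))
```

The key design point this sketch preserves is the division of labor:
the translation step only needs to find stand-alone equivalents, and
the synthesizer is responsible for recovering a coherent sequence.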