---
type: academic
category: academic
person: Yanxin Lu
date: 2018
source: lu_slides.pdf
---

# API Refactoring Using Natural Language and Program Synthesis

**Yanxin Lu, Rice University**
**Swarat Chaudhuri, Rice University**
**Christopher Jermaine, Rice University**

---

## Slide 1: Title Slide

(blank/title page)

---

## Slide 2: Title

API refactoring using natural language and program synthesis

Yanxin Lu, Rice University
Swarat Chaudhuri, Rice University
Christopher Jermaine, Rice University

---

## Slide 3: Software Refactoring

- Library/platform upgrade
- Obsolete code reuse

### Example (Before - SSHJ):

```java
SSHClient ssh = new SSHClient();
ssh.connect(host);                     // connect before authenticating
ssh.authPassword(username, password);
SFTPClient ftp = ssh.newSFTPClient();  // needs an authenticated session
ftp.ls(path);
ftp.close();
```

(Arrow: API refactoring)

### Example (After - Apache):

```java
FTPClient f = new FTPClient();
f.connect(host);
f.login(username, password);
FTPFile[] files = f.listFiles(path);
f.disconnect();
```

- Almost as hard as writing the code from scratch

---

## Slide 4: Problem

Can we automate the process of API refactoring using program synthesis?

---

## Slide 5: Contribution

- Combination of two existing techniques
  - API translation
    - Natural language
  - API sequence synthesis
    - Complete API sequence
    - Bayou

---

## Slide 6: Related Work

- API mapping
  - Natural language
  - Sequence alignment
- API sequence synthesis
  - Learning from the web (SWIM)
  - Bayou

References:
- Raghothaman, Mukund, Yi Wei, and Youssef Hamadi. "SWIM: Synthesizing What I Mean - Code Search and Idiomatic Snippet Synthesis." *2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE).* IEEE, 2016.
- Murali, Vijayaraghavan, et al. "Neural Sketch Learning for Conditional Program Generation." *arXiv preprint arXiv:1703.05698* (2017).

---

## Slide 7: Algorithm

- A() --> (API translation) --> a() --> (API synthesis) --> a()
- B() --> b() --> b()
- C() --> c() --> c()
- D() --> d() --> d()
- E() --> e() --> e()
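
The two stages in the diagram can be read as a simple composition: translate each source call, then pass the translated calls to a synthesizer. The sketch below is illustrative only. `RefactoringPipeline`, `translate`, and `refactor` are hypothetical names, a lookup table stands in for the translator, and the synthesizer is supplied by the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class RefactoringPipeline {
    // Stage 1 (hypothetical): map each source-library call to a target-library call.
    static List<String> translate(List<String> sourceCalls, Map<String, String> mapping) {
        List<String> translated = new ArrayList<>();
        for (String call : sourceCalls) {
            String target = mapping.get(call);
            if (target != null) {   // calls without a known counterpart are dropped
                translated.add(target);
            }
        }
        return translated;
    }

    // Stage 2: hand the translated calls to whatever synthesizer the caller provides.
    static List<String> refactor(List<String> sourceCalls,
                                 Map<String, String> mapping,
                                 Function<List<String>, List<String>> synthesize) {
        return synthesize.apply(translate(sourceCalls, mapping));
    }
}
```

Dropping untranslatable calls is a design choice of this sketch; it mirrors the idea that the synthesizer may have to fill in APIs the translator missed.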

---

## Slide 8: Algorithm (Highlighted: API translation)

Same diagram as Slide 7, with the "API translation" step highlighted in a red box.

---

## Slide 9: API Translation

### Architecture:

1. All relevant libraries and Java 8 --> Text extraction --> Javadoc cards (e.g., "clean / Parse an html document string", "isValid / Test if the input body HTML has only tags and attributes ...")
2. Train a word2vec model
3. Input API calls: A(), B(), C(), D(), E() --> Translator --> Output: a(), b(), c(), d(), e()

---

## Slide 10: Word2Vec Model

- Captures some degree of semantic information

| Query word | Similar words |
|---|---|
| int | integer, float, long, double, short |
| ftp | nntp, smtp, secret, pixmap, out-of-synch |
| button | rollover, radio, tooltip, checkbox, click |
| index | IndexFrom, MenuIndex, ListIndex, occurrence, nth |
| stream | InputStream, StreamB, BufferTest, console, AccessFile |
| image | gif, animation, texture, BufferedImage, RenderedImage |
| email | bcc, recipient, sender, addresse, mail |
| vector | scalar, dense, product, kernel, matrix |
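
Neighbor lists like the ones in the table are commonly obtained by ranking words by cosine similarity between their embedding vectors. The slide does not state the metric, so cosine similarity is an assumption here, and the three-dimensional vectors are invented for illustration (real word2vec vectors have far more dimensions).

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class EmbeddingNeighbors {
    // Cosine similarity between two vectors of equal length (zero vectors are not handled).
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Return the word, other than the query itself, whose vector is closest to the query's.
    static String nearest(String query, Map<String, double[]> vectors) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> entry : vectors.entrySet()) {
            if (entry.getKey().equals(query)) continue;
            double score = cosine(vectors.get(query), entry.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy vectors, invented for this example only.
        Map<String, double[]> vectors = new LinkedHashMap<>();
        vectors.put("int", new double[] {0.9, 0.1, 0.0});
        vectors.put("integer", new double[] {0.8, 0.2, 0.1});
        vectors.put("button", new double[] {0.0, 0.1, 0.9});
        System.out.println(nearest("int", vectors));
    }
}
```

With these toy vectors, `nearest("int", vectors)` returns `integer`.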

---

## Slide 11: Pair-wise API Similarities

Bipartite graph between Apache and SSHJ APIs:

Apache side: connect, login, list, close
SSHJ side: auth, connect, ls, disconnect

Lines connecting each Apache API to each SSHJ API (showing pair-wise similarity scores), with thicker lines indicating stronger matches (e.g., connect-connect, login-auth, list-ls, close-disconnect).
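
One way to turn pair-wise scores into translation candidates is to rank the target APIs for each source API and keep the best few. The following is a minimal sketch under that assumption; `PairwiseTranslator` and `topK` are hypothetical names, and any scores passed in are whatever the similarity function produced.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class PairwiseTranslator {
    // Rank target APIs by similarity to one source API and keep the best k names.
    static List<String> topK(Map<String, Double> scoresForSource, int k) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scoresForSource.entrySet());
        // Sort by score, highest first.
        entries.sort((x, y) -> Double.compare(y.getValue(), x.getValue()));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}
```

For example, if SSHJ's `ls` scored 0.8 against `list` and lower against every other Apache API (made-up numbers), `topK(scores, 1)` would return only `list`.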

---

## Slide 12: API Similarity

Comparing two APIs:

Left API:
- Name: clean
- Return: TagNode
- Description: "Parse an html document string"

Right API:
- Name: parse
- Return: Document
- Description: "Parse HTML into a Document."

Similarity scores between components:
- Name similarity: 0.7
- Return type similarity: 0.3
- Description similarity: 0.8

- Similarity = w1 * (name similarity) + w2 * (return type similarity) + w3 * (description similarity) = w1 * 0.7 + w2 * 0.3 + w3 * 0.8
- More weight on description similarity
- Word list similarity - bipartite matching
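
The weighted combination and the word-list comparison can be sketched as follows. This is not the authors' implementation: the greedy pairing below is a simple stand-in for an optimal bipartite matching, averaging over the matched pairs is an assumed normalization, and `wordSim` is a hypothetical word-level similarity (for instance, one backed by the word2vec model).

```java
import java.util.List;
import java.util.function.ToDoubleBiFunction;

final class ApiSimilarity {
    // Weighted sum of the name, return type, and description similarities.
    static double combine(double name, double returnType, double description,
                          double w1, double w2, double w3) {
        return w1 * name + w2 * returnType + w3 * description;
    }

    // Greedy stand-in for bipartite matching: repeatedly pair the two most
    // similar unmatched words, then average the scores of the chosen pairs.
    static double wordListSimilarity(List<String> left, List<String> right,
                                     ToDoubleBiFunction<String, String> wordSim) {
        int pairs = Math.min(left.size(), right.size());
        if (pairs == 0) return 0.0;
        boolean[] usedLeft = new boolean[left.size()];
        boolean[] usedRight = new boolean[right.size()];
        double total = 0.0;
        for (int p = 0; p < pairs; p++) {
            int bestI = -1, bestJ = -1;
            double best = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < left.size(); i++) {
                if (usedLeft[i]) continue;
                for (int j = 0; j < right.size(); j++) {
                    if (usedRight[j]) continue;
                    double s = wordSim.applyAsDouble(left.get(i), right.get(j));
                    if (s > best) {
                        best = s;
                        bestI = i;
                        bestJ = j;
                    }
                }
            }
            usedLeft[bestI] = true;
            usedRight[bestJ] = true;
            total += best;
        }
        return total / pairs;
    }
}
```

Greedy pairing can miss the optimal assignment when the best local pair blocks two good pairs; an exact method such as the Hungarian algorithm would avoid that at the cost of more code.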

---

## Slide 13: Algorithm (Highlighted: API synthesis)

Same diagram as Slide 7, with the "API synthesis" step highlighted in a red box.

---

## Slide 14: API Sequence Synthesis

- Input: APIs in the target library
  - Stand-alone APIs
  - Might miss a few APIs
- Output: a complete API call sequence
- Bayou

Reference: Murali, Vijayaraghavan, et al. "Neural Sketch Learning for Conditional Program Generation." *arXiv preprint arXiv:1703.05698* (2017).

---

## Slide 15: API Sequence Synthesis (with Bayou example)

- Input: APIs in the target library
  - Stand-alone APIs
  - Might miss a few APIs
- Output: a complete API call sequence
- Bayou

### Bayou Input (evidence):

```java
void read_href(String content,
               String selector,
               String attr,
               Evaluator _arg01) {
  ///call:parse type:Jsoup call:select
  ///call:first
}
```

### Bayou Output:

```java
void read_href(String content,
               String selector,
               String attr,
               Evaluator _arg01) {
  Elements e1;
  Document d1;
  Elements e3;
  Element e2;
  d1 = Jsoup.parse(content);
  e1 = d1.select(selector);
  e2 = e1.first();
  e3 = Collector.collect(_arg01, e2);
}
```

Evidence: types, calls
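
To connect the translation step to Bayou, the translated calls and types have to be written as evidence. The sketch below formats them as a `///` comment line in the style shown above; `EvidenceBuilder` and `evidenceLine` are hypothetical names, and listing all calls before all types is an arbitrary ordering chosen for the example.

```java
import java.util.List;
import java.util.StringJoiner;

final class EvidenceBuilder {
    // Format calls and types as one evidence comment line, e.g. "///call:parse type:Jsoup".
    static String evidenceLine(List<String> calls, List<String> types) {
        StringJoiner joiner = new StringJoiner(" ", "///", "");
        for (String call : calls) {
            joiner.add("call:" + call);
        }
        for (String type : types) {
            joiner.add("type:" + type);
        }
        return joiner.toString();
    }
}
```

Calling `evidenceLine(List.of("parse", "select", "first"), List.of("Jsoup"))` yields `///call:parse call:select call:first type:Jsoup`.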

---

## Slide 16: Bayou

### Architecture:

1. Code corpus --> Evidence extraction --> Evidence/code pairs (e.g., "call:readLine type:FileReader type:BufferedReader", "type:Iterator call:next call:remove")

Example extracted code:

```java
void remove(List<String> list) {
    Iterator<String> i1;
    boolean b1;
    i1 = list.iterator();
    while ((b1 = i1.hasNext())) {
        i1.next();    // advance first; remove() deletes the element returned by next()
        i1.remove();
    }
    return;
}
```

2. Training (neural network with distribution curve)

3. Input evidence: `call:parse type:Jsoup call:select call:first` --> Trained model --> Output:

```java
void read_href(String content,
               String selector,
               String attr,
               Evaluator _arg01) {
  Elements e1;
  Document d1;
  Elements e3;
  Element e2;
  d1 = Jsoup.parse(content);
  e1 = d1.select(selector);
  e2 = e1.first();
  e3 = Collector.collect(_arg01, e2);
}
```
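
The evidence extraction step can be approximated with a crude regular expression that collects method-call names from source text. This is only a sketch: a real extractor would work on a parsed program rather than on raw text, `CallEvidenceExtractor` is a hypothetical name, and this version keeps every call it finds, whereas the example evidence above lists only some of them.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CallEvidenceExtractor {
    // Matches ".name(" and captures the method name.
    private static final Pattern CALL = Pattern.compile("\\.([A-Za-z_]\\w*)\\s*\\(");

    // Collect distinct "call:<name>" items in order of first appearance.
    static Set<String> extractCalls(String code) {
        Set<String> evidence = new LinkedHashSet<>();
        Matcher matcher = CALL.matcher(code);
        while (matcher.find()) {
            evidence.add("call:" + matcher.group(1));
        }
        return evidence;
    }
}
```

Because it matches text, this approach would also pick up calls inside comments or string literals, which is one reason a parser-based extractor is preferable in practice.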

---

## Slide 17: Evaluation

- Accuracy - percentage of correctly generated API calls
- 75% accuracy on most benchmark problems

Bar chart showing "Accuracy w/o params" and "Accuracy" for benchmark tasks:

CSV read, CSV write, CSV database, CSV delimiter, email login, email check, email send, email delete, FTP list, FTP login, FTP upload, FTP download, FTP delete, HTML scraping, HTML add node, HTML rm attr, HTML parse, HTML title, HTML write, HTTP get, HTTP post, HTTP server, NLP sentence, NLP token, NLP tag, NLP stem, ML classification, ML regression, ML cluster, ML neural network, graphics, gui, pdf read, pdf write, word read, word write
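
The accuracy metric can be sketched as the share of expected API calls that appear in the generated code. The treatment of "Accuracy w/o params" as a comparison on method names only is an assumption about the chart, and `CallAccuracy` with its string representation of calls is hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CallAccuracy {
    // Strip the argument list: "connect(host)" becomes "connect".
    static String nameOnly(String call) {
        int paren = call.indexOf('(');
        return paren < 0 ? call : call.substring(0, paren);
    }

    // Percentage of expected calls that appear in the generated code.
    static double accuracy(List<String> expected, List<String> generated, boolean ignoreParams) {
        if (expected.isEmpty()) {
            throw new IllegalArgumentException("expected must not be empty");
        }
        Set<String> produced = new HashSet<>();
        for (String call : generated) {
            produced.add(ignoreParams ? nameOnly(call) : call);
        }
        int correct = 0;
        for (String call : expected) {
            if (produced.contains(ignoreParams ? nameOnly(call) : call)) {
                correct++;
            }
        }
        return 100.0 * correct / expected.size();
    }
}
```

Under this reading, a generated `login(user)` counts as correct without parameters but incorrect with them when the expected call is `login(username, password)`.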

---

## Slide 18: Translation Failure

Top chart: Same accuracy bar chart as Slide 17, with red boxes highlighting problem areas: HTML title, HTML write, gui, pdf read, pdf write, word read, word write.

Bottom chart: "Translation" accuracy at different levels (Translation-1, Translation-3, Translation-5) for the same benchmark tasks, showing that translation accuracy is the bottleneck for the highlighted tasks.
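
One plausible reading of Translation-1, Translation-3, and Translation-5 is top-k accuracy: a source API counts as translated correctly if the right target API is among its k highest-ranked candidates. That interpretation is an assumption, and the names in this sketch are hypothetical.

```java
import java.util.List;
import java.util.Map;

final class TopKTranslation {
    // Percentage of source APIs whose correct target is among the first k ranked candidates.
    static double accuracyAtK(Map<String, List<String>> rankedCandidates,
                              Map<String, String> groundTruth, int k) {
        if (groundTruth.isEmpty()) {
            throw new IllegalArgumentException("groundTruth must not be empty");
        }
        int hits = 0;
        for (Map.Entry<String, String> truth : groundTruth.entrySet()) {
            List<String> ranked = rankedCandidates.getOrDefault(truth.getKey(), List.of());
            List<String> top = ranked.subList(0, Math.min(k, ranked.size()));
            if (top.contains(truth.getValue())) {
                hits++;
            }
        }
        return 100.0 * hits / groundTruth.size();
    }
}
```

Accuracy can only stay the same or rise as k grows, so a task that is still low at k = 5 points to a translation problem rather than a ranking problem.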

---

## Slide 19: Rare Sequence

Top chart: Same accuracy bar chart, with red boxes highlighting: email send, HTML scraping, gui, pdf read, pdf write, word read, word write.

Bottom chart: "Min Bayou calls" for each benchmark task, showing that tasks with rare sequences (highlighted in red) have fewer matching Bayou training examples, leading to lower accuracy.
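
A statistic like "Min Bayou calls" could be computed as the smallest training-corpus frequency among the calls a task needs, so that a single rarely seen call marks the whole sequence as rare. This is one possible reading of the chart, not a confirmed definition; `RareSequenceCheck` and the count map are hypothetical.

```java
import java.util.List;
import java.util.Map;

final class RareSequenceCheck {
    // Smallest corpus frequency among the calls a task needs;
    // calls never seen in the corpus count as zero.
    static int minCallCount(List<String> taskCalls, Map<String, Integer> corpusCounts) {
        if (taskCalls.isEmpty()) {
            return 0;
        }
        int min = Integer.MAX_VALUE;
        for (String call : taskCalls) {
            min = Math.min(min, corpusCounts.getOrDefault(call, 0));
        }
        return min;
    }
}
```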

---

## Slide 20: API Refactoring (Conclusion)

- Effective method that automates the process of API refactoring
- Combination of two techniques
  - API call translation
  - API call sequence synthesizer
- Does not work when
  - The source and target libraries use different terminology
  - The target API sequence is rare in the training corpus