Registered user since Fri 25 Aug 2023
Contributions
Industry Showcase (Papers)
Tue 12 Sep 2023 11:06 - 11:18 at Room D - Infrastructure, Build, and Logs Chair(s): Fatemeh Hendijani Fard, Arie van Deursen
Development teams in large companies often maintain a huge codebase whose build time can be painfully long on a single machine. To reduce the build time, tools such as Bazel and distcc are used to build the codebase in a distributed fashion. However, during a distributed build it is common for certain remote nodes to crash due to two types of errors: Out Of Memory (OOM) and Deadline Exceeded (DE). These crashes force a time-consuming rebuild, a problem also faced by the WeiXin Group (WXG) of Tencent Inc., the company that created WeChat. Since existing tools cannot help avoid OOM and DE errors, we propose PCRLinear, which predicts the memory and time requirements of compiling a C++ file, allowing the original distributed build system to schedule compilation adaptively according to the prediction. Our experiments show that PCRLinear reduces OOM and DE errors to zero and improves average build performance by a significant 30%.
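The abstract does not disclose PCRLinear's features or coefficients, but the idea of a per-file linear predictor driving adaptive scheduling can be sketched as follows. All feature names, weights, and thresholds below are made-up placeholders for illustration, not the paper's actual model.

```python
from dataclasses import dataclass

@dataclass
class FileFeatures:
    # Hypothetical features a build system could extract cheaply per C++ file.
    loc: int            # lines of code after preprocessing
    num_includes: int   # number of transitive #include dependencies
    num_templates: int  # count of template instantiation sites

def predict_memory_mb(f: FileFeatures) -> float:
    """Linear model with placeholder weights (not PCRLinear's)."""
    return 50.0 + 0.02 * f.loc + 1.5 * f.num_includes + 4.0 * f.num_templates

def predict_time_s(f: FileFeatures) -> float:
    """Linear model with placeholder weights (not PCRLinear's)."""
    return 1.0 + 0.001 * f.loc + 0.05 * f.num_includes + 0.2 * f.num_templates

def schedule(files, node_memory_mb: float, deadline_s: float):
    """Route files whose predicted demand exceeds an ordinary node's memory
    limit or compile deadline to a high-capacity node, so neither OOM nor
    Deadline Exceeded crashes occur on the ordinary nodes."""
    normal, heavy = [], []
    for name, feats in files:
        too_big = predict_memory_mb(feats) > node_memory_mb
        too_slow = predict_time_s(feats) > deadline_s
        (heavy if too_big or too_slow else normal).append(name)
    return normal, heavy
```

For example, with a 4 GB per-node memory limit and a 300 s deadline, a small translation unit would be scheduled normally while a template-heavy one would be routed to a high-capacity node.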
Industry Challenge (Competition)
Wed 13 Sep 2023 14:00 - 14:15 at Room FR - Industry Challenge (Competition) Chair(s): Sun Jianwen
No description available.
Research Papers
Wed 13 Sep 2023 16:08 - 16:21 at Plenary Room 2 - Code Generation 2 Chair(s): Marianne Huchard
Python is a popular dynamic programming language, evidenced by its ranking as the second most commonly used language on GitHub. However, its dynamic type system can lead to type errors, prompting researchers to explore automatic type inference approaches for Python programs. Existing type inference approaches can generally be grouped into three categories: rule-based, supervised, and cloze-style. Rule-based approaches can guarantee the accuracy of predicted variable types, but they suffer from low coverage caused by dynamic features and external calls. Supervised approaches, while feature-agnostic and able to mitigate the low-coverage problem, require large, high-quality annotated datasets and are limited to pre-defined types. Cloze-style approaches, being zero-shot, reformulate type inference as a fill-in-the-blank problem by leveraging the general knowledge in powerful pre-trained code models. However, their performance is limited because they ignore the domain knowledge from static typing rules that reflects the inference logic. Moreover, their predictions are not interpretable, hindering developers' understanding and verification of the results.
This paper introduces TypeGen, a few-shot generative type inference approach that incorporates domain knowledge from static analysis. TypeGen creates chain-of-thought (COT) prompts by translating the type inference steps of static analysis, based on type dependency graphs (TDGs), into prompts, enabling language models to learn how static analysis infers types. By combining COT prompts with code slices and type hints, TypeGen constructs example prompts from human annotations, and it requires only a few annotated examples to teach language models to generate similar COT prompts via in-context learning. Moreover, TypeGen enhances the interpretability of its results through an input-explanation-output strategy, which generates both explanations and type predictions in COT prompts. Experiments show that TypeGen outperforms the best baseline, Type4Py, by 10.0% in argument type prediction and 22.5% in return value type prediction in terms of top-1 Exact Match, using only five examples. Furthermore, TypeGen achieves substantial improvements of 27% to 84% in top-1 Exact Match over the zero-shot performance of large language models with parameter sizes ranging from 1.3B to 175B.
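The abstract does not reproduce TypeGen's actual prompt template or TDG construction, but the general shape of turning type-dependency edges into a chain-of-thought prompt can be sketched as below. The edge-triple format, relation names, and prompt wording are all hypothetical.

```python
def build_cot_prompt(code_slice: str,
                     tdg_edges: list,
                     target_var: str) -> str:
    """Render a chain-of-thought prompt from a code slice and TDG edges.

    tdg_edges: (source, relation, destination) triples, e.g.
    ("literal '[]'", "assigned_to", "items") -- relation names are made up.
    """
    # Each TDG edge becomes one numbered reasoning step, mimicking how
    # static analysis would propagate types along dependencies.
    steps = [
        f"Step {i + 1}: {src} --{rel}--> {dst}"
        for i, (src, rel, dst) in enumerate(tdg_edges)
    ]
    return (
        "Code:\n" + code_slice.strip() + "\n\n"
        "Reasoning (from static analysis):\n" + "\n".join(steps) + "\n\n"
        f"Therefore, the type of `{target_var}` is:"
    )

prompt = build_cot_prompt(
    "def f():\n    items = []\n    items.append(1)\n    return items",
    [("literal '[]'", "assigned_to", "items"),
     ("int argument of append", "element_of", "items")],
    "items",
)
```

A few such rendered prompts, paired with human-written answers, would serve as the in-context examples from which the language model learns to emit its own explanation followed by a type prediction.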
Pre-print File Attached