How Readable is Model Generated Code? Examining Readability and Visual Inspection of GitHub Copilot
Virtual
Background: Recent advancements in large language models have motivated the practical use of such models in code generation and program synthesis. However, little is known about the effects of such tools on code readability and visual attention in practice. Objective: In this paper, we focus on GitHub Copilot to address the issues of readability and visual inspection of model generated code. Readability and low complexity are vital aspects of good source code, and visual inspection of generated code is important in light of automation bias. Method: Through a human experiment (n=21), we compare model generated code to code written entirely by human programmers. We use a combination of static code analysis and human evaluators to assess code readability, and we use eye tracking to assess the visual inspection of code. Results: Our results suggest that model generated code is comparable in complexity and readability to code written entirely by human programmers. At the same time, eye tracking data indicate, at a statistically significant level, that programmers direct less visual attention to model generated code. Conclusion: Our findings highlight that reading code is more important than ever, and programmers should beware of complacency and automation bias when working with model generated code.
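As a rough illustration of the kind of comparison described in the Method, the sketch below contrasts two groups of code samples using an off-the-shelf complexity metric and a non-parametric test. The radon and scipy libraries, the cyclomatic-complexity metric, and the toy snippets are assumptions chosen for illustration, not the instrumentation or data used in the study.

```python
# Hypothetical sketch: compare complexity of "model generated" vs. "human written"
# code samples, in the spirit of the static-analysis comparison described above.
# radon / scipy and the toy snippets are illustrative assumptions, not the
# actual tooling or data used in the paper.
from radon.complexity import cc_visit
from scipy.stats import mannwhitneyu

def mean_cyclomatic_complexity(source: str) -> float:
    """Average cyclomatic complexity over all functions/classes in a snippet."""
    blocks = cc_visit(source)
    return sum(b.complexity for b in blocks) / max(len(blocks), 1)

# Toy stand-ins for the two groups of code samples.
model_generated = [
    "def add(a, b):\n    return a + b\n",
    "def is_even(n):\n    return n % 2 == 0\n",
]
human_written = [
    "def add(a, b):\n    if a is None or b is None:\n        raise ValueError\n    return a + b\n",
    "def is_even(n):\n    return not n & 1\n",
]

model_scores = [mean_cyclomatic_complexity(src) for src in model_generated]
human_scores = [mean_cyclomatic_complexity(src) for src in human_written]

# Non-parametric test of whether the two groups differ in complexity.
stat, p_value = mannwhitneyu(model_scores, human_scores, alternative="two-sided")
print(f"model={model_scores} human={human_scores} U={stat:.2f} p={p_value:.3f}")
```

A real comparison would of course use many more samples per group and complement the complexity metric with readability models and the eye-tracking measures described in the abstract.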
Rank Learning-Based Code Readability Assessment with Siamese Neural Networks
Virtual
Automatically assessing code readability is a relatively new challenge that has attracted growing attention from the software engineering community. In this paper, we propose to treat code readability assessment as a learning-to-rank task. Specifically, we design a pairwise ranking model with Siamese neural networks, which takes a pair of code snippets as input and outputs their relative readability order. We have evaluated our approach on three publicly available datasets. The results are promising, with an accuracy of 83.5%, a precision of 86.1%, a recall of 81.6%, an F-measure of 83.6%, and an AUC of 83.4%.
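To make the pairwise formulation concrete, the sketch below shows one way such a Siamese ranker could be wired up: a shared encoder scores each snippet in a pair, and a sigmoid over the score difference (RankNet-style pairwise ranking) predicts which snippet is more readable. The token-id input representation, the encoder architecture, and all hyperparameters are illustrative assumptions; the model in the paper may differ.

```python
# Hypothetical sketch of a Siamese pairwise ranker for readability, assuming
# code snippets are already tokenized into integer id sequences. The encoder,
# input representation, and hyperparameters are illustrative, not the
# architecture from the paper.
import torch
import torch.nn as nn

class SnippetEncoder(nn.Module):
    """Shared encoder: token embeddings -> mean pooling -> scalar readability score."""
    def __init__(self, vocab_size: int = 1000, embed_dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, 32), nn.ReLU(), nn.Linear(32, 1)
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        pooled = self.embed(token_ids).mean(dim=1)   # (batch, embed_dim)
        return self.scorer(pooled).squeeze(-1)       # (batch,)

class SiameseRanker(nn.Module):
    """Scores both snippets with the same encoder; sigmoid of the score
    difference gives P(snippet A is more readable than snippet B)."""
    def __init__(self):
        super().__init__()
        self.encoder = SnippetEncoder()

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.encoder(a) - self.encoder(b))

# Minimal training step on dummy data: label 1.0 means "snippet A is more readable".
model = SiameseRanker()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

snippet_a = torch.randint(1, 1000, (2, 50))   # batch of 2 token-id sequences
snippet_b = torch.randint(1, 1000, (2, 50))
labels = torch.tensor([1.0, 0.0])

optimizer.zero_grad()
prob_a_better = model(snippet_a, snippet_b)
loss = loss_fn(prob_a_better, labels)
loss.backward()
optimizer.step()
print(f"pairwise ranking loss: {loss.item():.4f}")
```

Because the pairwise objective depends only on the difference of scores, the shared encoder can also be used on its own at inference time to assign an absolute readability score to a single snippet.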