Contributions
Research Papers
Thu 13 Oct 2022 17:10 - 17:30 at Banquet A - Technical Session 31 - Code Similarities and Refactoring Chair(s): Hua Ming
Code clone detection is an important research problem that has attracted wide attention in software engineering. Many methods have been proposed for detecting code clones. Among them, text-based and token-based approaches are scalable but do not consider code semantics, so they cannot detect semantic code clones. Methods based on intermediate representations of code can address semantic clone detection. However, graph-based methods are impractical because they require the code to be compiled, and existing tree-based approaches are limited by the size of the trees, which hinders scalable clone detection.
In this paper, we propose TreeCen, a tree-based code clone detector that scales well while detecting semantic clones effectively. Given the source code of a method, we first extract its abstract syntax tree (AST) by static analysis and transform it into a simple graph representation (i.e., a tree graph) according to node type, rather than using traditional heavyweight tree matching. We then treat the tree graph as a social network and apply centrality analysis to each node to preserve the tree details. In this way, the original complex tree is converted into a 72-dimensional vector that still contains comprehensive structural information about the AST. Finally, these vectors are fed into a machine learning model to train a detector, which is then used to find code clones. We conduct comparative evaluations of effectiveness and scalability. The experimental results show that TreeCen achieves the best performance when compared with six other state-of-the-art methods (i.e., SourcererCC, RtvNN, DeepSim, SCDetector, Deckard, and ASTNN), with F1 scores of 0.99 and 0.95 on the BigCloneBench and Google Code Jam datasets, respectively. In terms of scalability, TreeCen is about 79 times faster than another state-of-the-art tree-based semantic code clone detector (i.e., ASTNN).
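The idea of turning an AST into a node-type graph and then into a fixed-length centrality vector can be illustrated with a minimal sketch. This is not the TreeCen implementation: it assumes Python's standard `ast` module as the parser, uses only degree centrality as a single stand-in measure, and relies on a small hypothetical `VOCABULARY` of node types instead of reproducing the 72 dimensions mentioned in the abstract. The function names `tree_graph`, `degree_centrality`, and `to_vector` are illustrative.

```python
import ast
from collections import Counter


def tree_graph(source: str) -> Counter:
    """Collapse an AST into a directed graph over node *types*.

    Each key is a (parent_type, child_type) edge; the value counts how many
    times that edge occurs in the tree.
    """
    edges = Counter()
    for parent in ast.walk(ast.parse(source)):
        for child in ast.iter_child_nodes(parent):
            edges[(type(parent).__name__, type(child).__name__)] += 1
    return edges


def degree_centrality(edges) -> dict:
    """Degree centrality per node type: distinct incident edges / (n - 1).

    Edge multiplicities are ignored here; only distinct edges are counted.
    """
    types = {t for edge in edges for t in edge}
    scores = dict.fromkeys(types, 0.0)
    if len(types) < 2:
        return scores
    for parent_type, child_type in edges:
        scores[parent_type] += 1.0
        scores[child_type] += 1.0  # a self-loop therefore counts twice
    return {t: s / (len(types) - 1) for t, s in scores.items()}


def to_vector(source: str, vocabulary: list) -> list:
    """Fixed-length vector: one centrality slot per vocabulary entry."""
    scores = degree_centrality(tree_graph(source))
    return [scores.get(node_type, 0.0) for node_type in vocabulary]


# Hypothetical vocabulary, chosen only for this illustration.
VOCABULARY = ["Module", "FunctionDef", "arguments", "arg", "Return",
              "BinOp", "Name", "For", "Call"]

vec_f = to_vector("def f(a, b):\n    return a + b\n", VOCABULARY)
vec_g = to_vector("def g(x, y):\n    return x + y\n", VOCABULARY)
vec_h = to_vector("def h(xs):\n    for x in xs:\n        print(x)\n", VOCABULARY)
```

Because identifiers are discarded and only node types remain, `vec_f` and `vec_g` are identical even though the two functions use different names, while `vec_h` differs because its tree contains a loop and a call. In a full pipeline such vectors would be paired and passed to a trained classifier; that step is omitted here.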