Registered user since Mon 6 May 2019
Contributions
Research Papers
Wed 12 Oct 2022 17:40 - 18:00 at Ballroom C East - Technical Session 19 - Formal Methods and Models I Chair(s): Michalis Famelis
Code clone detection aims to find functionally similar code fragments and is becoming increasingly important in software engineering. Many code clone detection methods have been proposed, among which tree-based methods are able to handle semantic code clones. However, these methods are difficult to scale to big code because of the complexity of tree structures. In this paper, we design Amain, a scalable tree-based semantic code clone detector built on Markov chain models. Specifically, we propose a novel method to transform the complex original tree into simple Markov chains and compute the similarity of all states in these chains. After obtaining all similarity scores, we feed them into a machine learning classifier to train a code clone detector. To examine the effectiveness of Amain, we evaluate it on two widely used datasets, namely Google Code Jam and BigCloneBench. Experimental results show that Amain is superior to five state-of-the-art code clone detection tools (i.e., SourcererCC, Deckard, RtvNN, ASTNN, and SCDetector). Furthermore, compared to ASTNN, a recent tree-based code clone detector, Amain is more than 160 times faster in predicting semantic code clones.
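The idea of turning a tree into Markov chains and comparing their states can be illustrated with a minimal sketch. It is not the Amain implementation: it assumes Python's built-in `ast` module as the parser, treats each parent node type as a state whose outgoing probabilities are the relative frequencies of its child node types, and uses cosine similarity as a stand-in similarity measure. The function names `transition_probs`, `cosine`, and `state_similarities` are hypothetical.

```python
import ast
import math
from collections import Counter, defaultdict


def transition_probs(source):
    """Map each parent node type (state) to a distribution over child node types."""
    counts = defaultdict(Counter)
    for parent in ast.walk(ast.parse(source)):
        for child in ast.iter_child_nodes(parent):
            counts[type(parent).__name__][type(child).__name__] += 1
    # Normalise every row so that its probabilities sum to 1.
    return {
        state: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for state, row in counts.items()
    }


def cosine(p, q):
    """Cosine similarity of two sparse vectors stored as dicts (assumed measure)."""
    dot = sum(v * q.get(k, 0.0) for k, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / norm if norm else 0.0


def state_similarities(src_a, src_b):
    """One similarity score per state; a state missing from either side scores 0.0."""
    a, b = transition_probs(src_a), transition_probs(src_b)
    return {s: cosine(a.get(s, {}), b.get(s, {})) for s in sorted(set(a) | set(b))}
```

The resulting dictionary of per-state scores plays the role of the feature set that the abstract says is passed to a machine learning classifier; which classifier and which distance measures the authors use is not stated here.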
Research Papers
Thu 13 Oct 2022 17:10 - 17:30 at Banquet A - Technical Session 31 - Code Similarities and Refactoring Chair(s): Hua Ming
Code clone detection is an important research problem that has attracted wide attention in software engineering. Many methods have been proposed for detecting code clones. Among them, text-based and token-based approaches are scalable but do not consider code semantics and therefore cannot detect semantic code clones. Methods based on intermediate representations of code can detect semantic clones. However, graph-based methods are impractical because they require compiling the code, and existing tree-based approaches do not scale because of the size of the trees.
In this paper, we propose TreeCen, a scalable tree-based code clone detector that detects semantic clones effectively. Given the source code of a method, we first extract its abstract syntax tree (AST) through static analysis and transform it into a simple graph representation (i.e., a tree graph) according to node type, rather than using traditional heavyweight tree matching. We then treat the tree graph as a social network and apply centrality analysis to each node to preserve the details of the tree. In this way, the original complex tree is converted into a 72-dimensional vector that still contains comprehensive structural information about the AST. Finally, these vectors are fed into a machine learning model to train a detector, which is then used to find code clones. We conduct comparative evaluations of effectiveness and scalability. The experimental results show that TreeCen achieves the best performance when compared with six other state-of-the-art methods (i.e., SourcererCC, RtvNN, DeepSim, SCDetector, Deckard, and ASTNN), with F1 scores of 0.99 and 0.95 on the BigCloneBench and Google Code Jam datasets, respectively. In terms of scalability, TreeCen is about 79 times faster than ASTNN, another state-of-the-art tree-based semantic code clone detector.
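The pipeline described above (AST, graph over node types, centrality, fixed-length vector) can be sketched roughly as follows. This is an illustration under stated assumptions, not the TreeCen implementation: it parses Python with the built-in `ast` module, uses only degree centrality, and relies on a small hypothetical vocabulary `NODE_TYPES`, so the vector it produces is not the 72-dimensional vector from the abstract.

```python
import ast

# Hypothetical, deliberately small vocabulary of node types (an assumption).
NODE_TYPES = [
    "Module", "FunctionDef", "arguments", "arg", "Return", "BinOp",
    "Name", "Constant", "If", "While", "For", "Assign", "Call",
]


def type_graph(source):
    """Undirected graph whose vertices are AST node types; edges join parent and child types."""
    neighbours = {}
    for parent in ast.walk(ast.parse(source)):
        p = type(parent).__name__
        neighbours.setdefault(p, set())
        for child in ast.iter_child_nodes(parent):
            c = type(child).__name__
            neighbours.setdefault(c, set())
            if c != p:  # ignore self-loops so degree stays within [0, n - 1]
                neighbours[p].add(c)
                neighbours[c].add(p)
    return neighbours


def degree_centrality_vector(source, vocabulary=NODE_TYPES):
    """Fixed-length vector: degree centrality of each vocabulary type, 0.0 if absent."""
    graph = type_graph(source)
    denom = max(len(graph) - 1, 1)
    return [len(graph[t]) / denom if t in graph else 0.0 for t in vocabulary]
```

Because every code fragment maps to a vector of the same length, the vectors of a pair of fragments can be combined and passed to any standard classifier; the choice of centrality measures and classifier in TreeCen itself is not specified in this abstract.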