Detecting code duplications in the NPM community

Date
2021-09-09
Publisher
Virginia Tech
Abstract

In modern software development, building projects on top of third-party packages has become mainstream practice because it simplifies development. Within this practice, developers commonly copy existing code or files from other libraries instead of calling those libraries as regular dependencies. Although copying can reduce a project's dependence on other libraries and keep the project lean, it also makes the code harder to maintain and understand. The third-party library community's inattention to code duplication can even be exploited for malicious purposes, such as typosquatting attacks. This paper serves as a starting point for analyzing the growing code duplication issues surrounding third-party open source packages and the root causes of that duplication. I studied code duplication in a set of popular packages from the NPM community, a third-party open source package community. Using a tokenizer tool and a code comparison tool to compute code similarity, I quantitatively analyzed the prevalence of code duplication in the NPM community and ran related experiments based on this similarity. The experiments show that code duplication is very common in the NPM community: with the file similarity threshold set to 0.5, 17.1% of all files have between 1 and 93 similar files in other packages; with the package similarity threshold set to 0.5, 29.3% of all packages have at least one "similar package". Of the 951 similar package pairs, 323 (33.9%) come from the same domain. The ultimate goal of this paper is to raise awareness of how common and how important code duplication is in the third-party package community, and to encourage developers to use code duplication judiciously in project development.
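
As an illustration of the kind of measurement the abstract describes, the following is a minimal sketch of token-based file similarity with a 0.5 threshold. The abstract does not name the tokenizer, the comparison tool, or the similarity metric, so everything here is an assumption: the regex tokenizer, the multiset Jaccard metric, and the names `tokenize`, `similarity`, and `is_similar` are hypothetical and are not the method used in the paper.

```python
import re
from collections import Counter

# Assumption: a crude regex tokenizer standing in for a real JavaScript
# tokenizer. It splits source into identifiers, numbers, and single
# non-whitespace characters.
TOKEN_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*|\d+(?:\.\d+)?|\S")


def tokenize(source: str) -> list[str]:
    """Return the list of tokens found in a source string."""
    return TOKEN_RE.findall(source)


def similarity(source_a: str, source_b: str) -> float:
    """Multiset Jaccard similarity of two token streams, in [0, 1].

    Assumption: the paper's actual similarity metric is not stated in
    the abstract; Jaccard is used here only as a common choice.
    """
    counts_a = Counter(tokenize(source_a))
    counts_b = Counter(tokenize(source_b))
    shared = sum((counts_a & counts_b).values())  # multiset intersection
    total = sum((counts_a | counts_b).values())   # multiset union
    return shared / total if total else 1.0


def is_similar(source_a: str, source_b: str, threshold: float = 0.5) -> bool:
    """Treat two files as similar when their score reaches the threshold."""
    return similarity(source_a, source_b) >= threshold


if __name__ == "__main__":
    original = "function add(a, b) { return a + b; }"
    renamed = "function add(x, y) { return x + y; }"
    unrelated = "while (true) {}"
    print(is_similar(original, renamed))
    print(is_similar(original, unrelated))
```

A set-based metric like this ignores token order, so it tolerates reordered declarations but can overestimate similarity between files that share only common keywords and punctuation; a real study would likely need a proper language-aware tokenizer and a more discriminating comparison.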

Keywords
Code duplication, NPM, Clustering, Code Similarity