Detecting code duplications in the NPM community

dc.contributor.author: Liu, Hanwen (en)
dc.contributor.department: Computer Science and Applications (en)
dc.date.accessioned: 2021-09-10T08:00:30Z (en)
dc.date.available: 2021-09-10T08:00:30Z (en)
dc.date.issued: 2021-09-09 (en)
dc.description.abstract: In modern software development, it has become mainstream practice to build projects on top of third-party packages to simplify development. In this style of development, it is quite common to copy existing code or files from other libraries instead of calling them as regular dependencies. Although this approach can reduce a project's dependence on other libraries and make the project more streamlined, it also makes the code harder to maintain and understand. The third-party library community's neglect of code duplication can even be exploited for malicious purposes, such as typo-squatting attacks. This paper serves as a starting point for analyzing the growing code duplication issues surrounding third-party open source packages and the root causes of code duplication. I studied code duplication in a set of popular packages from the NPM community, a third-party open source package community, using a tokenizer and a code comparison tool to compute code similarity. I quantitatively analyzed the prevalence of code duplication in the NPM community and ran related experiments based on this similarity. The experiments show that code duplication is very common in the NPM community: 17.1% of all files have between 1 and 93 similar files in other packages when the file similarity threshold is set to 0.5, and 29.3% of all packages have at least one "similar package" when the package similarity threshold is set to 0.5. Of the 951 similar package pairs, 323 (33.9%) come from the same domain. The ultimate goal of this paper is to raise awareness of how common and how important code duplication is in the third-party package community, and to encourage the reasonable use of code duplication by developers in project development. (en)
dc.description.abstractgeneral: In modern software development, developers often rely on code that other people have already written to build their own programs. There are generally two ways to do this: call the code indirectly through "import" or similar instructions in the program, or copy and paste the code directly and make slight modifications. The second method can make the program more independent and easier to use, but the code duplication it causes also carries significant security risks. This paper serves as a starting point for analyzing the growing code duplication issues and the root causes of code duplication. I studied code duplication in a set of popular code packages from the NPM community. I used tools to compute a value that measures how similar different pieces of code are to each other, quantitatively analyzed the prevalence of code duplication in the NPM community, and ran related experiments based on this similarity. The experiments show that code duplication is very common in the NPM community: 17.1% of all files have between 1 and 93 similar files in other packages, and 29.3% of all packages have at least one "similar package", when the definitions of similar files and similar packages are not very strict. Of the 951 similar package pairs, 323 (33.9%) come from the same domain. The ultimate goal of this paper is to raise awareness of how common and how important code duplication is in the third-party package community, and to encourage the reasonable use of code duplication by developers in project development. (en)
dc.format.medium: ETD (en)
dc.identifier.other: vt_gsexam:32241 (en)
dc.identifier.uri: http://hdl.handle.net/10919/104972 (en)
dc.publisher: Virginia Tech (en)
dc.rights: In Copyright (en)
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/ (en)
dc.subject: Code duplication (en)
dc.subject: NPM (en)
dc.subject: Clustering (en)
dc.subject: Code Similarity (en)
dc.title: Detecting code duplications in the NPM community (en)
dc.type: Thesis (en)
thesis.degree.discipline: Computer Science and Applications (en)
thesis.degree.grantor: Virginia Polytechnic Institute and State University (en)
thesis.degree.level: masters (en)
thesis.degree.name: Master of Science (en)
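
The abstract describes tokenizing source files, computing a similarity score, and treating two files as similar when the score reaches a threshold of 0.5. The record does not name the tokenizer, the comparison tool, or the similarity metric, so the following is only a minimal sketch under stated assumptions: a regex tokenizer and a Jaccard score over token multisets stand in for the thesis's actual tools, and every name here (`TOKEN_RE`, `tokenize`, `file_similarity`, `is_similar`) is hypothetical.

```python
import re
from collections import Counter

# Assumption: a crude tokenizer that splits source text into identifiers,
# numbers, and single punctuation characters. The thesis's tokenizer is
# not specified in this record.
TOKEN_RE = re.compile(r"[A-Za-z_$][\w$]*|\d+(?:\.\d+)?|\S")


def tokenize(source: str) -> Counter:
    """Return the multiset (bag) of tokens found in a source string."""
    return Counter(TOKEN_RE.findall(source))


def file_similarity(a: str, b: str) -> float:
    """Jaccard similarity over token multisets (an assumed metric)."""
    ta, tb = tokenize(a), tokenize(b)
    union = sum((ta | tb).values())       # per-token maximum counts
    if union == 0:
        return 0.0                        # two empty files: report no similarity
    shared = sum((ta & tb).values())      # per-token minimum counts
    return shared / union


def is_similar(a: str, b: str, threshold: float = 0.5) -> bool:
    """Assumption: files count as similar when the score is >= threshold."""
    return file_similarity(a, b) >= threshold


if __name__ == "__main__":
    original = "function add(a, b) { return a + b; }"
    renamed = "function add(x, y) { return x + y; }"
    unrelated = "const x = 1;"
    print(file_similarity(original, renamed))
    print(is_similar(original, unrelated))
```

A bag-of-tokens score ignores token order, so it tolerates reordered code but can overstate similarity between files that merely share common syntax. The choice of metric and threshold therefore directly affects how many files are reported as duplicates.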
Files
Original bundle
Name: Liu_H_T_2021.pdf
Size: 1.06 MB
Format: Adobe Portable Document Format