Investigating the Reproducbility of NPM packages

Files
TR Number
Date
2020-05-19
Journal Title
Journal ISSN
Volume Title
Publisher
Virginia Tech
Abstract

The meteoric increase in the popularity of JavaScript and a large developer community has led to the emergence of a large ecosystem of third-party packages available via the Node Package Manager (NPM) repository which contains over one million published packages and witnesses a billion daily downloads. Most of the developers download these pre-compiled published packages from the NPM repository instead of building these packages from the available source code. Unfortunately, recent articles have revealed repackaging attacks to the NPM packages. To achieve such attacks the attackers primarily follow three steps – (1) download the source code of a highly depended upon NPM package, (2) inject malicious code, and (3) then publish the modified packages as either misnamed package (i.e., typo-squatting attack) or as the official package on the NPM repository using compromised maintainer credentials. These attacks highlight the need to verify the reproducibility of NPM packages. Reproducible Build is a concept that allows the verification of build artifacts for pre-compiled packages by re-building the packages using the same build environment configuration documented by the package maintainers. This motivates us to conduct an empirical study (1) to examine the reproducibility of NPM packages, (2) to assess the influence of any non-reproducible packages, and (3) to explore the reasons for non-reproducibility. Firstly, we downloaded all versions/releases of 226 most-depended upon NPM packages, and then built each version with the available source code on Github. Secondly, we applied diffoscope, a differencing tool to compare the versions we built against the version downloaded from the NPM repository. Finally, we did a systematic investigation of the reported differences. At least one version of 65 packages was found to be non-reproducible. Moreover, these non- reproducible packages have been downloaded millions of times per week which could impact a large number of users. Based on our manual inspection and static analysis, most reported differences were semantically equivalent but syntactically different. Such differences result due to non-deterministic factors in the build process. Also, we infer that semantic differences are introduced because of the shortcomings in the JavaScript uglifiers. Our research reveals challenges of verifying the reproducibility of NPM packages with existing tools, reveal the point of failures using case studies, and sheds light on future directions to develop better verification tools.

Description
Keywords
Empirical, JavaScript, NPM packages, Reproducibility, Software Security, Software Engineering
Citation
Collections