How Reliable is the Crowdsourced Knowledge of Security Implementation?
MetadataShow full item record
The successful crowdsourcing model and gamification design of Stack Overflow (SO) Q&A platform have attracted many programmers to ask and answer technical questions, regardless of their level of expertise. Researchers have recently found evidence of security vulnerable code snippets being possibly copied from SO to production software. This inspired us to study how reliable is SO in providing secure coding suggestions. In this project, we automatically extracted answer posts related to Java security APIs from the entire SO site. Then based on the known misuses of these APIs, we manually labeled each extracted code snippets as secure or insecure. In total, we extracted 953 groups of code snippets in terms of their similarity detected by clone detection tools, which corresponds to 785 secure answer posts and 644 insecure answer posts. Compared with secure answers, counter-intuitively, insecure answers has higher view counts (36,508 vs. 18,713), higher score (14 vs. 5), more duplicates (3.8 vs. 3.0) on average. We also found that 34% of answers provided by the so-called trusted users who have administrative privileges are insecure. Our finding reveals that there are comparable numbers of secure and insecure answers. Users cannot rely on community feedback to differentiate secure answers from insecure answers either. Therefore, solutions need to be developed beyond the current mechanism of SO or on the utilization of SO in security-sensitive software development.
General Audience Abstract
Stack Overflow (SO), the most popular question and answer platform for programmers today, has accumulated and continues accumulating tremendous question and answer posts since its launch a decade ago. Contributed by numerous users all over the world, these posts are a type of crowdsourced knowledge. In the past few years, they have been the main reference source for software developers. Studies have shown that code snippets in answer posts are copied into production software. This is a dangerous sign because the code snippets contributed by SO users are not guaranteed to be secure implementations of critical functions, such as transferring sensitive information on the internet. In this project, we conducted a comprehensive study on answer posts related to Java security APIs. By labeling code snippets as secure or insecure, contrasting their distributions over associated attributes such as post score and user reputation, we found that there are a significant number of insecure answers (644 insecure vs 785 secure in our study) on Stack Overflow. Our statistical analysis also revealed the infeasibility of differentiating between secure and insecure posts leveraging the current community feedback system (eg. voting) of Stack Overflow.
- Masters Theses