How Do Java Developers Reuse StackOverflow Answers in Their GitHub Projects?

dc.contributor.author: Chen, Juntong
dc.contributor.committeechair: Meng, Na
dc.contributor.committeemember: Brown, Dwayne Christian
dc.contributor.committeemember: Gao, Peng
dc.contributor.department: Computer Science and Applications
dc.date.accessioned: 2022-09-10T08:00:20Z
dc.date.available: 2022-09-10T08:00:20Z
dc.date.issued: 2022-09-09
dc.description.abstract: StackOverflow (SO) is a widely used question-and-answer (Q&A) website for software developers and computer scientists. GitHub is a code hosting platform for collaboration and version control. Many popular software libraries are open source and published in repositories on GitHub. A preliminary observation showed that developers cite SO questions in their GitHub repositories. This observation inspired us to explore the relationship between SO posts and GitHub repositories and to help software developers better understand the characteristics of SO answers that are reused by GitHub projects. We conducted an empirical study to investigate the SO answers reused by Java code from public GitHub projects. To ensure precise results, we used a hybrid approach that combines code clone detection, keyword-based search, and manual inspection. This approach helped us identify the answers that developers leveraged. Based on the identified answers, we further investigated the topics of the discussion threads, answer characteristics (e.g., scores, ages, code lengths, and text lengths), and developers' reuse practices. We observed both reused and unused answers. Compared with unused answers, the reused answers mostly have higher scores, longer code, and longer plain-text explanations. Most reused answers were related to implementing specific coding tasks. In 9% (40/430) of the scenarios we observed, developers entirely copied code from one or multiple answers of an SO discussion thread. In the other 91% (390/430) of scenarios, developers only partially reused code or created brand-new code from scratch. We investigated 130 SO discussion threads referred to by Java developers in 356 GitHub projects and arranged them into five categories. Our findings can help the SO community better distribute programming knowledge and skills, and can inspire future research related to SO and GitHub.
dc.description.abstractgeneral: StackOverflow (SO) is a widely used question-and-answer (Q&A) website for software developers and computer scientists. GitHub is a code hosting platform for collaboration and version control. Many popular software libraries are open source and published in repositories on GitHub. A preliminary observation showed that developers cite SO questions in their GitHub repositories. This observation inspired us to explore the relationship between SO posts and GitHub repositories and to help software developers better understand the characteristics of SO answers that are reused by GitHub projects. Our objectives are to guide SO answerers in helping developers more effectively and to help tool builders understand how SO answers shape software products. To that end, we conducted an empirical study to investigate the SO answers reused by Java code from public GitHub projects. We used a three-step hybrid approach to refine our dataset and ensure precise results. The first step is code clone detection: we used a code clone detection tool to compare pairs of code snippets and find their similarity. The second step is a keyword-based search: we created multiple keywords and searched the GitHub code for referenced answers that step one missed. In the third step, we manually inspected the outputs of steps one and two to eliminate false positives from our data. This approach helped us identify the answers that developers leveraged. Based on the identified answers, we further investigated the topics of the discussion threads, answer characteristics, and developers' reuse practices. We observed both reused and unused answers. Compared with unused answers, the reused answers mostly have higher scores, longer code, and longer plain-text explanations. Most reused answers were related to implementing specific coding tasks. In 9% of the scenarios we observed, developers entirely copied code from one or multiple answers of an SO discussion thread. In the other 91% of scenarios, developers only partially reused code or created brand-new code from scratch. Our findings can help the SO community better distribute programming knowledge and skills, and can inspire future research related to SO and GitHub.
dc.description.degree: Master of Science
dc.format.medium: ETD
dc.identifier.other: vt_gsexam:35469
dc.identifier.uri: http://hdl.handle.net/10919/111786
dc.language.iso: en
dc.publisher: Virginia Tech
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: Empirical
dc.subject: StackOverflow
dc.subject: GitHub
dc.subject: answer reuse
dc.subject: clone detection
dc.title: How Do Java Developers Reuse StackOverflow Answers in Their GitHub Projects?
dc.type: Thesis
thesis.degree.discipline: Computer Science and Applications
thesis.degree.grantor: Virginia Polytechnic Institute and State University
thesis.degree.level: masters
thesis.degree.name: Master of Science

Files

Original bundle
Name: Chen_J_T_2022.pdf
Size: 527.49 KB
Format: Adobe Portable Document Format