Deep Learning for Code Generation using Snippet Level Parallel Data

dc.contributor.author: Jain, Aneesh
dc.contributor.committeechair: Reddy, Chandan K.
dc.contributor.committeemember: Lourentzou, Ismini
dc.contributor.committeemember: Gulzar, Muhammad Ali
dc.contributor.department: Computer Science and Applications
dc.date.accessioned: 2023-01-06T09:00:25Z
dc.date.available: 2023-01-06T09:00:25Z
dc.date.issued: 2023-01-05
dc.description.abstract: In the last few years, interest in applying deep learning methods to software engineering tasks has surged. A variety of approaches, including transformer-based methods, statistical machine translation models, and models inspired by natural language processing, have been proposed and shown to be effective at tasks like code summarization, code synthesis, and code translation. Multiple benchmark data sets have also been released, but all suffer from one limitation or another: some support only a select few programming languages, while others support only certain tasks. These limitations restrict researchers' ability to perform thorough analyses of their proposed methods. In this work we aim to alleviate some of the limitations faced by researchers who apply deep learning to software engineering tasks. We introduce a large, parallel, multilingual programming language data set that supports code summarization, code translation, code synthesis, and code search in 7 different languages. We provide benchmark results for current state-of-the-art models on all of these tasks, and we also explore some limitations of current evaluation metrics for code-related tasks. We provide a detailed analysis of the compilability of code generated by deep learning models, because compilability is a better measure of code usability than scores like BLEU and CodeBLEU. Motivated by our findings about compilability, we also propose a reinforcement learning based method that incorporates code compilability and syntax-level feedback as rewards, and we demonstrate its effectiveness in generating code with fewer syntax errors than baselines. In addition, we develop a web portal that hosts the models we have trained for code translation. The portal allows translation between 42 possible language pairs and lets users check the compilability of the generated code. The intent of this website is to give researchers and other audiences a chance to interact with and probe our work in a user-friendly way, without requiring them to write their own code to load the models and run inference.
dc.description.abstractgeneral: Deep neural networks have become ubiquitous and find applications in almost every technology and service we use today. In recent years, researchers have also started applying neural network based methods to problems in the software engineering domain. Software engineering by its nature requires a lot of documentation, and automatically creating this natural language documentation from programs given as input to neural networks has been one of their first applications in this domain. Other applications include translating code between programming languages and searching for code using natural language, as one does on websites like Stack Overflow. All of these tasks now have the potential to be powered by deep neural networks. It is common knowledge that neural networks are data hungry, and in this work we present a large data set containing code in multiple programming languages: Java, C++, Python, C#, JavaScript, PHP, and C. Our data set is intended to foster more research into automating software engineering tasks using neural networks. We provide an analysis of the performance of multiple state-of-the-art models on our data set in terms of compilability, which measures the number of syntax errors in the code, as well as other metrics. In addition, we propose our own deep neural network based model for code translation, which uses feedback from programming language compilers to reduce the number of syntax errors in the generated code. We also develop and present a website where some of our code translation models are hosted. The website allows users to interact with our work easily, without any knowledge of deep learning, and get a sense of how these technologies are being applied to software engineering tasks.
dc.description.degree: Master of Science
dc.format.medium: ETD
dc.identifier.other: vt_gsexam:36270
dc.identifier.uri: http://hdl.handle.net/10919/113065
dc.language.iso: en
dc.publisher: Virginia Tech
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: Deep Learning
dc.subject: Code Dataset
dc.subject: Code Translation
dc.subject: Software Development
dc.subject: Compilation
dc.subject: Reinforcement Learning
dc.title: Deep Learning for Code Generation using Snippet Level Parallel Data
dc.type: Thesis
thesis.degree.discipline: Computer Science & Applications
thesis.degree.grantor: Virginia Polytechnic Institute and State University
thesis.degree.level: masters
thesis.degree.name: Master of Science
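
Note: The abstract above describes a reinforcement learning based method that uses code compilability and syntax-level feedback as rewards, and an analysis that measures whether generated code compiles. As a rough illustration only, not the thesis implementation, the sketch below shows one way such a compilability signal could be computed for a generated Java snippet and folded into a reward. The dummy wrapper class, the javac invocation, the helper names, and the 0.5 weighting are all illustrative assumptions.

    # Illustrative sketch: compiler feedback as a reward signal (assumes javac on PATH).
    import os
    import subprocess
    import tempfile

    def compiles(java_snippet: str) -> bool:
        """Return True if javac accepts the snippet wrapped in a dummy main method."""
        wrapped = (
            "public class Snippet {\n"
            "    public static void main(String[] args) {\n"
            f"{java_snippet}\n"
            "    }\n"
            "}\n"
        )
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "Snippet.java")
            with open(path, "w") as f:
                f.write(wrapped)
            result = subprocess.run(
                ["javac", "-d", tmp, path],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            return result.returncode == 0

    def reward(generated: str, text_score: float, weight: float = 0.5) -> float:
        """Blend a text-similarity score (e.g. BLEU) with a binary compilability bonus."""
        return (1.0 - weight) * text_score + weight * (1.0 if compiles(generated) else 0.0)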
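Note: The portal's 42 language pairs follow from the seven languages named in the general abstract: every ordered pair of distinct languages is one translation direction, giving 7 * 6 = 42. A small sketch of that count (the language list comes from the abstract; the rest is illustrative):

    # Enumerate the 42 ordered translation directions over the seven languages.
    from itertools import permutations

    LANGUAGES = ["Java", "C++", "Python", "C#", "JavaScript", "PHP", "C"]
    pairs = list(permutations(LANGUAGES, 2))
    print(len(pairs))  # prints 42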

Files

Original bundle (2 files)

Name: Jain_A_T_2023.pdf
Size: 2.36 MB
Format: Adobe Portable Document Format

Name: Jain_A_T_2023_support_1.pdf
Size: 24.3 KB
Format: Adobe Portable Document Format
Description: Supporting documents

Collections