Semi-Supervised Code Translation Overcoming the Scarcity of Parallel Code Data

Zhu, Ming; Karim, Mohimenul; Lourentzou, Ismini; Yao, Daphne

dc.contributor.author	Zhu, Ming	en
dc.contributor.author	Karim, Mohimenul	en
dc.contributor.author	Lourentzou, Ismini	en
dc.contributor.author	Yao, Daphne	en
dc.date.accessioned	2024-11-04T14:13:01Z	en
dc.date.available	2024-11-04T14:13:01Z	en
dc.date.issued	2024-10-27	en
dc.date.updated	2024-11-01T07:56:53Z	en
dc.description.abstract	Neural code translation is the task of converting source code from one programming language to another. One of the main challenges is the scarcity of parallel code data, which hinders the ability of translation models to learn accurate cross-language alignments. In this paper, we introduce MIRACLE, a semi-supervised approach that improves code translation through synthesizing high-quality parallel code data and curriculum learning on code data with ascending alignment levels. MIRACLE leverages static analysis and compilation to generate synthetic parallel code datasets with enhanced quality and alignment to address the challenge of data scarcity. We evaluate the proposed method along with strong baselines including instruction-tuned Large Language Models (LLMs) for code. Our analysis reveals that LLMs pre-trained on open-source code data, regardless of their size, suffer from the “shallow translation” problem. This issue arises when translated code copies keywords, statements, and even code blocks from the source language, leading to compilation and runtime errors. Extensive experiments demonstrate that our method significantly mitigates this issue, enhancing code translation performance across multiple models in C++, Java, Python, and C. Remarkably, MIRACLE outperforms code LLMs that are ten times larger in size. MIRACLE also achieves up to a 43% improvement in C code translation with fewer than 150 annotated examples.	en
dc.description.version	Published version	en
dc.format.mimetype	application/pdf	en
dc.identifier.doi	https://doi.org/10.1145/3691620.3695524	en
dc.identifier.uri	https://hdl.handle.net/10919/121527	en
dc.language.iso	en	en
dc.publisher	ACM	en
dc.rights	Creative Commons Attribution 4.0 International	en
dc.rights.holder	The author(s)	en
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/	en
dc.title	Semi-Supervised Code Translation Overcoming the Scarcity of Parallel Code Data	en
dc.type	Article - Refereed	en
dc.type.dcmitype	Text	en

Files

Original bundle

Name: 3691620.3695524.pdf
Size: 1.19 MB
Format: Adobe Portable Document Format
Description: Published version

Collections

Journal Articles, Association for Computing Machinery (ACM)
Scholarly Works, Sanghani Center for Artificial Intelligence and Data Analytics