CAN-zip – Centroid Based Delta Compression of Next Generation Sequencing Data

TR Number

TR-15-05

Date

2015-11-09

Publisher

Department of Computer Science, Virginia Polytechnic Institute & State University

Abstract

We present CANzip, a novel algorithm for compressing short-read DNA sequencing data in FASTQ format. CANzip is based on delta compression, a process in which only the differences between a data stream and a given reference stream are stored. Unlike conventional delta compression, however, CANzip assumes no given reference stream. Instead, it creates an artificial reference for each cluster of reads by constructing a representative sequence for that cluster. Each sequence in a cluster is then recoded to indicate only how it differs from this artificially created reference. Remodeling the data in this way greatly improves the compression ratio achieved when used in conjunction with commodity tools such as bzip2. Our results indicate that CANzip outperforms gzip on average and that it can outperform bzip2.
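
The idea of recoding reads against an artificial reference can be illustrated with a minimal sketch. It is not the CANzip implementation: it assumes the reads are already clustered and of equal length, and it uses a per-position majority vote as a hypothetical stand-in for the representative sequence. The names `consensus`, `delta_encode`, and `delta_decode` are illustrative only.

```python
from collections import Counter


def consensus(reads):
    """Build an artificial reference for one cluster.

    Assumption: a per-position majority base over equal-length reads.
    The report's actual centroid construction may differ.
    """
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))


def delta_encode(read, ref):
    """Store only the (position, base) pairs where the read differs from ref."""
    return [(i, b) for i, (b, r) in enumerate(zip(read, ref)) if b != r]


def delta_decode(delta, ref):
    """Rebuild the original read by applying the stored differences to ref."""
    bases = list(ref)
    for i, b in delta:
        bases[i] = b
    return "".join(bases)


# A toy cluster of three similar reads (made-up data).
cluster = ["ACGTACGT", "ACGTACGA", "ACCTACGT"]
ref = consensus(cluster)
deltas = [delta_encode(read, ref) for read in cluster]
restored = [delta_decode(d, ref) for d in deltas]
```

In this toy cluster most reads match the reference closely, so their deltas are short or empty. Highly repetitive output of this kind is what a general-purpose compressor such as bzip2 can then exploit.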

Keywords

Bioinformatics, Big data