CAN-zip – Centroid Based Delta Compression of Next Generation Sequencing Data

dc.contributor.author: Steere, Edward (en)
dc.contributor.author: An, Lin (en)
dc.contributor.author: Zhang, Liqing (en)
dc.contributor.department: Computer Science (en)
dc.date.accessioned: 2015-11-09T20:18:14Z (en)
dc.date.available: 2015-11-09T20:18:14Z (en)
dc.date.issued: 2015-11-09 (en)
dc.description.abstract: We present CANzip, a novel algorithm for compressing short read DNA sequencing data in FastQ format. CANzip is based on delta compression, a process in which only the differences of a specific data stream relative to a given reference stream are stored. However, CANzip uniquely assumes no given reference stream. Instead, it creates artificial references for different clusters of reads by constructing an artificial representative sequence for each given cluster. Each cluster sequence is then recoded to indicate only how it differs relative to this artificially created reference sequence. Remodeling the data in this way greatly improves the compression ratio achieved when used in conjunction with commodity tools such as bzip2. Our results indicate that CANzip outperforms gzip on average and that it can outperform bzip2. (en)
dc.format.mimetype: application/pdf (en)
dc.identifier.trnumber: TR-15-05 (en)
dc.identifier.uri: http://hdl.handle.net/10919/63992 (en)
dc.language.iso: en (en)
dc.publisher: Department of Computer Science, Virginia Polytechnic Institute & State University (en)
dc.relation.ispartof: Computer Science Technical Reports (en)
dc.rights: In Copyright (en)
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/ (en)
dc.subject: Bioinformatics (en)
dc.subject: Big data (en)
dc.title: CAN-zip – Centroid Based Delta Compression of Next Generation Sequencing Data (en)
dc.type: Technical report (en)
dc.type.dcmitype: Text (en)
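
The abstract describes recoding each read in a cluster as its differences from an artificial representative sequence. The following is a minimal Python sketch of that general idea, not the CANzip implementation: it assumes the cluster is already given, that all reads in it have equal length, and that the representative is a per-position majority consensus. The function names (build_centroid, delta_encode, delta_decode) are hypothetical and do not come from the report.

```python
from collections import Counter


def build_centroid(reads):
    """Assumed representative: the most common base at each position.

    Ties go to the base seen first in that column.
    """
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))


def delta_encode(read, centroid):
    """Keep only the (position, base) pairs where the read differs."""
    return [(i, b) for i, (b, c) in enumerate(zip(read, centroid)) if b != c]


def delta_decode(delta, centroid):
    """Rebuild a read by applying its differences to the centroid."""
    bases = list(centroid)
    for i, b in delta:
        bases[i] = b
    return "".join(bases)


# Toy cluster of equal-length reads (illustrative data only).
cluster = ["ACGTACGT", "ACGTACGA", "ACCTACGT"]
centroid = build_centroid(cluster)
deltas = [delta_encode(read, centroid) for read in cluster]
restored = [delta_decode(d, centroid) for d in deltas]
```

Reads that match the centroid closely produce short or empty difference lists, which is the kind of redundancy a general-purpose compressor such as bzip2 could then exploit.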

Files

Original bundle
Name: CANzip.pdf
Size: 191.95 KB
Format: Adobe Portable Document Format