Structure-based Optimizations for Sparse Matrix-Vector Multiply

dc.contributor.authorBelgin, Mehmeten
dc.contributor.committeecochairRibbens, Calvin J.en
dc.contributor.committeecochairBack, Godmar V.en
dc.contributor.committeememberSandu, Adrianen
dc.contributor.committeememberCameron, Kirk W.en
dc.contributor.committeememberGugercin, Serkanen
dc.contributor.departmentComputer Scienceen
dc.date.accessioned2014-03-14T20:21:10Zen
dc.date.adate2011-01-16en
dc.date.available2014-03-14T20:21:10Zen
dc.date.issued2010-12-14en
dc.date.rdate2012-05-14en
dc.date.sdate2010-12-24en
dc.description.abstractThis dissertation introduces two novel techniques, OSF and PBR, to improve the performance of Sparse Matrix-vector Multiply (SMVM) kernels, which dominate the runtime of iterative solvers for systems of linear equations. SMVM computations that use sparse formats typically achieve only a small fraction of peak CPU speeds because they are memory bound due to their low flops:byte ratio, they access memory irregularly, and exhibit poor ILP due to inefficient pipelining. We particularly focus on improving the flops:byte ratio, which is the main limiter on performance, by exploiting recurring structures or sub-structures in matrices. Our techniques also support micro-architecture level optimizations to further improve performance. Operation Stacking Framework (OSF) stacks problems in large ensemble computations, which run the same sparse kernel using an identical matrix structure, such that they share a single copy of the indexing information to significantly reduce memory bandwidth usage. OSF provides performance improvements of up to 1.94x on an AMD Opteron compared to the CSR method. We validate performance results using hardware event counters, which demonstrate significantly improved cache and pipeline utilization. Pattern-based Representation (PBR) exploits recurring block nonzero patterns by generating custom code for each recurring block pattern. In this way, no indexing data for individual nonzero elements are read from memory, reducing the overall size of the indices by up to 98%. Our code generator emits highly tuned codes that utilize SSE vectorization and software prefetching. PBR accurately identifies a block size that achieves optimal or near-optimal performance using a linear multiple regression performance model. On recent multicore machines, PBR provides performance improvements of up to 3.4x sequentially and 5x in parallel, compared to the CSR method. The PBR library we provide converts matrices at runtime, allowing our method to be used as a drop-in replacement for existing methods. We compare PBR's overhead relative to its benefits and show that PBR is beneficial for many applications that repetitively call the SMVM kernel for the same matrix structure.en
dc.description.degreePh. D.en
dc.identifier.otheretd-12242010-124006en
dc.identifier.sourceurlhttp://scholar.lib.vt.edu/theses/available/etd-12242010-124006/en
dc.identifier.urihttp://hdl.handle.net/10919/30260en
dc.publisherVirginia Techen
dc.relation.haspartBelgin_Mehmet_D_2010.pdfen
dc.rightsIn Copyrighten
dc.rights.urihttp://rightsstatements.org/vocab/InC/1.0/en
dc.subjectCode Generatorsen
dc.subjectVectorizationen
dc.subjectSparseen
dc.subjectSpMVen
dc.subjectSMVMen
dc.subjectMatrix Vector Multiplyen
dc.subjectPBRen
dc.subjectOSFen
dc.subjectthread poolen
dc.subjectparallel SpMVen
dc.titleStructure-based Optimizations for Sparse Matrix-Vector Multiplyen
dc.typeDissertationen
thesis.degree.disciplineComputer Scienceen
thesis.degree.grantorVirginia Polytechnic Institute and State Universityen
thesis.degree.leveldoctoralen
thesis.degree.namePh. D.en

Files

Original bundle
Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
Belgin_Mehmet_D_2010.pdf
Size:
4.69 MB
Format:
Adobe Portable Document Format