Fast and Accurate Graph Clustering
| dc.contributor.author | Wanye, Frank Derry | en |
| dc.contributor.committeechair | Feng, Wu-Chun | en |
| dc.contributor.committeemember | Raghvendra, Sharath | en |
| dc.contributor.committeemember | Heath, Lenwood S. | en |
| dc.contributor.committeemember | Nikolopoulos, Dimitrios S. | en |
| dc.contributor.committeemember | Gleyzer, Vitaliy | en |
| dc.contributor.department | Computer Science and#38; Applications | en |
| dc.date.accessioned | 2025-10-29T08:00:12Z | en |
| dc.date.available | 2025-10-29T08:00:12Z | en |
| dc.date.issued | 2025-10-28 | en |
| dc.description.abstract | Graph clustering, also known as community detection, is a fundamental problem in graph analytics with applications across a wide variety of domains including bioinformatics, social media analysis, and anomaly detection. Graph clustering algorithms can be grouped into two categories: inferential and descriptive. While inferential algorithms deliver high accuracy, they do so with poor performance and poor scalability, particularly on large graphs. Conversely, descriptive algorithms are fast and scalable but significantly less accurate. In this manuscript, we focus on accelerating the performance and improving the scalability of stochastic block partitioning (SBP), a bottom-up (agglomerative) inferential graph-clustering algorithm based on sequential Bayesian inference. Like other inferential algorithms, SBP is highly accurate even on graphs with a complex community structure but does not scale well to large real-world graphs that can contain upwards of a million vertices. Our first set of contributions centers around performing data reduction via sampling as well as shared- and distributed-memory parallelization for accelerating the bottom-up formulation of SBP. We integrate these approaches into a unified, accelerated bottom-up SBP implementation, which exhibits up to 322X speedup over sequential SBP on a 1,000,000-vertex graph processed on 64 compute nodes with 256GB of memory. Next, we fundamentally re-factor SBP's bottomup approach into a top-down one and modify the previous data reduction and parallelization methodology to realize an integrated and accelerated top-down SBP algorithm. This new algorithm demonstrates up to 403X speedup over sequential SBP on real-world graphs when run on 64 compute nodes with 48 cores and 32GB of memory. Furthermore, the top-down computational approach enables SBP to process up to 4.4X larger graphs — from 862,664 vertices to 3,774,768 vertices — on the same 32GB of memory. Importantly, our approach accelerates SBP without sacrificing the algorithm's accuracy on both real-world and synthetic graph datasets. | en |
| dc.description.abstractgeneral | Graphs are a common type of relational dataset where individual data items form relationships with other data items. These relationships could be implicit, like the similarity of two patients' DNA samples, or explicit, like one social media user following another. Graph clustering, also known as community detection, refers to the process of finding groups of closely connected vertices in such a dataset. This problem is important in many areas, such as understanding social networks, analyzing biological systems, and detecting anomalies in data. Algorithms for graph clustering typically fall into two categories: accurate but slow methods that struggle to process large graphs in a reasonable amount of time and faster methods that often sacrifice accuracy. This dissertation focuses on improving both the speed and scalability of stochastic block partitioning (SBP), a slow but highly accurate graph clustering method based on statistical inference. While SBP excels at detecting complex clustering patterns, it faces significant computational challenges when applied to large graphs with upwards of millions of vertices. To address these challenges, this work introduces novel methods for accelerating SBP through techniques like reducing data volume, rethinking the computational paradigm, using multi-core processors for parallel computing, and scaling the computation across multiple computing machines. These methods can be combined to result in a highly optimized version of SBP that delivers a speed-up of 403X over its serial version on real-world graphs when run on 64 compute servers with 48 computing cores each, all without a significant difference in accuracy. They also allow us to process graphs with up to 4.4X more vertices on the same hardware. | en |
| dc.description.degree | Doctor of Philosophy | en |
| dc.format.medium | ETD | en |
| dc.identifier.other | vt_gsexam:42592 | en |
| dc.identifier.uri | https://hdl.handle.net/10919/138797 | en |
| dc.language.iso | en | en |
| dc.publisher | Virginia Tech | en |
| dc.rights | In Copyright | en |
| dc.rights.uri | http://rightsstatements.org/vocab/InC/1.0/ | en |
| dc.subject | graph clustering | en |
| dc.subject | community detection | en |
| dc.subject | stochastic block partitioning | en |
| dc.subject | graph analysis | en |
| dc.title | Fast and Accurate Graph Clustering | en |
| dc.type | Dissertation | en |
| thesis.degree.discipline | Computer Science & Applications | en |
| thesis.degree.grantor | Virginia Polytechnic Institute and State University | en |
| thesis.degree.level | doctoral | en |
| thesis.degree.name | Doctor of Philosophy | en |
Files
Original bundle
1 - 1 of 1