Fast and Accurate Graph Clustering

Wanye, Frank Derry

Fast and Accurate Graph Clustering

dc.contributor.author	Wanye, Frank Derry	en
dc.contributor.committeechair	Feng, Wu-Chun	en
dc.contributor.committeemember	Raghvendra, Sharath	en
dc.contributor.committeemember	Heath, Lenwood S.	en
dc.contributor.committeemember	Nikolopoulos, Dimitrios S.	en
dc.contributor.committeemember	Gleyzer, Vitaliy	en
dc.contributor.department	Computer Science and#38; Applications	en
dc.date.accessioned	2025-10-29T08:00:12Z	en
dc.date.available	2025-10-29T08:00:12Z	en
dc.date.issued	2025-10-28	en
dc.description.abstract	Graph clustering, also known as community detection, is a fundamental problem in graph analytics with applications across a wide variety of domains including bioinformatics, social media analysis, and anomaly detection. Graph clustering algorithms can be grouped into two categories: inferential and descriptive. While inferential algorithms deliver high accuracy, they do so with poor performance and poor scalability, particularly on large graphs. Conversely, descriptive algorithms are fast and scalable but significantly less accurate. In this manuscript, we focus on accelerating the performance and improving the scalability of stochastic block partitioning (SBP), a bottom-up (agglomerative) inferential graph-clustering algorithm based on sequential Bayesian inference. Like other inferential algorithms, SBP is highly accurate even on graphs with a complex community structure but does not scale well to large real-world graphs that can contain upwards of a million vertices. Our first set of contributions centers around performing data reduction via sampling as well as shared- and distributed-memory parallelization for accelerating the bottom-up formulation of SBP. We integrate these approaches into a unified, accelerated bottom-up SBP implementation, which exhibits up to 322X speedup over sequential SBP on a 1,000,000-vertex graph processed on 64 compute nodes with 256GB of memory. Next, we fundamentally re-factor SBP's bottomup approach into a top-down one and modify the previous data reduction and parallelization methodology to realize an integrated and accelerated top-down SBP algorithm. This new algorithm demonstrates up to 403X speedup over sequential SBP on real-world graphs when run on 64 compute nodes with 48 cores and 32GB of memory. Furthermore, the top-down computational approach enables SBP to process up to 4.4X larger graphs — from 862,664 vertices to 3,774,768 vertices — on the same 32GB of memory. Importantly, our approach accelerates SBP without sacrificing the algorithm's accuracy on both real-world and synthetic graph datasets.	en
dc.description.abstractgeneral	Graphs are a common type of relational dataset where individual data items form relationships with other data items. These relationships could be implicit, like the similarity of two patients' DNA samples, or explicit, like one social media user following another. Graph clustering, also known as community detection, refers to the process of finding groups of closely connected vertices in such a dataset. This problem is important in many areas, such as understanding social networks, analyzing biological systems, and detecting anomalies in data. Algorithms for graph clustering typically fall into two categories: accurate but slow methods that struggle to process large graphs in a reasonable amount of time and faster methods that often sacrifice accuracy. This dissertation focuses on improving both the speed and scalability of stochastic block partitioning (SBP), a slow but highly accurate graph clustering method based on statistical inference. While SBP excels at detecting complex clustering patterns, it faces significant computational challenges when applied to large graphs with upwards of millions of vertices. To address these challenges, this work introduces novel methods for accelerating SBP through techniques like reducing data volume, rethinking the computational paradigm, using multi-core processors for parallel computing, and scaling the computation across multiple computing machines. These methods can be combined to result in a highly optimized version of SBP that delivers a speed-up of 403X over its serial version on real-world graphs when run on 64 compute servers with 48 computing cores each, all without a significant difference in accuracy. They also allow us to process graphs with up to 4.4X more vertices on the same hardware.	en
dc.description.degree	Doctor of Philosophy	en
dc.format.medium	ETD	en
dc.identifier.other	vt_gsexam:42592	en
dc.identifier.uri	https://hdl.handle.net/10919/138797	en
dc.language.iso	en	en
dc.publisher	Virginia Tech	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.subject	graph clustering	en
dc.subject	community detection	en
dc.subject	stochastic block partitioning	en
dc.subject	graph analysis	en
dc.title	Fast and Accurate Graph Clustering	en
dc.type	Dissertation	en
thesis.degree.discipline	Computer Science & Applications	en
thesis.degree.grantor	Virginia Polytechnic Institute and State University	en
thesis.degree.level	doctoral	en
thesis.degree.name	Doctor of Philosophy	en

Files

Original bundle

Now showing 1 - 1 of 1

Name:: Wanye_FD_D_2025.pdf
Size:: 6.69 MB
Format:: Adobe Portable Document Format

Download

Collections

Doctoral Dissertations