Graph Mining Algorithms for Memory Leak Diagnosis and Biological Database Clustering
Maxwell, Evan Kyle
Large graph-based datasets are common to many applications because of the additional structure that graphs give to data. Patterns extracted from graphs must adhere to these structural properties, making them a more complex class of patterns to identify. The role of graph mining is to efficiently extract these patterns and quantify their significance. In this thesis, we focus on two application domains and demonstrate the design of graph mining algorithms in these domains. First, we investigate the use of graph grammar mining as a tool for diagnosing potential memory leaks from Java heap dumps. Memory leaks occur when memory that is no longer in use fails to be reclaimed, resulting in significant slowdowns, exhaustion of available storage, and eventually application crashes. Analyzing the heap dump of a program is a common strategy used in memory leak diagnosis, but our work is the first to employ a graph mining approach to the problem. Memory leaks accumulate in the heap as classes of subgraphs, and the allocation paths from which they emanate can be explored to contextualize the leak source. We show that it suffices to mine the dominator tree of the heap dump, which is significantly smaller than the underlying graph. We demonstrate several synthetic as well as real-world examples of heap dumps for which our approach provides more insight into the problem than state-of-the-art tools such as the Eclipse Memory Analyzer (MAT).
Second, we study the problem of multipartite graph clustering as an approach to database summarization on an integrated biological database. Construction of such databases has become a common theme in biological research, where heterogeneous data is consolidated into a single, centralized repository that provides a structured forum for data analysis. We present an efficient approximation algorithm for identifying clusters that form multipartite cliques spanning multiple database tables. We show that our algorithm computes a lossless compression of the database by summarizing it into a reduced set of biologically meaningful clusters. Our algorithm is applied to data from C. elegans, but we note its applicability to general relational databases.
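To illustrate the dominator tree that the abstract mentions for heap dumps, the following is a minimal sketch and not the thesis's implementation. It assumes the heap is given as a plain dictionary mapping each object to the objects it references, and it uses the simple iterative set-based dominator computation. The object names (GC_ROOT, Cache, Session, Entry1, Entry2, Buffer) and the function name dominator_tree are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def dominator_tree(graph, root):
    """Return {node: immediate dominator} for every node reachable from root.

    graph maps each node to an iterable of the nodes it references.
    This is the simple iterative set-based method, chosen for clarity
    rather than speed.
    """
    # Collect the nodes reachable from the root with a depth-first search.
    reachable = []
    seen = {root}
    stack = [root]
    while stack:
        node = stack.pop()
        reachable.append(node)
        for succ in graph.get(node, ()):
            if succ not in seen:
                seen.add(succ)
                stack.append(succ)

    # Build predecessor sets, restricted to reachable nodes.
    preds = defaultdict(set)
    for node in reachable:
        for succ in graph.get(node, ()):
            preds[succ].add(node)

    # Start from "everything dominates everything" and shrink to a fixed point.
    dom = {node: set(reachable) for node in reachable}
    dom[root] = {root}
    changed = True
    while changed:
        changed = False
        for node in reachable:
            if node == root:
                continue
            new = set.intersection(*(dom[p] for p in preds[node])) | {node}
            if new != dom[node]:
                dom[node] = new
                changed = True

    # The strict dominators of a node form a chain, so the immediate
    # dominator is the strict dominator with the largest dominator set.
    idom = {}
    for node in reachable:
        if node == root:
            continue
        strict = dom[node] - {node}
        idom[node] = max(strict, key=lambda d: len(dom[d]))
    return idom


# Hypothetical toy heap: an edge means "holds a reference to".
heap = {
    "GC_ROOT": ["Cache", "Session"],
    "Cache": ["Entry1", "Entry2"],
    "Session": ["Entry2"],
    "Entry1": ["Buffer"],
    "Entry2": ["Buffer"],
}

idom = dominator_tree(heap, "GC_ROOT")
for obj in sorted(idom):
    print(obj, "is immediately dominated by", idom[obj])
```

In this toy heap, Entry1 is reachable only through Cache, so Cache is its immediate dominator, while Entry2 and Buffer can each be reached along two independent paths and therefore hang directly under GC_ROOT. The resulting structure has exactly one parent edge per object, which is why a dominator tree can be much smaller than the reference graph it is derived from.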
- Masters Theses