Designing PhelkStat: Big Data Analytics for System Event Logs
With the wider adoption of microservice-based architectures in cloud and distributed systems, logging and monitoring costs have become increasingly relevant research topics. Many log analysis tools exist, including the ELK (Elasticsearch, Logstash, and Kibana) stack, Apache Spark, Sumo Logic, and Loggly. These tools have been deployed to detect anomalies, diagnose threats, optimize performance, and troubleshoot systems. Because logging is real-time and distributed, there will always be a need to optimize the performance of these tools; this performance can be quantified in terms of compute, storage, and network utilization. As part of the Information Technology Security Lab at Virginia Tech, we have the unique ability to leverage production data from the university network for research and testing. We analyzed the workload variations of two production systems at Virginia Tech and found that the maximum workload is about four times the average workload. A static configuration can therefore lead to inefficient use of resources. To address this, we propose PhelkStat: a tool that evaluates the temporal and spatial attributes of system workloads and uses clustering algorithms to categorize the current workload. Using PhelkStat, system parameters can be tuned automatically based on the workload. This paper reviews publicly available system event log datasets from supercomputing clusters and presents a statistical analysis of these datasets. We also show a correlation between these workload attributes and runtime performance.
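The two quantitative ideas above, the peak-to-average workload ratio and the clustering of workload levels, can be illustrated with a minimal sketch. It is not PhelkStat's implementation: the abstract does not name the clustering algorithm, so one-dimensional k-means is assumed here purely for illustration, and the function names (`peak_to_mean`, `kmeans_1d`, `classify`), the labels, and the events-per-minute values are all hypothetical.

```python
from statistics import mean


def peak_to_mean(rates):
    """Ratio of the busiest interval to the average interval."""
    return max(rates) / mean(rates)


def kmeans_1d(values, k=3, iters=50):
    """Plain Lloyd's k-means on scalars (assumes k >= 2); returns sorted centroids."""
    lo, hi = min(values), max(values)
    # Spread the initial centroids evenly across the observed range.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # Keep the old centroid when a cluster ends up empty.
        updated = [mean(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)


def classify(rate, centroids, labels=("low", "medium", "high")):
    """Label a rate by its nearest centroid (one label per centroid)."""
    nearest = min(range(len(centroids)), key=lambda i: abs(rate - centroids[i]))
    return labels[nearest]


# Hypothetical events-per-minute counts, not measurements from the paper.
rates = [100, 110, 90, 105, 95, 300, 320, 310, 900, 950]
centroids = kmeans_1d(rates, k=3)
print(f"peak/mean = {peak_to_mean(rates):.2f}")
print("current workload:", classify(880, centroids))
```

In a design like the one the abstract describes, the label returned by `classify` would select a configuration profile, so that resources follow the workload rather than being provisioned statically for the peak.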