Designing PhelkStat: Big Data Analytics for System Event Logs

Mohammed Salman*, Brian Welch*, and Joseph Tront*
*Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg

David Raymond† and Randy Marchany†
†IT Security Office, Virginia Tech, Blacksburg

Abstract—With wider adoption of micro-service based architectures in cloud and distributed systems, logging and monitoring costs have become increasingly relevant topics of research. There are a large number of log analysis tools, such as the ELK (Elasticsearch, Logstash, and Kibana) stack, Apache Spark, Sumo Logic, and Loggly, among many others. These tools have been deployed to perform anomaly detection, diagnose threats, optimize performance, and troubleshoot systems. Due to the real-time and distributed nature of logging, there will always be a need to optimize the performance of these tools; this performance can be quantified in terms of compute, storage, and network utilization. As part of the Information Technology Security Lab at Virginia Tech, we have the unique ability to leverage production data from the university network for research and testing. We analyzed the workload variations from two production systems at Virginia Tech, finding that the maximum workload is about four times the average workload. Therefore, a static configuration can lead to an inefficient use of resources. To address this, we propose PhelkStat: a tool to evaluate the temporal and spatial attributes of system workloads, using clustering algorithms to categorize the current workload. Using PhelkStat, system parameters can be automatically tuned based on the workload. This paper reviews publicly available system event log datasets from supercomputing clusters and presents a statistical analysis of these datasets. We also show a correlation between these attributes and runtime performance.

Index Terms—Log Analysis, Data Mining, Cloud Computing, Cybersecurity, ELK Stack.

I. INTRODUCTION

Log analysis and monitoring involve extracting important log information to identify key temporal system events. Cloud service providers such as IBM, Amazon Web Services, and Red Hat make use of log analysis extensively. Additionally, log analysis is an indispensable tool in the area of cybersecurity. With hackers using sophisticated tools and algorithms to try to break into systems, it is important for organizations to devise increasingly effective ways to protect against such attacks. Some of the most reliable and accurate tools in cybersecurity are the event and audit logs created by network devices. While log analysis initially started out as a means for troubleshooting faulty systems [1], log files are now also used for monitoring system and network activity [2].

Micro-service architectures have become quite prevalent in the IaaS (Infrastructure as a Service) industry. With the pay-as-you-go model becoming widely adopted, where the consumer pays based on resource usage, service providers must adopt a fine-grained monitoring approach. Service providers need to allocate resources (compute capacity, network transfer, persistent storage, etc.) to carry out this fine-grained monitoring, increasing the overhead of hosting these services. Log analysis tools are crucial to the accurate monitoring of these large, distributed systems. With increasing emphasis on cybersecurity, log files are now increasingly being used for threat diagnosis and mitigation [3].

Fig. 1. Phelk setup at ITSL
Network traces and system event logs are used to build anomaly detection tools. The 2016 Internet Security Threat Report [4] from Symantec puts the number of exposed identities at 423 million in 2015, up 143% from 2014. More importantly, the number of web attacks blocked per day was 1.1 million, an increase of 117%. Predictive algorithms, which employ machine learning and deep learning, are increasingly used to protect against attacks. These tools use logs as training data and predict anomalies and threats based on real-time log data.

Building log analysis tools that extract semantic log information is an extremely challenging problem. At the Information Technology Security Lab (ITSL) at Virginia Tech, we work at the intersection of big data and cybersecurity. To further cybersecurity research, we have deployed an Elasticsearch, Logstash, and Kibana (ELK) cluster, as shown in Figure 1, to stream data from the production network at Virginia Tech. Data sources include Snort data, wireless network associations, and syslog data from servers maintained by the Office of Networking and Information Systems. These data feeds are fed into a Kafka store before being ingested into Logstash, which performs Extract, Transform, and Load (ETL) operations on the data. The filtered data is then fed to an Elasticsearch cluster where it can be queried, and Kibana serves as the visualization tool for these feeds. The runtime performance of the ELK cluster (CPU utilization and I/O) is tracked using the collectd daemon, which sends the resource utilization parameters to a separate ELK instance. The current setup does not yet contain the PhelkStat implementation; the block shown in the diagram marks where we intend to instrument PhelkStat. One of the main challenges we face is optimizing the performance of the Elasticsearch cluster and evaluating other ingestion tools, such as Fluentd, to improve overall system performance.

Log analysis is an extremely challenging problem, and we do not attempt to propose a fully general solution. We focus on system event logs in particular, with these objectives in mind:

• Survey existing publicly maintained event log datasets; in addition to providing training data for PhelkStat, this also serves as a collection of datasets which other researchers may find useful in the future.
• Characterize system workloads by identifying quantitative attributes such as arrival rate, event log size, event criticality level, and content of anomalous data.
• Develop a model to find the optimal configuration parameters based on the workload.
• Generate log templates using NLP techniques to create artificial, but realistic, datasets.
• Extract user-sensitive fields such as host names and IP addresses, which could subsequently be used by anonymization tools to sanitize datasets.

II. BACKGROUND

We collected one month of system event log data (October 2016) from the Advanced Research Computing (ARC) group at Virginia Tech. ARC provides compute clusters to support research.
The event log data is from the New River system, a 134-node cluster with the following technical specifications:

TABLE I
ARC CLUSTER (NEW RIVER)

Compute Engine      Nodes  CPU                                 Cores  Memory
General             100    2 x E5-2680v3 2.5 GHz (Haswell)     24     128 GB
Big Data            16     2 x E5-2680v3 2.5 GHz (Haswell)     24     512 GB
GPU                 8      2 x E5-2680v3 2.5 GHz (Haswell)     24     512 GB
Interactive         8      2 x E5-2680v3 2.5 GHz (Haswell)     24     256 GB
Very Large Memory   2      4 x E7-4890v2 2.8 GHz (Ivy Bridge)  60     3 TB

In Figure 2, we calculated the number of event logs generated per minute, normalized to a 24-hour cycle. We posit that the arrival rate is directly correlated with the system workload. To establish a baseline, the mean arrival rate was calculated and found to be about 6,000 samples per second, while the maximum arrival rate was found to be about four times the mean value. We also evaluated the hourly and monthly variations and found the standard deviation of the workload to be about 50% of the mean value. Using the same configuration parameters without regard to this workload variation will therefore lead to poor performance. Another area of investigation is the over-commitment of resources in periods where the load is less than the mean arrival rate; in a similar analysis, [5] evaluated over-commitment of resources in OpenStack and found the storage overhead to be 80%.

Fig. 2. Arrival rate distribution per minute for the ARC dataset

III. RELATED WORK

The increasing relevance and scale of system log mining has led to a variety of approaches to derive meaning from the continual streams of logged system events. Accurate contextual log analysis has already seen practical application: Khan et al. describe the use of cloud log forensics to identify malicious behavior through log analysis [6].

Xu et al. detailed the strengths and weaknesses of different methods for large-scale system log processing, including tailored scripts, relational databases, and stream-based processing. Their analysis explains that the temporal, real-time, and distributed nature of system log events lends itself to stream-based processing. Working from this analysis, they implemented a decision tree-based algorithm to provide service operators with correlations between log attributes. Integral to their implementation was the concept of continuous queries, performed with the Telegraph Continuous Query Engine (TelegraphCQ). Their work, aimed at root failure localization, provides a basis for more advanced machine learning techniques targeted at contextual analysis [7].

A large obstacle in the contextual analysis of system log events is the inconsistency of their output formats. Despite attempts to standardize log output [8], uniformity is still lacking across the many sources and hierarchical logging layers of modern systems. The generation of log templates allows the variable words in a logging message to be distinguished from the fixed structure of the message; Kobayashi et al. refer to these two components as variables and descriptions, respectively [9]. To process these different log formats and generate log templates, Natural Language Processing (NLP) techniques such as Conditional Random Fields (CRFs) have been employed. Kobayashi et al. used a CRF-based model to attain greater than 99 percent accuracy in word-level comparison [9].
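To make the template-extraction idea concrete, the sketch below labels each whitespace-separated token of a log message as part of the fixed description or as a variable field. The library choice (sklearn-crfsuite), the feature set, and the single hand-labeled training message are our own illustrative assumptions, not the setup used in [9].

# Sketch: word-level CRF labeling of log tokens as fixed description ("DESC")
# vs. variable field ("VAR"). Library choice and features are assumptions.
import sklearn_crfsuite

def token_features(tokens, i):
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "has_digit": any(c.isdigit() for c in tok),
        "has_punct": any(c in ":=/." for c in tok),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# One hand-labeled training message (hypothetical labels, for illustration only).
train = [
    ("kernel: hda: status error: status=0x00".split(),
     ["DESC", "VAR", "DESC", "DESC", "VAR"]),
]
X_train = [[token_features(toks, i) for i in range(len(toks))] for toks, _ in train]
y_train = [labels for _, labels in train]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)

# Predicted labels for an unseen message; the DESC tokens form the template.
test = "kernel: sda: status error: status=0x51".split()
print(crf.predict([[token_features(test, i) for i in range(len(test))]]))

In practice, the training set would need many labeled messages spanning a variety of formats, which leads directly to the data-availability issue discussed next.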
Due to the supervised nature of CRF-based algorithms, their accuracy can depend largely on the quality and variation of the provided training data; [9] noted that their model performed poorly for log events for which they had limited training data. Sufficient quantities of such training datasets can be difficult to acquire. Efforts toward dataset categorization have taken place in similar areas of research: [10] presents a quantitative study comparing datasets in vision and language research.

An alternative approach to structuring log templates through NLP techniques is clustering. Clustering algorithms focus on grouping similar data [11]. The recent LogCluster algorithm, implemented and detailed by Vaarandi and Pihelgas, improves on the shortcomings of many previous clustering algorithms such as SLCT, and can be used to cluster similar log formats while identifying a disjoint set of outlier logs [12]. The real-time log data provided by current large-scale systems offers many different contexts from which analysis can be gleaned; however, each of the techniques described in this section excels at log analysis for a tailored purpose, such as root failure localization for a relatively narrow set of log formats [7] or the generation of log templates for datasets with a wider range of log formats [9]. These methods show promise in their respective areas of focus, but we believe they could provide a general-purpose solution to the real-time log analysis problem if combined in some capacity.

IV. QUALITY CRITERIA FOR SYSTEM EVENT LOGS

The quality of a dataset is heavily dependent on the system used for data collection. Large data centers serving thousands of users are able to produce datasets large enough to thoroughly exercise data analysis tools. Datasets can suffer from the problem of correlated information, where the event logs contain highly redundant information. Additionally, in many cases, datasets are proprietary and not available in the public domain. In this section, we propose a set of attributes that characterize system event log datasets. The attributes can be divided into two categories:

A. Temporal attributes

• Arrival rate distribution: The distribution of the number of samples occurring at a given instant of time. This gives us an idea of whether the events are sparsely or uniformly distributed, which can be used to characterize the system workload. Most event logs tend to be sparsely distributed, with more messages observed near an anomaly condition.
• Event size distribution: Not all event logs have the same format. Some events specify routine messages while others specify critical events or contain verbose messages. We calculated the number of bytes in each event log and plotted the distribution of event sizes against the number of messages of that size.

B. Spatial attributes

• Anomaly events: This analysis needs to be carried out separately because non-anomalous events tend to contain more bytes. Furthermore, syslog events can carry multiple criticality levels, and we must estimate the number of events at each level.
• Message type: Event logs can be categorized into multiple categories such as sshd, kernel, or RPC, among many others. We must estimate the number of events corresponding to each category.
• Contextual log analysis: To derive semantic meaning from log messages, we can employ NLP techniques to generate log templates, using CRFs to form word associations.
• Anonymizable content: This includes information that can be anonymized, such as hostnames, IP addresses, and port numbers. This is important because the information could be used by an anonymization tool to better tune its algorithm or to identify the actual content that needs to be anonymized (a small sketch of this kind of field extraction follows the list).
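As a small illustration of the last attribute, the sketch below flags candidate anonymizable fields with regular expressions. The hostname convention (e.g. nr-001), the port pattern, and the sample log line are assumptions made for the example, not the patterns used in our deployment.

# Sketch: flag anonymizable fields in a raw event log line. Patterns are
# illustrative; real deployments need site-specific rules.
import re

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
HOSTNAME = re.compile(r"\b[a-z]+-\d{3}\b")        # assumed convention, e.g. "nr-001"
PORT = re.compile(r"\bport\s+(\d{1,5})\b", re.IGNORECASE)

def anonymizable_fields(event):
    """Return the candidate anonymizable substrings found in one log line."""
    return {
        "ip": IPV4.findall(event),
        "hostname": HOSTNAME.findall(event),
        "port": PORT.findall(event),
    }

print(anonymizable_fields(
    "sshd[231]: Accepted password for alice from 10.1.2.3 port 52144 nr-001"))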
These attributes provide a set of quantitative metrics to characterize the system workload. The metrics serve as inputs to our clustering algorithm to categorize datasets. To maintain a mapping between the metrics and performance, we ingest these datasets into our ELK cluster and measure runtime parameters such as CPU utilization and file I/O, among others.

V. DATASETS

We have collected a mix of publicly available event log datasets and proprietary event logs from two organizations within Virginia Tech: the Office of Networking and Information Systems and the Advanced Research Computing group.

A. Publicly available datasets

The publicly available datasets used for evaluation are listed in Table II and described below.

1) The Computer Failure Data Repository (CFDR): CFDR was started as a project at CMU in 2006 to accelerate research by providing system event logs from a variety of production systems. We use the following datasets from CFDR:

• HPC4, containing three traces: Spirit2, Thunderbird2, and Liberty2. A more detailed description of these traces is given in [13].
• Cray.

TABLE II
LIST OF PUBLIC DATASETS

Name              Time period  System type       Size (GB)  Description
Thunderbird2      2005-2006    HPC cluster       27072      Dell system at SNL.
Liberty2          2004-2005    HPC cluster       944        HP system at SNL.
Spirit2           2005-2007    HPC cluster       1024       HP system at SNL.
Cray              2008         Cray systems      1.52       Event logs, console logs, and syslog from Cray XT series machines running Linux.
dartmouth/campus  2001-2004    Wireless network  1.1        Time-stamped, sanitized syslog records from over 450 access points over a period of 5 years.

2) CRAWDAD: Started at Dartmouth College to facilitate sharing of datasets captured from wireless networks, this collection contains 119 datasets contributed by the research community. We use the dartmouth/campus traceset [14].

B. Private Datasets

1) Advanced Research Computing (ARC): The ARC group is an organization at Virginia Tech focused on providing centralized support for research computing by building, operating, and promoting the use of advanced cyberinfrastructure. ARC maintains seven clusters, six of which contain more than fifty nodes. In collaboration with ARC, we were able to access one month of system event logs from these servers.

2) Information Technology Security Lab (ITSL): As shown in Figure 1, the data collected consists of about twelve months of wireless and system traces. Approximately two months of data were randomly selected to carry out the analysis.

C. Design

We propose PhelkStat, a tool that calculates the attributes introduced in Section IV. Due to the disparate nature of event log formats, using a single template to parse the fields is not possible. Given the large size of these datasets, we used Apache Spark as the stream-processing framework to build our tool. The experiments were performed on a four-node Spark cluster.

The current implementation does not generate log templates using the aforementioned CRF algorithm; rather, we generate templates based on the message types extracted from the log. The workflow is outlined in Figure 3, and a sketch of the corresponding Spark job is given after the list below.

Fig. 3. PhelkStat block diagram

Each of the functions (arrival rate, arrival bytes, anomaly content, and message type) generates its own RDD (Spark's Resilient Distributed Dataset) by defining its own map operations. There are three stages in the workflow:

• Stage 1: Input phase; the traceset file is loaded into an RDD. Spark also provides the option to load multiple trace files into one RDD.
• Stage 2: Map phase; Spark stores the intermediate results in memory as RDDs. In this stage, we create separate RDDs to calculate the various attributes. The temporal attributes rely on extracting the timestamp and are evaluated using the same RDD.
• Stage 3: Reduce phase; we combine the results from each branch of Stage 2 to aggregate the results. For the temporal parameters, we maintain a hashmap where the key is the minute of the day and the value is the attribute value. The spatial parameters are calculated by extracting keywords from the logs.
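A minimal PySpark sketch of this three-stage workflow is shown below. The dataset path is hypothetical, and the token positions for the timestamp and message type are assumed from a syslog-style line such as "2005.01.01 sadmin1 Jan 1 00:00:07 kernel: ..."; the production implementation differs in its parsing and error handling.

# Sketch of the three-stage PhelkStat workflow on Spark. Paths, field
# positions, and parsing are illustrative assumptions, not the production code.
from pyspark import SparkContext

sc = SparkContext(appName="phelkstat-sketch")

# Stage 1: load the traceset (a glob loads multiple trace files into one RDD).
logs = sc.textFile("hdfs:///datasets/spirit2/*.log")  # hypothetical path

def minute_of_day(line):
    # Assumes the time is the fifth whitespace-separated field, e.g. "00:00:07".
    try:
        hh, mm, _ = line.split()[4].split(":")
        return int(hh) * 60 + int(mm)
    except (IndexError, ValueError):
        return None

def message_type(line):
    fields = line.split()
    return fields[5].rstrip(":") if len(fields) > 5 else "unknown"

# Stage 2: one map branch per attribute over the same cached RDD.
logs.cache()
arrivals = logs.map(lambda l: (minute_of_day(l), 1))
byte_counts = logs.map(lambda l: (minute_of_day(l), len(l.encode("utf-8"))))
msg_types = logs.map(lambda l: (message_type(l), 1))

# Stage 3: reduce each branch into a hashmap keyed by minute of day
# (temporal attributes) or by message type (spatial attribute).
arrival_per_min = arrivals.filter(lambda kv: kv[0] is not None) \
                          .reduceByKey(lambda a, b: a + b).collectAsMap()
bytes_per_min = byte_counts.filter(lambda kv: kv[0] is not None) \
                           .reduceByKey(lambda a, b: a + b).collectAsMap()
msgs_per_type = msg_types.reduceByKey(lambda a, b: a + b).collectAsMap()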
VI. PRELIMINARY EVALUATION

A typical log event is characterized by a timestamp and an entry containing the event type, node information, and a message. An example of an event log in this format follows:

2005.01.01 sadmin1 Jan 1 00:00:07 kernel: hda: status error: status=0x00

We model the event log as:

$D = \{(x_1, t_1), (x_2, t_2), (x_3, t_3), \ldots, (x_n, t_n)\}$    (1)

where $x_i$ represents the event log entry and $t_i$ the corresponding timestamp. To estimate the temporal attributes, we use a piecewise approach that breaks the dataset into time sub-intervals of width $\delta t$ and evaluates the attributes on each sub-interval. Letting $N(t)$ denote the number of samples in $D$ with timestamp at most $t$, the sub-interval counts are represented by $V$:

$V_i = N(t_1 + i\,\delta t) - N(t_1 + (i-1)\,\delta t), \qquad V = \{V_1, V_2, \ldots, V_m\}$

A. Arrival Rate Distribution

This is the estimate of the number of samples in each sub-interval of time. The granularity of the sub-interval is essential to obtaining a representative distribution of samples. We estimate the mean value and the standard deviation of the sample distribution using:

$\mu = \frac{1}{M} \sum_{i=1}^{M} V_i, \qquad \mathrm{Var} = \sum_{i=1}^{M} (V_i - \mu)^2, \qquad \sigma = \sqrt{\mathrm{Var}/M}$

Fig. 4. Arrival rate distribution

Figure 4 shows a plot of the event arrival rate distribution against time. As the datasets are of varying size, the count has been normalized by the total number of samples in the respective dataset. It can be seen that the Thunderbird2 system has a bursty workload with multiple variations. The Liberty2 and Spirit2 systems have similar workload variations and can therefore be grouped into the same category with regard to configuration parameters. We also see periodic bursts of activity interspersed with spikes, suggesting that a load balancer has been deployed to distribute traffic among nodes.

B. Distribution of Bytes per Interval

Each event log differs in size, in terms of its number of bytes. Following a similar approach to the arrival rate distribution, and letting $C(t)$ denote the cumulative number of bytes up to time $t$, we use:

$B_i = C(t_1 + i\,\delta t) - C(t_1 + (i-1)\,\delta t), \qquad B = \{B_1, B_2, \ldots, B_m\}$

Fig. 5. Arrival bytes distribution

Figure 5 shows the arrival rate distribution in terms of bytes. The ITSL dataset shows significant variation in the number of bytes arriving as the day progresses, from which we can infer that either some system maintenance work is being performed or a set of scheduled jobs is being run. A detailed analysis of the message types at this point in the event log could show that the increase is due to cron jobs being scheduled.
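For concreteness, the per-interval quantities and summary statistics defined above reduce to a few lines of Python. This sketch assumes timestamps have already been parsed to seconds and operates on in-memory pairs; it is independent of the Spark implementation described in Section V.

# Sketch: piecewise interval statistics for a dataset of (timestamp, bytes)
# pairs, mirroring the definitions of V_i, B_i, mu, Var, and sigma above.
import math
from collections import defaultdict

def interval_stats(events, delta_t=60.0):
    """events: iterable of (timestamp_in_seconds, event_size_in_bytes)."""
    t1 = min(t for t, _ in events)
    V, B = defaultdict(int), defaultdict(int)
    for t, size in events:
        i = int((t - t1) // delta_t)   # index of the sub-interval containing t
        V[i] += 1                      # V_i: number of samples in interval i
        B[i] += size                   # B_i: number of bytes in interval i
    M = max(V) + 1
    counts = [V.get(i, 0) for i in range(M)]
    mu = sum(counts) / M
    var = sum((v - mu) ** 2 for v in counts)
    sigma = math.sqrt(var / M)
    return counts, [B.get(i, 0) for i in range(M)], mu, sigma

# Example: three events in the first minute, one in the third.
print(interval_stats([(0.0, 120), (10.0, 80), (59.0, 95), (130.0, 300)]))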
C. Message Type

The determination of the message type in an event log is straightforward; the message types we encountered were typically one of sshd, kernel, syslog-ng, or RPC, although we have come across other message types as well. This information is useful for log template generation, as each message type will need its own template. The message type counts are maintained in a hashmap where the key is the message type (mtype) and the value is given by:

$\mathrm{Val}_k = \sum_{i=1}^{N} I(\mathrm{mtype}_i = k)$

where $I(a = b)$ is an indicator function:

$I(a = b) = \begin{cases} 1 & a = b \\ 0 & a \neq b \end{cases}$

Fig. 6. Message type distribution

Figure 6 shows the distribution of message types for each of the datasets previously mentioned. This information is helpful for generating artificial datasets using log templates and for creating separate indices for each message type in Elasticsearch to improve performance.

D. Anomaly Content

The syslog protocol defines eight severity levels: Emergency, Alert, Critical, Error, Warning, Notice, Info, and Debug [8]. The number of messages belonging to each category is counted for each dataset by traversing the event logs; the results are outlined in Table III.

TABLE III
DISTRIBUTION OF MESSAGE CRITICALITY LEVELS

Dataset       Total          Error       Panic          Critical     Warning    Info
CRAY          6,214,940      324,420     14,040         0            22,080     5,881,400
LIBERTY2      265,569,231    4,904,141   20             18,201       167,114    160,579,755
NEWRIVER      22,144,269     3,645,133   4,618          1            1,236,769  17,257,748
SPIRIT2       211,212,192    12,722,989  16             2,413,982    7,908,396  18,166,809
THUNDERBIRD2  272,298,969    76,666,612  102            26,534       2,135,493  37,508
ITSL          1,490,667,342  253,350     1,264,513,232  155,486,153  563,746    1,424,636,833

E. Runtime Analysis

In order to create a mapping between the statistical attributes and the runtime performance, we ingested the data into an ELK stack similar to the one described in Section I and measured two runtime parameters: CPU utilization and memory consumption. The experiments were run for three iterations, and the mean of the results is presented here. The effect of the arrival rate and the byte rate can be explained intuitively: as the number of messages to be indexed increases, more resources are required. [15] shows that grouping messages of one type into the same index reduces the number of shards to be searched, which leads to an improvement in performance.

TABLE IV
RUNTIME ANALYSIS

Dataset       Mean Arrival Rate (events/s)  Mean CPU Utilization  Mean Memory Consumption (GB)
CRAY          4007.416                      19.72                 18.084
LIBERTY2      3073.717                      7.912                 18.0043
NEWRIVER      2656.28                       8.74163               12.9472
SPIRIT2       3151.6                        7.4063                15.6784
THUNDERBIRD2  2444.583                      4.5                   15.44

As seen in Table IV, there is a correlation between the arrival rate and the resource consumption. There do appear to be some anomalies with regard to New River, which has a higher CPU utilization despite a lower arrival rate. The analysis presented here considers only the arrival rate; we believe the anomalies can be explained by also taking into account other attributes such as event byte size and message type. Another attribute we have not yet taken into account is the anonymizable content. Elasticsearch performs tokenization, breaking down complex strings; for example, nr-001 will be split into nr and 001, both of which are indexed separately. Anonymizable content, such as hostnames and IP addresses, does not have to be split. PhelkStat can identify anonymizable fields, and tokenization on these fields can be disabled to improve performance.
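One way to disable tokenization is to map such fields as untokenized keyword types when the index is created. The sketch below uses Elasticsearch's index-creation REST API via Python; the index name and field names are hypothetical and would have to match the fields our Logstash filters actually emit.

# Sketch: create an index whose anonymizable fields are stored as untokenized
# keywords. Index and field names are hypothetical examples.
import requests

mapping = {
    "mappings": {
        "properties": {
            "hostname":  {"type": "keyword"},  # "nr-001" stays a single term
            "src_ip":    {"type": "ip"},
            "log_level": {"type": "keyword"},
            "message":   {"type": "text"},     # still analyzed for full-text search
        }
    }
}

resp = requests.put("http://localhost:9200/syslog-events", json=mapping)
resp.raise_for_status()
print(resp.json())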
VII. LIMITATIONS AND FUTURE WORK

Our current work presents a statistical evaluation of datasets along with a preliminary mapping to the runtime parameters. One limitation of the current framework is that once the optimal parameters are known, the service must be restarted after making the configuration changes. The next step is to tune the configuration parameters of log analysis tools for a dataset and validate whether that set of parameters also optimizes performance on another, statistically similar dataset. The categorization of workloads will be conducted using the unsupervised learning algorithms described in [16], [17] and [18].

With regard to implementing the contextual log analysis, we plan to begin with the CRF-based algorithm used by Kobayashi et al. [9]. The LogCluster algorithm, as described in [12], will then be applied to enhance the training stage of the CRF-based algorithm. In our proposed pipeline, log training data is first processed by the LogCluster algorithm, which determines clusters of log messages and generates a set of outliers. Additional features, based on shared presence with neighboring words in LogCluster-generated clusters, are then added to words in the CRF training data. Weak areas in the training data are identified by their presence in the set of outlier log events. Additionally, varying the LogCluster algorithm's support threshold parameter, which influences the number of clusters generated [12], allows control over the impact of this pre-processing step on the trained CRF model.

VIII. CONCLUSION

The preliminary analysis presented here shows a correlation between the statistical parameters and the runtime evaluation after normalizing with respect to the size of the datasets. Although the arrival rate and the byte count distribution show the same general trend, the byte distribution exhibits anomalies not seen in the arrival rate distribution; using the byte distribution as a feature vector in classification should therefore help improve accuracy.

The attributes need to be assigned weights based on the type of application. For example, an application performing anomaly detection would assign a higher weight to the anomaly content attribute, whereas an application optimizing system storage would assign a higher priority to the byte distribution. PhelkStat has so far been tested with ELK as the target application, primarily to optimize our current setup. Each application has its own configurable parameters. While optimizing for different applications remains an open issue, we believe the approach we have proposed can be generalized as long as there exists a mapping between the attributes and the configuration parameters of an application.

As the scale of distributed systems increases, so does the need for tools that can perform anomaly and threat detection, analyze performance metrics, and provide semantic log analysis. Additionally, the real-time nature of these large-scale systems requires that such tools have some degree of autonomy to tune their analyses to their current input streams. A tool that can perform the aforementioned analyses and scale adequately with the distributed systems of cloud providers will have widespread implications, ranging from threat mitigation and detection in the cybersecurity domain to micro-service usage assessment in the cloud hosting domain.
Our proposed PhelkStat will provide a base in temporal and spatial analysis from which these scalability problems can be addressed.

IX. ACKNOWLEDGMENT

The authors would like to thank Forest Godfrey (Cray dataset) and Jon Stearley and Adam Oliner (HPC4 dataset) for making the public datasets available. We would also like to thank the Advanced Research Computing organization at Virginia Tech for providing us with event logs from their research computing cluster.

REFERENCES

[1] U. Flegel, "Pseudonymizing Unix log files," in Infrastructure Security. Springer, 2002, pp. 162-179.
[2] K. Kent, "Guide to computer security log management," 2007.
[3] A. Ambre and N. Shekokar, "Insider threat detection using log analysis and event correlation," Procedia Computer Science, vol. 45, pp. 436-445, 2015.
[4] Symantec, "Internet Security Threat Report," 2016. https://www.symantec.com/content/dam/symantec/docs/reports/istr-21-2016-en.pdf (accessed 11/23/2016).
[5] A. Anwar, A. Sailer, A. Kochut, and A. R. Butt, "Anatomy of cloud monitoring and metering: A case study and open problems," in Proceedings of the 6th Asia-Pacific Workshop on Systems, ser. APSys '15. New York, NY, USA: ACM, 2015, pp. 6:1-6:7. http://doi.acm.org/10.1145/2797022.2797039
[6] S. Khan, A. Gani, A. W. A. Wahab, M. A. Bagiwa, M. Shiraz, S. U. Khan, R. Buyya, and A. Y. Zomaya, "Cloud log forensics: Foundations, state of the art, and future directions," ACM Comput. Surv., vol. 49, no. 1, pp. 7:1-7:42, May 2016. http://doi.acm.org/10.1145/2906149
[7] W. Xu, P. Bodik, and D. Patterson, "A flexible architecture for statistical learning and data mining from system log streams," in Temporal Data Mining: Algorithms, Theory and Applications. IEEE, January 2004. https://www.microsoft.com/en-us/research/publication/a-flexible-architecture-for-statistical-learning-and-data-mining-from-system-log-streams/
[8] R. Gerhards, "The syslog protocol," 2009.
[9] S. Kobayashi, K. Fukuda, and H. Esaki, "Towards an NLP-based log template generation algorithm for system log analysis," in Proceedings of The Ninth International Conference on Future Internet Technologies, ser. CFI '14. New York, NY, USA: ACM, 2014, pp. 11:1-11:4. http://doi.acm.org/10.1145/2619287.2619290
[10] F. Ferraro, N. Mostafazadeh, T.-H. K. Huang, L. Vanderwende, J. Devlin, M. Galley, and M. Mitchell, "A survey of current datasets for vision and language research," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 207-213.
[11] R. Vaarandi, "A data clustering algorithm for mining patterns from event logs," in Proceedings of the 3rd IEEE Workshop on IP Operations and Management (IPOM 2003), Oct 2003, pp. 119-126.
[12] R. Vaarandi and M. Pihelgas, "LogCluster - a data clustering and pattern mining algorithm for event logs," in Proceedings of the 2015 11th International Conference on Network and Service Management (CNSM). Washington, DC, USA: IEEE Computer Society, 2015, pp. 1-7. http://dx.doi.org/10.1109/CNSM.2015.7367331
[13] A. Oliner and J. Stearley, "What supercomputers say: A study of five system logs," in 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07). IEEE, 2007, pp. 575-584.
[14] D. Kotz, T. Henderson, I. Abyzov, and J. Yeo, "CRAWDAD dataset dartmouth/campus (v. 2009-09-09)," downloaded from http://crawdad.org/dartmouth/campus/20090909, Sep. 2016.
[15] "Tuning data aggregation and query performance with Elasticsearch on Azure," Microsoft Docs, https://docs.microsoft.com/en-us/azure/guidance/guidance-elasticsearch-tuning-data-aggregation-and-query-performance (accessed 11/23/2016).
[16] S. Balaji and S. Srivatsa, "Unsupervised learning in large datasets for intelligent decision making."
[17] P. Berkhin, "A survey of clustering data mining techniques," in Grouping Multidimensional Data. Springer, 2006, pp. 25-71.
[18] S. Zanero and S. M. Savaresi, "Unsupervised learning techniques for an intrusion detection system," in Proceedings of the 2004 ACM Symposium on Applied Computing. ACM, 2004, pp. 412-419.