Detecting Bots using Stream-based System with Data Synthesis
Abstract
Machine learning has shown great success in building security applications, including bot detection. However, many machine learning models are difficult to deploy because model training requires a continuous supply of representative labeled data, which is expensive and time-consuming to obtain in practice. To address this problem, this thesis builds a bot detection system with a data synthesis method and explores how to detect bots with limited labeled data. We collected network traffic from three online services in three different months within a year (23 million network requests). We develop a novel stream-based feature encoding scheme that allows our model to perform real-time bot detection on anonymized network data. We propose a data synthesis method that synthesizes unseen (or future) bot behavior distributions, enabling our system to detect bots with extremely limited labeled data. The synthesis method is distribution-aware: it uses two different generators in a Generative Adversarial Network to synthesize data for the clustered regions and the outlier regions of the feature space. We evaluate this idea and show that our method can train a model that outperforms existing methods with only 1% of the labeled data. We show that data synthesis also improves the model's sustainability over time and speeds up retraining. Finally, we compare data synthesis with adversarial retraining and show that the two complement each other in improving model generalizability.
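To give a rough sense of what a stream-based encoding of anonymized traffic can look like, the sketch below encodes each request using only statistics from earlier requests in the stream. It is an illustrative assumption, not the encoding scheme developed in the thesis: the class name `StreamFrequencyEncoder`, the field names `ip_hash` and `ua_hash`, and the choice of running frequency as the feature are all hypothetical.

```python
from collections import Counter


class StreamFrequencyEncoder:
    """Hypothetical encoder (not the thesis's scheme): maps each anonymized
    field value to the fraction of earlier requests that carried the same value."""

    def __init__(self, fields):
        self.fields = list(fields)
        # One running counter per field, updated as requests arrive.
        self.counts = {f: Counter() for f in self.fields}
        self.seen = 0  # number of requests processed so far

    def encode(self, request):
        # Build the feature vector from counts of earlier requests only,
        # so the encoding never looks ahead in the stream.
        vector = []
        for f in self.fields:
            prior = self.counts[f][request[f]]
            vector.append(prior / self.seen if self.seen else 0.0)
        # Update the running statistics after encoding.
        for f in self.fields:
            self.counts[f][request[f]] += 1
        self.seen += 1
        return vector


# Hypothetical anonymized requests: values are opaque hashes, not raw IPs or user agents.
stream = [
    {"ip_hash": "a1", "ua_hash": "u1"},
    {"ip_hash": "a1", "ua_hash": "u2"},
    {"ip_hash": "a1", "ua_hash": "u1"},
]
encoder = StreamFrequencyEncoder(["ip_hash", "ua_hash"])
encoded = [encoder.encode(r) for r in stream]
```

Because the features depend only on how often a hashed value has appeared, this kind of encoding works without access to the raw values, which is one plausible reason a frequency-style encoding suits anonymized data.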