Privacy-Preserving Scanning of Big Content for Sensitive Data Exposure with MapReduce
The exposure of sensitive data in storage and transmission poses a serious threat to organizational and personal security. Data leak detection scans content, whether in storage or in transmission, for exposed sensitive data. Because the volume of content and sensitive data is large, the screening algorithm must scale in order to detect leaks in a timely manner. Our solution uses the MapReduce framework to detect exposed sensitive content, because it can scale arbitrarily and can use public resources such as Amazon EC2 for the task. We design new MapReduce algorithms that compute collection intersection for data leak detection. Our prototype, implemented with the Hadoop system, achieves an analysis throughput of 225 Mbps with 24 nodes. Our algorithms support a privacy-preserving data transformation that minimizes the exposure of sensitive data during detection and thereby supports the secure outsourcing of data leak detection to untrusted MapReduce and cloud providers.
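To make the idea of collection intersection in a map/reduce style concrete, the following is a minimal in-process sketch in Python. It is an illustration under stated assumptions, not the algorithm evaluated in this work: it assumes that each document is turned into a collection of one-way digests of fixed-length byte shingles (here SHA-256 over 8-byte windows), so that the party running the job sees digests rather than plaintext. The function names (`digests`, `map_phase`, `shuffle`, `reduce_phase`, `leak_rates`), the shingle length, and the choice of hash are all hypothetical, and the code runs in a single process rather than on Hadoop.

```python
import hashlib
from collections import defaultdict


def digests(text, n=8):
    """Assumed transformation: the set of SHA-256 digests of all n-byte shingles."""
    data = text.encode("utf-8")
    return {hashlib.sha256(data[i:i + n]).hexdigest()
            for i in range(max(len(data) - n + 1, 1))}


def map_phase(collections):
    """Emit (digest, (kind, name)) for every digest of every collection."""
    for kind, name, items in collections:
        for d in items:
            yield d, (kind, name)


def shuffle(pairs):
    """Group emitted values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Count, per (sensitive, content) pair, the digests that occur in both."""
    counts = defaultdict(int)
    for values in groups.values():
        sensitive = [name for kind, name in values if kind == "sensitive"]
        content = [name for kind, name in values if kind == "content"]
        for s in sensitive:
            for c in content:
                counts[(s, c)] += 1
    return dict(counts)


def leak_rates(sensitive, content, n=8):
    """Fraction of each sensitive collection's digests found in each content item."""
    s_sets = {name: digests(text, n) for name, text in sensitive.items()}
    c_sets = {name: digests(text, n) for name, text in content.items()}
    collections = [("sensitive", k, v) for k, v in s_sets.items()]
    collections += [("content", k, v) for k, v in c_sets.items()]
    counts = reduce_phase(shuffle(map_phase(collections)))
    return {pair: count / len(s_sets[pair[0]]) for pair, count in counts.items()}


if __name__ == "__main__":
    sensitive = {"secret": "ACCOUNT-1234-5678-9012"}
    content = {
        "mail": "hello ACCOUNT-1234-5678-9012 bye",
        "note": "nothing to see here at all",
    }
    print(leak_rates(sensitive, content))
```

In this sketch only the digest collections would need to be handed to the outsourced job. A content item that contains the sensitive string in full shares every digest with it, while an unrelated item produces no pair at all. Plain hashing of short shingles is only a stand-in for a privacy-preserving transformation and should not be read as a security guarantee.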