Browsing by Author "Viswanath, Bimal"
Now showing 1 - 20 of 23
Results Per Page
Sort Options
- Active Learning Under Limited Interaction with Data LabelerChen, Si (Virginia Tech, 2021)Active learning (AL) aims at reducing labeling effort by identifying the most valuable unlabeled data points from a large pool. Traditional AL frameworks have two limitations: First, they perform data selection in a multi-round manner, which is time-consuming and impractical. Second, they usually assume that there are a small amount of labeled data points available in the same domain as the data in the unlabeled pool. In this thesis, we initiate the study of one-round active learning to solve the first issue. We propose DULO, a general framework for one-round setting based on the notion of data utility functions, which map a set of data points to some performance measure of the model trained on the set. We formulate the one-round active learning problem as data utility function maximization. We then propose D²ULO on the basis of DULO as a solution that solves both issues. Specifically, D²ULO leverages the idea of domain adaptation (DA) to train a data utility model on source labeled data. The trained utility model can then be used to select high-utility data in the target domain and at the same time, provide an estimate for the utility of the selected data. Our experiments show that the proposed frameworks achieves better performance compared with state-of-the-art baselines in the same setting. Particularly, D²ULO is applicable to the scenario where the source and target labels have mismatches, which is not supported by the existing works.
- Android Game Testing using Reinforcement LearningKhurana, Suhani (Virginia Tech, 2023-06-30)Android is the most popular operating system and occupies close to 70% of the market share. With the growth in the usage of Android OS, the number of games also increased and the Android play store has over 500,000 games. Testing of Android games is done either manually or through some of the existing tools which automate some parts of this testing. Manual testing requires a great deal of effort and can be expensive to afford. The existing tools which automate testing do not make use of any domain knowledge. This can cause the testing to be ineffective as the game may involve complex strategies, intricate details, widgets, etc. Existing tools like Android Monkey and Time Machine generate random Android events, including gestures like touch, swipe, clicks, and other system-level events across the application. Some deep learning methods like Wuji were only created for combat-type games. These limitations make it imperative to create a testing paradigm that uses domain knowledge as well as is easy to use by a developer who doesn't have any machine or deep learning knowledge. In this work, we develop a tool called DRAG- Deep Reinforcement learning based Android Gamer - which leverages Reinforcement Learning to learn the requisite domain knowledge and play the game in a fashion like a human would. DRAG uses a unified Reinforcement Learning agent and a Unified Reinforcement Learning environment. It only customizes the action space for each game. This generalization is done in the following ways- 1) Record an 8-minute demo video of the game and capture the underlying Android action log. 2) Analyze the recorded video and the action log to generate an action space for the Reinforcement Learning Agent. The unified RL agent is trained by providing it the score and coverage as a reward and screenshots of the game as observed states. We chose a set of 19 different open-sourced games for evaluation of the created tool. These games differ in the action set required by each of them - some require tapping icons, some require swiping in random directions, and some require more complex actions which are a combination of different gestures. The evaluation of our tool outperformed state-of-the-art TimeMachine for all 19 games and outperformed Monkey in 16 of the 19 games. This strengthens the fact that Deep Reinforcement Learning can be used to test Android games and can provide better results than tools that make no use of any domain knowledge.
- Assessing Structure–Property Relationships of Crystal Materials using Deep LearningLi, Zheng (Virginia Tech, 2020-08-05)In recent years, deep learning technologies have received huge attention and interest in the field of high-performance material design. This is primarily because deep learning algorithms in nature have huge advantages over the conventional machine learning models in processing massive amounts of unstructured data with high performance. Besides, deep learning models are capable of recognizing the hidden patterns among unstructured data in an automatic fashion without relying on excessive human domain knowledge. Nevertheless, constructing a robust deep learning model for assessing materials' structure-property relationships remains a non-trivial task due to highly flexible model architecture and the challenge of selecting appropriate material representation methods. In this regard, we develop advanced deep-learning models and implement them for predicting the quantum-chemical calculated properties (i.e., formation energy) for an enormous number of crystal systems. Chapter 1 briefly introduces some fundamental theory of deep learning models (i.e., CNN, GNN) and advanced analysis methods (i.e., saliency map). In Chapter 2, the convolutional neural network (CNN) model is established to find the correlation between the physically intuitive partial electronic density of state (PDOS) and the formation energies of crystals. Importantly, advanced machine learning analysis methods (i.e., salience mapping analysis) are utilized to shed light on underlying physical factors governing the energy properties. In Chapter 3, we introduce the methodology of implementing the cutting-edge graph neural networks (GNN) models for learning an enormous number of crystal structures for the desired properties.
- Assessment of Cyber Vulnerabilities and Countermeasures for GPS-Time Synchronized Measurements in Smart GridsKhan, Imtiaj (Virginia Tech, 2024-07-02)We aim at expanding the horizon of existing research on cyberattacks against the time-syncrhonized devices such as PMUs and PDCs, along with corresponding countermeasures. We develop a PMU-PDC cybersecurity testbed at Virginia Tech Power and Energy Center (PEC) lab. The testbed is able to simulate real-world GPS-spoofing attack (GSA) and false data injection attack (FDIA) scenarios. Moreover, the testbed can incorporates cyberattack detection algorithm in pseudo real-time. After that, we propose three stealthy attack scenarios that exploit the vulnerabilities of time-synchronization for both PMU and PDC. The next part of this dissertation is the enhancement of Hankel-matrix based bad data detection model. The existing general Hankel-matrix based bad data detection model provide satisfactory performance. However, it fails in differentiating GPS-spoofing attack from FDIA. We propose an enhanced phase angle Hankel-matrix model that can conclusively identify GPS-spoofing attack. Furthermore, we reduce the computational burden for Hankel-matrix based bad data and cyberattack detection models. Finally, we verify the effectiveness of our enhanced Hankel-matrix model for proposed stealthy attack scenarios.
- Breaking Privacy in Model-Heterogeneous Federated LearningHaldankar, Atharva Amit (Virginia Tech, 2024-05-14)Federated learning (FL) is a communication protocol that allows multiple distrustful clients to collaboratively train a machine learning model. In FL, data never leaves client devices; instead, clients only share locally computed gradients or model parameters with a central server. As individual gradients may leak information about a given client's dataset, secure aggregation was proposed. With secure aggregation, the server only receives the aggregate gradient update from the set of all sampled clients without being able to access any individual gradient. One challenge in FL is the systems-level heterogeneity that is quite often present among client devices. Specifically, clients in the FL protocol may have varying levels of compute power, on-device memory, and communication bandwidth. These limitations are addressed by model-heterogeneous FL schemes, where clients are able to train on subsets of the global model. Despite the benefits of model-heterogeneous schemes in addressing systems-level challenges, the implications of these schemes on client privacy have not been thoroughly investigated. In this thesis, we investigate whether the nature of model distribution and the computational heterogeneity among client devices in model-heterogeneous FL schemes may result in the server being able to recover sensitive information from target clients. To this end, we propose two novel attacks in the model-heterogeneous setting, even with secure aggregation in place. We call these attacks the Convergence Rate Attack and the Rolling Model Attack. The Convergence Rate Attack targets schemes where clients train on the same subset of the global model, while the Rolling Model Attack targets schemes where model-parameters are dynamically updated each round. We show that a malicious adversary is able to compromise the model and data confidentiality of a target group of clients. We evaluate our attacks on the MNIST dataset and show that using our techniques, an adversary can reconstruct data samples with high fidelity.
- Deepfake Videos in the Wild: Analysis and DetectionPu, Jiameng; Mangaokar, Neal; Kelly, Lauren; Bhattacharya, Parantapa; Sundaram, Kavya; Javed, Mobin; Wang, Bolun; Viswanath, Bimal (ACM, 2021-04)AI-manipulated videos, commonly known as deepfakes, are an emerging problem. Recently, researchers in academia and industry have contributed several (self-created) benchmark deepfake datasets, and deepfake detection algorithms. However, little effort has gone towards understanding deepfake videos in the wild, leading to a limited understanding of the real-world applicability of research contributions in this space. Even if detection schemes are shown to perform well on existing datasets, it is unclear how well the methods generalize to real-world deepfakes. To bridge this gap in knowledge, we make the following contributions: First, we collect and present the largest dataset of deepfake videos in the wild, containing 1,869 videos from YouTube and Bilibili, and extract over 4.8M frames of content. Second, we present a comprehensive analysis of the growth patterns, popularity, creators, manipulation strategies, and production methods of deepfake content in the realworld. Third, we systematically evaluate existing defenses using our new dataset, and observe that they are not ready for deployment in the real-world. Fourth, we explore the potential for transfer learning schemes and competition-winning techniques to improve defenses.
- Defending Against GPS Spoofing by Analyzing Visual CuesXu, Chao (Virginia Tech, 2020-05-21)Massive GPS navigation services are used by billions of people in their daily lives. GPS spoofing is quite a challenging problem nowadays. Existing Anti-GPS spoofing systems primarily focus on expensive equipment and complicated algorithms, which are not practical and deployable for most of the users. In this thesis, we explore the feasibility of a simple text-based system design for Anti-GPS spoofing. The goal is to use the lower cost and make the system more effective and robust for general spoofing attack detection. Our key idea is to only use the textual information from the physical world and build a real-time system to detect GPS spoofing. To demonstrate the feasibility, we first design image processing modules to collect sufficient textual information in panoramic images. Then, we simulate real-world spoofing attacks from two cities to build our training and testing datasets. We utilize LSTM to build a binary classifier which is the key for our Anti-GPS spoofing system. Finally, we evaluate the system performance by simulating driving tests. We prove that our system can achieve more than 98% detection accuracy when the ratio of attacked points in a driving route is more than 50%. Our system has a promising performance for general spoofing attack strategies and it proves the feasibility of using textual information for the spoofing attack detection.
- Defending Against Misuse of Synthetic Media: Characterizing Real-world Challenges and Building Robust DefensesPu, Jiameng (Virginia Tech, 2022-10-07)Recent advances in deep generative models have enabled the generation of realistic synthetic media or deepfakes, including synthetic images, videos, and text. However, synthetic media can be misused for malicious purposes and damage users' trust in online content. This dissertation aims to address several key challenges in defending against the misuse of synthetic media. Key contributions of this dissertation include the following: (1) Understanding challenges with the real-world applicability of existing synthetic media defenses. We curate synthetic videos and text from the wild, i.e., the Internet community, and assess the effectiveness of state-of-the-art defenses on synthetic content in the wild. In addition, we propose practical low-cost adversarial attacks, and systematically measure the adversarial robustness of existing defenses. Our findings reveal that most defenses show significant degradation in performance under real-world detection scenarios, which leads to the second thread of my work: (2) Building detection schemes with improved generalization performance and robustness for synthetic content. Most existing synthetic image detection schemes are highly content-specific, e.g., designed for only human faces, thus limiting their applicability. I propose an unsupervised content-agnostic detection scheme called NoiseScope, which does not require a priori access to synthetic images and is applicable to a wide variety of generative models, i.e., GANs. NoiseScope is also resilient against a range of countermeasures conducted by a knowledgeable attacker. For the text modality, our study reveals that state-of-the-art defenses that mine sequential patterns in the text using Transformer models are vulnerable to simple evasion schemes. We conduct further exploration towards enhancing the robustness of synthetic text detection by leveraging semantic features.
- Defending Against Trojan Attacks on Neural Network-based Language ModelsAzizi, Ahmadreza (Virginia Tech, 2020-05-15)Backdoor (Trojan) attacks are a major threat to the security of deep neural network (DNN) models. They are created by an attacker who adds a certain pattern to a portion of given training dataset, causing the DNN model to misclassify any inputs that contain the pattern. These infected classifiers are called Trojan models and the added pattern is referred to as the trigger. In image domain, a trigger can be a patch of pixel values added to the images and in text domain, it can be a set of words. In this thesis, we propose Trojan-Miner (T-Miner), a defense scheme against such backdoor attacks on text classification deep learning models. The goal of T-Miner is to detect whether a given classifier is a Trojan model or not. To create T-Miner , our approach is based on a sequence-to-sequence text generation model. T-Miner uses feedback from the suspicious (test) classifier to perturb input sentences such that their resulting class label is changed. These perturbations can be different for each of the inputs. T-Miner thus extracts the perturbations to determine whether they include any backdoor trigger and correspondingly flag the suspicious classifier as a Trojan model. We evaluate T-Miner on three text classification datasets: Yelp Restaurant Reviews, Twitter Hate Speech, and Rotten Tomatoes Movie Reviews. To illustrate the effectiveness of T-Miner, we evaluate it on attack models over text classifiers. Hence, we build a set of clean classifiers with no trigger in their training datasets and also using several trigger phrases, we create a set of Trojan models. Then, we compute how many of these models are correctly marked by T-Miner. We show that our system is able to detect trojan and clean models with 97% overall accuracy over 400 classifiers. Finally, we discuss the robustness of T-Miner in the case that the attacker knows T-Miner framework and wants to use this knowledge to weaken T-Miner performance. To this end, we propose four different scenarios for the attacker and report the performance of T-Miner under these new attack methods.
- Detecting Bots using Stream-based System with Data SynthesisHu, Tianrui (Virginia Tech, 2020-05-28)Machine learning has shown great success in building security applications including bot detection. However, many machine learning models are difficult to deploy since model training requires the continuous supply of representative labeled data, which are expensive and time-consuming to obtain in practice. In this thesis, we build a bot detection system with a data synthesis method to explore detecting bots with limited data to address this problem. We collected the network traffic from 3 online services in three different months within a year (23 million network requests). We develop a novel stream-based feature encoding scheme to support our model to perform real-time bot detection on anonymized network data. We propose a data synthesis method to synthesize unseen (or future) bot behavior distributions to enable our system to detect bots with extremely limited labeled data. The synthesis method is distribution-aware, using two different generators in a Generative Adversarial Network to synthesize data for the clustered regions and the outlier regions in the feature space. We evaluate this idea and show our method can train a model that outperforms existing methods with only 1% of the labeled data. We show that data synthesis also improves the model's sustainability over time and speeds up the retraining. Finally, we compare data synthesis and adversarial retraining and show they can work complementary with each other to improve the model generalizability.
- Exploring the Evolution of the TLS Certificate EcosystemFarhan, Syed Muhammad (Virginia Tech, 2022-06-01)A vast majority of popular communication protocols for the internet employ the use of TLS (Transport Layer Security) to secure communication. As a result, there have been numerous efforts including the introduction of Certificate Transparency logs and Free Automated CAs to improve the SSL certificate ecosystem. Our work highlights the effectiveness of these efforts using the Certificate Transparency dataset as well as certificates collected via full IPv4 scans. We show that a large proportion of invalid certificates still exists and outline reasons why these certificates are invalid and where they are hosted. Moreover, we show that the incorrect use of template certificates has led to incorrect SCTs being embedded in the certificates. Taken together, our results emphasize continued involvement for the research community to improve the web's PKI ecosystem.
- A First Look at Toxicity Injection Attacks on Open-domain ChatbotsWeeks, Connor; Cheruvu, Aravind; Abdullah, Sifat Muhammad; Kanchi, Shravya; Yao, Daphne; Viswanath, Bimal (ACM, 2023-12-04)Chatbot systems have improved significantly because of the advances made in language modeling. These machine learning systems follow an end-to-end data-driven learning paradigm and are trained on large conversational datasets. Imperfections or harmful biases in the training datasets can cause the models to learn toxic behavior, and thereby expose their users to harmful responses. Prior work has focused on measuring the inherent toxicity of such chatbots, by devising queries that are more likely to produce toxic responses. In this work, we ask the question: How easy or hard is it to inject toxicity into a chatbot after deployment? We study this in a practical scenario known as Dialog-based Learning (DBL), where a chatbot is periodically trained on recent conversations with its users after deployment. A DBL setting can be exploited to poison the training dataset for each training cycle. Our attacks would allow an adversary to manipulate the degree of toxicity in a model and also enable control over what type of queries can trigger a toxic response. Our fully automated attacks only require LLM-based software agents masquerading as (malicious) users to inject high levels of toxicity. We systematically explore the vulnerability of popular chatbot pipelines to this threat. Lastly, we show that several existing toxicity mitigation strategies (designed for chatbots) can be significantly weakened by adaptive attackers.
- Generative models meet similarity search: efficient, heuristic-free and robust retrievalDoan, Khoa Dang (Virginia Tech, 2021-09-23)The rapid growth of digital data, especially visual and textual contents, brings many challenges to the problem of finding similar data. Exact similarity search, which aims to exhaustively find all relevant items through a linear scan in a dataset, is impractical due to its high computational complexity. Approximate-nearest-neighbor (ANN) search methods, especially the Learning-to-hash or Hashing methods, provide principled approaches that balance the trade-offs between the quality of the guesses and the computational cost for web-scale databases. In this era of data explosion, it is crucial for the hashing methods to be both computationally efficient and robust to various scenarios such as when the application has noisy data or data that slightly changes over time (i.e., out-of-distribution). This Thesis focuses on the development of practical generative learning-to-hash methods and explainable retrieval models. We first identify and discuss the various aspects where the framework of generative modeling can be used to improve the model designs and generalization of the hashing methods. Then we show that these generative hashing methods similarly enjoy several appealing empirical and theoretical properties of generative modeling. Specifically, the proposed generative hashing models generalize better with important properties such as low-sample requirement, and out-of-distribution and data-corruption robustness. Finally, in domains with structured data such as graphs, we show that the computational methods in generative modeling have an interesting utility beyond estimating the data distribution and describe a retrieval framework that can explain its decision by borrowing the algorithmic ideas developed in these methods. Two subsets of generative hashing methods and a subset of explainable retrieval methods are proposed. For the first hashing subset, we propose a novel adversarial framework that can be easily adapted to a new problem domain and three training algorithms that learn the hash functions without several hyperparameters commonly found in the previous hashing methods. The contributions of our work include: (1) Propose novel algorithms, which are based on adversarial learning, to learn the hash functions; (2) Design computationally efficient Wasserstein-related adversarial approaches which have low computational and sample efficiency; (3) Conduct extensive experiments on several benchmark datasets in various domains, including computational advertising, and text and image retrieval, for performance evaluation. For the second hashing subset, we propose energy-based hashing solutions which can improve the generalization and robustness of existing hashing approaches. The contributions of our work for this task include: (1) Propose data-synthesis solutions to improve the generalization of existing hashing methods; (2) Propose energy-based hashing solutions which exhibit better robustness against out-of-distribution and corrupted data; (3) Conduct extensive experiments for performance evaluations on several benchmark datasets in the image retrieval domain. Finally, for the last subset of explainable retrieval methods, we propose an optimal alignment algorithm that achieves a better similarity approximation for a pair of structured objects, such as graphs, while capturing the alignment between the nodes of the graphs to explain the similarity calculation. The contributions of our work for this task include: (1) Propose a novel optimal alignment algorithm for comparing two sets of bag-of-vectors embeddings; (2) Propose a differentiable computation to learn the parameters of the proposed optimal alignment model; (3) Conduct extensive experiments, for performance evaluation of both the similarity approximation task and the retrieval task, on several benchmark graph datasets.
- InjectBench: An Indirect Prompt Injection Benchmarking FrameworkKong, Nicholas Ka-Shing (Virginia Tech, 2024-08-20)The integration of large language models (LLMs) with third party applications has allowed for LLMs to retrieve information from up-to-date or specialized resources. Although this integration offers numerous advantages, it also introduces the risk of indirect prompt injection attacks. In such scenarios, an attacker embeds malicious instructions within the retrieved third party data, which when processed by the LLM, can generate harmful and untruthful outputs for an unsuspecting user. Although previous works have explored how these attacks manifest, there is no benchmarking framework to evaluate indirect prompt injection attacks and defenses at scale, limiting progress in this area. To address this gap, we introduce InjectBench, a framework that empowers the community to create and evaluate custom indirect prompt injection attack samples. Our study demonstrate that InjectBench has the capabilities to produce high quality attack samples that align with specific attack goals, and that our LLM evaluation method aligns with human judgement. Using InjectBench, we investigate the effects of different components of an attack sample on four LLM backends, and subsequently use this newly created dataset to do preliminary testing on defenses against indirect prompt injections. Experiment results suggest that while more capable models are susceptible to attacks, they are better equipped at utilizing defense strategies. To summarize, our work helps the research community to systematically evaluate features of attack samples and defenses by introducing a dataset creation and evaluation framework.
- A Measurement Approach to Understanding the Data Flow of Phishing From Attacker and Defender PerspectivesPeng, Peng (Virginia Tech, 2020-01-10)Phishing has been a big concern due to its active roles in recent data breaches and state- sponsored attacks. While existing works have extensively analyzed phishing websites and detection methods, there is still a limited understanding of the data flow of the phishing process. In this thesis, we perform an empirical measurement to draw a clear picture of the data flow of phishing from both attacker and defender perspectives. First, from attackers' perspective, we want to know how attackers collect the sensitive information stolen from victims throughout the end-to-end phishing attack process. So we collected more than 179,000 real-world phishing URLs. Then we build a measurement tool to feed fake credentials to live phishing sites and monitor how the credential information is shared with the phishing server and potentially third-party collectors on the client side. Besides, we also obtain phishing kits to analyze how credentials are sent to attackers and third-parties on the server side. Then, from defenders' perspective, online scan engines such as VirusTotal are heavily used by phishing defenders to label phishing URLs, however, the data flow behind phishing detection by those scan engines is still unclear. So we build our own phishing websites, submit them to VirusTotal for scanning, to understand how VirusTotal works and the quality of its labels. Our study reveals the key mechanisms for information sharing during phishing attacks and the need for developing more rigorous methodologies to assess and make use of the labels obtained from VirusTotal.
- Measurement of Embedding Choices on Cryptographic API Completion TasksXiao, Ya; Song, Wenjia; Ahmed, Salman; Ge, Xinyang; Viswanath, Bimal; Meng, Na; Yao, Danfeng (ACM, 2023-10)In this paper, we conduct a measurement study to comprehensively compare the accuracy impacts of multiple embedding options in cryptographic API completion tasks. Embedding is the process of automatically learning vector representations of program elements. Our measurement focuses on design choices of three important aspects, program analysis preprocessing, token-level embedding, and sequence-level embedding. Our findings show that program analysis is necessary even under advanced embedding. The results show 36.20% accuracy improvement on average when program analysis preprocessing is applied to transfer byte code sequences into API dependence paths. With program analysis and the token-level embedding training, the embedding dep2vec improves the task accuracy from 55.80% to 92.04%. Moreover, only a slight accuracy advantage (0.55% on average) is observed by training the expensive sequence-level embedding compared with the token-level embedding. Our experiments also suggest the differences made by the data. In the cross-app learning setup and a data scarcity scenario, sequence-level embedding is more necessary and results in a more obvious accuracy improvement (5.10%)
- Measuring and Understanding TTL Violations in DNS ResolversBhowmick, Protick (Virginia Tech, 2024-01-02)The Domain Name System (DNS) is a scalable-distributed caching architecture where each DNS records are cached around several DNS servers distributed globally. DNS records include a time-to-live (TTL) value that dictates how long the record can be stored before it's evicted from the cache. TTL holds significant importance in aspects of DNS security, such as determining the caching period for DNSSEC-signed responses, as well as performance, like the responsiveness of CDN-managed domains. On a high level, TTL is crucial for ensuring efficient caching, load distribution, and network security in Domain Name System. Setting appropriate TTL values is a key aspect of DNS administration to ensure the reliable and efficient functioning of the Domain Name System. Therefore, it is crucial to measure how TTL violations occur in resolvers. But, assessing how DNS resolvers worldwide handle TTL is not easy and typically requires access to multiple nodes distributed globally. In this work, we introduce a novel methodology for measuring TTL violations in DNS resolvers leveraging a residential proxy service called Brightdata, enabling us to evaluate more than 27,000 resolvers across 9,500 Autonomous Systems (ASes). We found that 8.74% arbitrarily extends TTL among 8,524 resolvers that had atleast five distinct exit nodes. Additionally, we also find that the DNSSEC standard is being disregarded by 44.1% of DNSSEC-validating resolvers, as they continue to provide DNSSEC-signed responses even after the RRSIGs have expired.
- Neural Cryptanalysis for Cyber-Physical System CiphersMeno, Emma Margaret (Virginia Tech, 2021-05-18)A key cryptographic research interest is developing an automatic, black-box method to provide a relative security strength measure for symmetric ciphers, particularly for proprietary cyber-physical systems (CPS) and lightweight block ciphers. This thesis work extends the work of the recently-developed neural cryptanalysis method, which trains neural networks on a set of plaintext/ciphertext pairs to extract meaningful bitwise relationships and predict corresponding ciphertexts given a set of plaintexts. As opposed to traditional cryptanalysis, the goal is not key recovery but achieving a mimic accuracy greater than a defined base match rate. In addition to reproducing tests run with the Data Encryption Standard, this work applies neural cryptanalysis to round-reduced versions and components of the SIMON/SPECK family of block ciphers and the Advanced Encryption Standard. This methodology generated a metric able to rank the relative strengths of rounds for each cipher as well as algorithmic components within these ciphers. Given the current neural network suite tested, neural cryptanalysis is best-suited for analyzing components of ciphers rather than full encryption models. If these models are improved, this method presents a promising future in measuring the strength of lightweight symmetric ciphers, particularly for CPS.
- NoiseLearner: An Unsupervised, Content-agnostic Approach to Detect Deepfake ImagesVives, Cristian (Virginia Tech, 2022-03-21)Recent advancements in generative models have resulted in the improvement of hyper- realistic synthetic images or "deepfakes" at high resolutions, making them almost indistin- guishable from real images from cameras. While exciting, this technology introduces room for abuse. Deepfakes have already been misused to produce pornography, political propaganda, and misinformation. The ability to produce fully synthetic content that can cause such mis- information demands for robust deepfake detection frameworks. Most deepfake detection methods are trained in a supervised manner, and fail to generalize to deepfakes produced by newer and superior generative models. More importantly, such detection methods are usually focused on detecting deepfakes having a specific type of content, e.g., face deepfakes. How- ever, other types of deepfakes are starting to emerge, e.g., deepfakes of biomedical images, satellite imagery, people, and objects shown in different settings. Taking these challenges into account, we propose NoiseLearner, an unsupervised and content-agnostic deepfake im- age detection method. NoiseLearner aims to detect any deepfake image regardless of the generative model of origin or the content of the image. We perform a comprehensive evalu- ation by testing on multiple deepfake datasets composed of different generative models and different content groups, such as faces, satellite images, landscapes, and animals. Further- more, we include more recent state-of-the-art generative models in our evaluation, such as StyleGAN3 and probabilistic denoising diffusion models (DDPM). We observe that Noise- Learner performs well on multiple datasets, achieving 96% accuracy on both StyleGAN and StyleGAN2 datasets.
- On the Use of Containers in High Performance ComputingAbraham, Subil (Virginia Tech, 2020-07-09)The lightweight, portable, and flexible nature of containers is driving their widespread adoption in cloud solutions. Data analysis and deep learning applications have especially benefited from containerized solutions. As such data analysis is also being utilized in the high performance computing (HPC) domain, the need for container support in HPC has become paramount. However, container adoption in HPC face crucial performance and I/O challenges. One obstacle is that while there have been container solutions for HPC, such solutions have not been thoroughly investigated, especially from the aspect of their impact on the crucial I/O throughput needs of HPC. To this end, this paper provides a first-of-its-kind empirical analysis of state-of-the-art representative container solutions (Docker, Podman, Singularity, and Charliecloud) in HPC environments, especially how containers interact with the HPC storage systems. We present the design of an analysis framework that is deployed on all nodes in an HPC environment, and captures aspects such as CPU, memory, network, and file I/O statistics from the nodes and the storage system. We are able to garner key insights from our analysis, e.g., Charliecloud outperforms other container solutions in terms of container start-up time, while Singularity and Charliecloud are equivalent in I/O throughput. But this comes at a cost, as Charliecloud invokes the most metadata and I/O operations on the underlying Lustre file system. By identifying such optimization opportunities, we can enhance performance of containers atop HPC and help the aforementioned applications.