Browsing by Author "Chantem, Thidapat"
Now showing 1 - 20 of 32
- Anomaly Detection for Smart Infrastructure: An Unsupervised Approach for Time Series Comparison
  Gandra, Harshitha (Virginia Tech, 2022-01-25)
  Time series anomaly detection can be a very useful tool for inspecting and maintaining the health and quality of an infrastructure system. The main challenge in such a problem is the imbalanced nature of the dataset. To mitigate this problem, this thesis proposes two unsupervised anomaly detection frameworks. The first leverages the matrix profile, a data structure containing the Euclidean distance scores between the subsequences of two time series, obtained through a similarity join. The architecture couples a data fusion technique with matrix profile analysis under the constraint that different time series may have different sampling rates. To this end, we propose a framework in which a time series being evaluated for anomalies is quantitatively compared with a benchmark (anomaly-free) time series using an asynchronous time series comparison inspired by the matrix profile approach to time series anomaly detection. To evaluate its efficacy, the framework was tested on a case study comprising a Class I railroad dataset. The data collection system integrated into this railway system collects data through different data acquisition channels, each representing a different transducer. The framework was applied to all channels and the best-performing channels were identified. The average recall and precision achieved in the single-channel evaluation were 93.5% and 55%, respectively, with an error threshold of 0.04 miles (211 feet). One limitation observed in this framework was a number of false positive predictions.
  To overcome this problem, a second framework is proposed that extracts signature patterns in a time series, known as motifs, which can be leveraged to identify anomalous patterns. This motif-based framework operates under the same varied-sampling-rate constraint. Here, a feature extraction method and a clustering method were used in training a One-Class Support Vector Machine (OCSVM) coupled with a Kernel Density Estimation (KDE) technique. The average recall and precision achieved on the same case study with this framework were 74% and 57%, respectively; the second framework does not perform as well as the first. Future efforts will focus on improving this classification-based anomaly detection method.
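The matrix-profile comparison this entry describes can be illustrated with a minimal brute-force sketch (not the thesis's implementation; the series, subsequence length `m`, and injected anomaly below are made-up examples): each subsequence of the series under test is z-normalized and scored by its Euclidean distance to the nearest neighbor in the anomaly-free benchmark, so subsequences with no close match stand out.

```python
import numpy as np

def matrix_profile_ab(query_ts, benchmark_ts, m):
    """Brute-force AB-join matrix profile: for each length-m subsequence of
    query_ts, the Euclidean distance to its nearest z-normalized neighbor in
    benchmark_ts. High values flag candidate anomalies."""
    def znorm(x):
        s = x.std()
        return (x - x.mean()) / s if s > 0 else x - x.mean()
    q_subs = [znorm(query_ts[i:i + m]) for i in range(len(query_ts) - m + 1)]
    b_subs = [znorm(benchmark_ts[j:j + m]) for j in range(len(benchmark_ts) - m + 1)]
    return np.array([min(np.linalg.norm(q - b) for b in b_subs) for q in q_subs])

# Toy data: identical sinusoids except for a spike injected into the query.
rng = np.random.default_rng(0)
benchmark = np.sin(np.linspace(0, 6 * np.pi, 300)) + 0.05 * rng.standard_normal(300)
query = np.sin(np.linspace(0, 6 * np.pi, 300)) + 0.05 * rng.standard_normal(300)
query[150:160] += 5.0  # injected anomaly
profile = matrix_profile_ab(query, benchmark, m=30)
print(int(profile.argmax()))  # peaks at a subsequence overlapping the spike
```

Production implementations (e.g., the STOMP/SCRIMP family) compute the same join far more efficiently; the asynchronous comparison in the thesis further relaxes the equal-sampling-rate assumption made in this sketch.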
- Architecting IoT-Enabled Smart Building Testbed
  Amanzadeh, Leila (Virginia Tech, 2018-10-29)
  Smart buildings' benefits range from improved occupant comfort, increased productivity, reduced energy consumption and operating costs, and lower CO2 emissions to longer utility life cycles and more efficient operation of building systems [65]. Hence, modern building owners are turning toward smart buildings. However, most current smart buildings are not capable of achieving the objectives they were designed for and leave much room for improvement [22]. Therefore, a technology called the Internet of Things (IoT) is being combined with smart buildings to improve their performance [23]. IoT is the inter-networking of things embedded with electronics, software, sensors, actuators, and network connectivity to collect and exchange data, where a "thing" is anything and everything around us, even ourselves. Using this technology, a door, for example, can be a thing that senses how many people have passed its sensor to enter a space and lets the lighting system know to provide an appropriate amount of light, or the HVAC (Heating, Ventilation, and Air Conditioning) system to provide a desirable temperature. IoT provides a great deal of useful information that was previously inaccessible, e.g., the condition of water pipes in winter, which helps avoid damage such as frozen or burst pipes. However, despite all its benefits, IoT is vulnerable to cyber attacks; examples are provided in Chapter 1. In this project, among the building systems, the HVAC system is chosen to be automated with a control method called MPC (Model Predictive Control). According to the results of this project, this method is fast, very energy efficient, and regulates the space temperature to any value the occupants desire with an error rate below 0.001.
  Furthermore, a PID (Proportional-Integral-Derivative) controller was designed for the HVAC system; in the exact same cases, MPC shows much better performance. To design controllers for the HVAC system and set the temperature to the desired value, a method to automatically balance the heat flow is needed. This requires a thermal model of the building, from which the amount of heat flowing into and out of a space can be estimated regardless of the external weather. To automate the HVAC system using programming languages such as MATLAB, the thermal model of the building must be converted into a mathematical model. This mathematical model is unique to each building, depending on how many floors it has, how wide it is, and what materials were used to construct it. The conversion requires a great deal of effort and time, even for a building with two floors and two rooms per floor, and the engineer may still make errors. This project presents software that automatically converts the thermal model of a building of any size into its mathematical model, which helps improve the HVAC controllers that set the temperature to the value occupants desire and avoids the errors and time otherwise spent on calculations and troubleshooting. In addition, a test environment has been designed and constructed as a cyber-physical system that allows us to test IoT-enabled control systems before implementing them on real buildings, observe their performance, and decide whether the system is satisfactory. All cyber threats can also be explored on it and solutions to those attacks evaluated. Even systems that are already deployed can be assessed on this testbed; if any cybersecurity vulnerability is found, solutions can be evaluated to help the existing systems improve.
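The thermal-to-mathematical-model conversion described above typically targets a state-space form. As a hedged, minimal sketch (a single thermal zone with hypothetical lumped parameters, not the project's software or a real building), a first-order RC model can be written as dx/dt = Ax + Bu and simulated directly:

```python
import numpy as np

# Illustrative single-zone lumped RC thermal model (hypothetical parameters):
#   C * dT/dt = (T_out - T) / R + Q_hvac
# in state-space form dx/dt = A x + B u, with x = zone temperature and
# u = [HVAC heat input Q_hvac, outdoor temperature T_out].
R = 0.005  # envelope thermal resistance [K/W]
C = 1e6    # zone thermal capacitance  [J/K]

A = np.array([[-1.0 / (R * C)]])
B = np.array([[1.0 / C, 1.0 / (R * C)]])

# Forward-Euler simulation: constant 2 kW heating, 0 degC outdoors.
dt, T = 60.0, 0.0  # one-minute steps, starting at the outdoor temperature
for _ in range(240):  # four hours
    u = np.array([2000.0, 0.0])
    T = T + dt * (A[0, 0] * T + B[0] @ u)
print(round(T, 1))  # approaches the steady-state rise R * Q = 10 degC
```

A multi-zone building yields the same structure with one state per zone (or per wall mass) and off-diagonal entries in A for inter-zone heat flow, which is exactly the bookkeeping that becomes error-prone by hand and is worth automating.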
- Automated Analysis of Astrocyte Activities from Large-scale Time-lapse Microscopic Imaging Data
  Wang, Yizhi (Virginia Tech, 2019-12-13)
  The advent of multi-photon microscopes and highly sensitive protein sensors enables the recording of astrocyte activities in a large population of cells over a long period in vivo. Existing tools cannot fully characterize these activities, both within single cells and at the population level, because current region-of-interest-based approaches are insufficient to describe activity that is often spatially unfixed, size-varying, and propagative. Here, we present Astrocyte Quantitative Analysis (AQuA), an analytical framework that releases astrocyte biologists from the ROI-based paradigm. The framework takes an event-based perspective to model and accurately quantify the complex activity in astrocyte imaging datasets, with an event defined jointly by its spatial occupancy and temporal dynamics. To model signal propagation in astrocytes, we developed graphical time warping (GTW) to align curves with graph-structured constraints and integrated it into AQuA. To make AQuA easy to use, we designed a comprehensive software package. The software implements the detection pipeline in an intuitive step-by-step GUI with visual feedback, and also supports proofreading and the incorporation of morphology information. On synthetic data, we showed that AQuA is much more accurate than existing methods developed for astrocytic and neuronal data. We applied AQuA to a range of ex vivo and in vivo imaging datasets. Since AQuA is data-driven and based on machine learning principles, it can be applied across model organisms, fluorescent indicators, experimental modes, and imaging resolutions and speeds, enabling researchers to elucidate fundamental astrocyte physiology.
- Automated Identification and Tracking of Motile Oligodendrocyte Precursor Cells (OPCs) from Time-lapse 3D Microscopic Imaging Data of Cell Clusters in vivo
  Wang, Yinxue (Virginia Tech, 2021-06-02)
  Advances in time-lapse 3D in vivo fluorescence microscopic imaging techniques enable the observation and investigation of the migration of oligodendrocyte precursor cells (OPCs) and its role in the central nervous system. However, current practice in image-based OPC motility analysis relies heavily on manual labeling and tracking on 2D max projections of the 3D data, which suffers from intensive human labor, subjective biases, weak reproducibility, and especially information loss and distortion. Moreover, due to the lack of an OPC-specific genetically encoded indicator, OPCs can only be distinguished from other oligodendrocyte lineage cells by their observed motion patterns. Automated analytical tools are needed for the identification and tracking of OPCs. In this dissertation work, we propose an analytical framework, MicTracker (Migrating Cell Tracker), for the integrated task of identifying, segmenting, and tracking migrating cells (OPCs) from in vivo time-lapse fluorescence imaging data of high-density cell clusters composed of cells with different modes of motion. As a component of the framework, we present a novel strategy for cell segmentation with global temporal consistency enforced, tackling the challenges caused by the highly clustered cell population and temporally inconsistently blurred boundaries between touching cells. We also design a data association algorithm to address the violation of the usual assumption of small displacements. Recognizing that the violation arises in the mixed cell population composed of two cell groups while the assumption holds within each group, we propose to solve this seemingly impossible mission by de-mixing the two groups of cell motion modes without known labels. We demonstrate the effectiveness of MicTracker in solving our problem on real in vivo data.
- Brain Signal Quantification and Functional Unit Analysis in Fluorescent Imaging Data by Unsupervised Learning
  Mi, Xuelong (Virginia Tech, 2024-06-04)
  Optical recording of various brain signals is becoming an indispensable technique for biological studies, accelerated by the development of new or improved biosensors and microscopy technology. A major challenge in leveraging the technique is to identify and quantify the rich patterns embedded in the data. However, existing methods often struggle, either due to their limited signal analysis capabilities or poor performance. Here we present Activity Quantification and Analysis (AQuA2), an innovative analysis platform built upon machine learning theory. AQuA2 features a novel event detection pipeline for precise quantification of intricate brain signals and incorporates a Consensus Functional Unit (CFU) module to explore interactions among potential functional units driving repetitive signals. To enhance efficiency, we developed the BIdirectional pushing with Linear Component Operations (BILCO) algorithm to handle propagation analysis, a step that is time-consuming with traditional algorithms. Furthermore, with user-friendliness in mind, AQuA2 is implemented as both a MATLAB package and a Fiji plugin, complete with a graphical interface for enhanced usability. Validation of AQuA2 through both simulation and real-world applications demonstrates its superior performance compared with its peers. Applied across various sensors (calcium, NE, and ATP), cell types (astrocytes, oligodendrocytes, and neurons), animal models (zebrafish and mouse), and imaging modalities (two-photon, light sheet, and confocal), AQuA2 consistently delivers promising results and novel insights, showcasing its versatility in fluorescent imaging data analysis.
- Conflict, Paradox, and the Role of Structure in True Intelligence
  Bettendorf, Isaac T. (Virginia Tech, 2024-04-04)
  Novel forms of brain-inspired programming models, tied to novel computer architecture, are required both to understand the mysteries of intelligence and to break barriers in computational complexity and computer parallelism. Artificial intelligence is focused on developing complex programs based on abstract, statistical prediction engines that require large datasets, vast amounts of computational power, and unbounded computation time. By contrast, the brain utilizes relatively few experiences to make decisions in unpredictable, time-constrained situations while utilizing relatively little physical computational space and power with high degrees of complexity and parallelism. We observe that intelligence requires the accommodation of ambiguity, conflict, and paradox. From a structural perspective, this means the same set of inputs leads to conflicting results that are likely produced in isolated regions of the brain that function independently until an answer must be chosen. We further observe that, unlike computer programs, brains constantly interact with the physical world, where external constraints force the selection of the best available response in time-quality trade-offs ranging from fight-or-flight to deep thinking. For example, when intelligent beings are presented with a set of inputs, those inputs can be processed with different levels of thinking, utilizing heterogeneous algorithms to produce answers dependent upon the time available to process them. We introduce the Troop meta-approach, a novel meta-level computer architecture and programming model. Experiments demonstrate our approach in modeling conflict when the same set of inputs is processed heterogeneously and independently, using maze solving and ordered search in real-world environments with unpredictable, random time constraints.
  Across one hundred trials, on average, the Troop solution solves mazes almost six times faster than the only other solution, which does not accommodate conflict but can always produce a result when required. Two other experiments based on ordered search show that, on average, the Troop solution returns a position that is over twice as accurate as the other solutions, which do not accommodate conflict but always produce a result when required. This work lays the foundation for further research into algorithms that utilize time-accuracy trade-offs consistent with our approach.
- Deadline-Aware Task Offloading for Vehicular Edge Computing Networks using Traffic Lights Data
  Oza, Pratham; Hudson, Nathaniel; Chantem, Thidapat; Khamfroush, Hana (ACM, 2023)
  As vehicles become increasingly automated, novel vehicular applications are emerging to enhance the safety and security of vehicles and improve the user experience. This brings ever-increasing data and resource requirements for timely computation on the vehicle's on-board computing systems. To alleviate these demands, prior work has proposed deploying vehicular edge computing (VEC) onto road-side units (RSUs), to which vehicles can offload compute-intensive tasks. Due to the limited communication range of RSUs, vehicular movements influenced by traffic conditions affect the communication between vehicles and RSUs and can increase the response times of offloaded applications. Existing task offloading strategies do not consider the influence of traffic lights on vehicular mobility when offloading workloads to RSUs, and thereby cause deadline misses and quality-of-service (QoS) degradation. In this paper, we present a novel task model that captures time- and location-specific requirements of vehicular applications. We then present a deadline-based strategy that incorporates traffic light data to opportunistically offload tasks. Our approach allows up to 33% more tasks to be offloaded onto the RSUs, compared to existing work, without causing any deadline misses, thereby maximizing resource utilization on the RSUs.
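The deadline check implied by such a strategy can be sketched at a high level. The function below is a hedged illustration, not the paper's task model: all names and numbers (link bandwidth, RSU compute speed, dwell time derived from traffic light data) are invented for the example. A task is offloaded only if upload, remote execution, and result download fit within both its deadline and the time the vehicle remains in the RSU's range.

```python
# Hedged sketch of a deadline-aware offload decision (illustrative only).
def should_offload(task_size_bits, result_size_bits, cycles, deadline_s,
                   bandwidth_bps, rsu_cps, dwell_time_s):
    """Offload iff the end-to-end response fits the task deadline AND the
    vehicle's dwell time in RSU range (e.g., while held at a red light)."""
    upload = task_size_bits / bandwidth_bps    # vehicle -> RSU
    execute = cycles / rsu_cps                 # on the RSU
    download = result_size_bits / bandwidth_bps  # RSU -> vehicle
    response = upload + execute + download
    return response <= deadline_s and response <= dwell_time_s

# Vehicle stopped at a red light near the RSU for 8 s; 0.5 s task deadline.
print(should_offload(2e6, 1e5, 5e8, 0.5, 27e6, 2e9, 8.0))  # -> True
```

Traffic light data enters through the dwell-time estimate: a vehicle predicted to be stopped at a red gives the scheduler a much longer, more reliable window than one predicted to pass through on green.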
- Defending Real-Time Systems through Timing-Aware Designs
  Mishra, Tanmaya (Virginia Tech, 2022-05-04)
  Real-time computing systems are those designed to achieve computing goals by certain deadlines. They are present in everything from cars to airplanes, pacemakers to industrial-control systems, and other pieces of critical infrastructure. With the increasing interconnectivity of these systems, the security issues and constant threat of manipulation by malicious external attackers that have plagued general computing systems now threaten the integrity and safety of real-time systems as well. This dissertation discusses three defense techniques that focus on the role real-time scheduling theory can play in reducing runtime cost and guaranteeing correctness when applying defense strategies to real-time systems. The first introduces a novel timing-aware defense strategy for the CAN bus that utilizes TrustZone on state-of-the-art ARMv8-M microcontrollers. The second reduces the runtime cost of control-flow integrity (CFI), a popular system security defense technique, by correctly modeling when a real-time system performs I/O and exploiting the model to schedule CFI procedures efficiently. Finally, the third studies and provides a lightweight mitigation strategy for a recently discovered vulnerability in mixed-criticality real-time systems.
- Designing Security Defenses for Cyber-Physical Systems
  Foruhandeh, Mahsa (Virginia Tech, 2022-05-04)
  Legacy cyber-physical systems (CPSs) were designed without cybersecurity as a primary design tenet, especially considering their evolving operating environment. Examples of legacy systems include automotive control, navigation, transportation, and industrial control systems (ICSs), to name a few. To make matters worse, the cost of designing and deploying defenses in existing legacy infrastructure can be overwhelming, as millions or even billions of legacy CPS devices are already in use. This economic angle prevents the use of defenses that are not backward compatible. Moreover, any protection has to operate efficiently in resource-constrained environments that are dynamic in nature. Hence, existing approaches that require expensive additional hardware, propose a new protocol from scratch, or rely on complex numerical operations such as strong cryptographic solutions are less likely to be deployed in practice. In this dissertation, we explore a variety of lightweight solutions for securing different existing CPSs without requiring any modifications to the original system design at the hardware or protocol level. In particular, we use fingerprinting, crowdsourcing, and deterministic models as alternative backward-compatible defenses for securing vehicles, global positioning system (GPS) receivers, and a class of ICSs called supervisory control and data acquisition (SCADA) systems, respectively. We use fingerprinting to address the deficiencies in automotive cybersecurity from the angle of controller area network (CAN) security. The CAN protocol is the de facto bus standard commonly used in the automotive industry for connecting electronic control units (ECUs) within a vehicle.
  The broadcast nature of this protocol, along with the lack of authentication or integrity guarantees, creates a foothold for adversaries to perform arbitrary data injection, modification, and impersonation attacks on the ECUs. We propose SIMPLE, a single-frame-based physical-layer identification for intrusion detection and prevention on such networks. Physical-layer identification, or fingerprinting, takes advantage of manufacturing inconsistencies in the hardware components that generate the analog signal of the CPS of interest. It translates the manifestation of these inconsistencies, which appear in the analog signals, into unique features called fingerprints, which can later be used for authentication. Our solution is resilient to variations in ambient temperature and supply voltage, as well as to aging. Next, we use fingerprinting and crowdsourcing in two separate protection approaches, leveraging two different perspectives, for securing GPS receivers against spoofing attacks. GPS is the most predominant unauthenticated navigation system. The security issues inherent in civilian GPS are exacerbated by the fact that its design and implementation are public knowledge. To address this problem, we first introduce Spotr, GPS spoofing detection via device fingerprinting, which determines the authenticity of signals based on their physical-layer similarity to signals known to have originated from GPS satellites. More specifically, we can detect spoofing activities and track genuine signals over different times, locations, and propagation effects related to environmental conditions. In a different approach at a higher level, we put forth Crowdsourcing GPS, a complete solution for GPS spoofing detection, recovery, and attacker localization.
  Crowdsourcing is a method in which multiple entities share their observations of the environment and come together as a whole to make a more accurate or reliable decision on the status of the system. Crowdsourcing has the advantage of deployment with less complexity and distributed cost; however, its functionality depends on the user adoption rate. We present two methods for implementing Crowdsourcing GPS. In the first method, the users in the crowd are aware of their approximate distance from other users via Bluetooth. They cross-validate this approximate distance against the GPS-derived distance and, in case of any discrepancy, report ongoing spoofing activity. This method is a strong candidate when the users in the crowd are sparsely distributed. It is also very effective against multiple coordinated adversaries. In the second method, we exploit the angular dispersion of the users with respect to the direction from which the adversarial signal is transmitted. As a result, users who are not facing the attacker will be safe, because the human body consists mostly of water and absorbs the weak adversarial GPS signal. The safe users help the spoofed users discover the ongoing attack and recover from it. Additionally, the angular information is used to localize the adversary. This method is slightly more complex and shows the best performance in dense areas. It is also designed under the assumption that the spoofing attack is terrestrial. Finally, we propose a tandem IDS to secure SCADA systems. SCADA systems play a critical role in most safety-critical ICS infrastructures. The evolution of communications technology has rendered modern SCADA systems and their connected actuators and sensors vulnerable to malicious attacks on both the physical and application layers. Conventional IDSs built for securing SCADA systems focus on a single layer of the system.
  With the tandem IDS we break this habit and propose a strong multi-layer solution that can expose a wide range of attacks. To be more specific, the tandem IDS comprises two parts: a traditional network IDS and a shadow replica. We design the shadow replica as a deterministic IDS. It performs a workflow analysis and ensures that the logical flow of events in the SCADA controller and its connected devices maintains the expected states. Any deviation indicates either malicious activity or a reliability issue. To model the application-level events, we leverage finite state machines (FSMs) to compute the anticipated states of all the devices. This is feasible because in many existing ICSs the flow of traffic, and the resulting states and actions in the connected devices, are deterministic in nature. Consequently, this leads to a reliable solution free of uncertainty. Aside from detecting traditional network attacks, our approach bypasses the attacker if it succeeds in taking over the devices and maintains continuous service if the SCADA controller is compromised.
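The FSM-based workflow check behind a shadow replica can be sketched with a tiny transition table. This is a hedged illustration, not the dissertation's implementation: the valve states and events below are hypothetical. Observed SCADA events are replayed against the legal transitions, and any event that is not a legal transition from the current state is flagged.

```python
# Minimal sketch of a deterministic shadow replica: mirror device state with an
# FSM and flag events that are not legal transitions (illustrative states only).
VALVE_FSM = {
    ("closed", "open_cmd"): "opening",
    ("opening", "opened_ack"): "open",
    ("open", "close_cmd"): "closing",
    ("closing", "closed_ack"): "closed",
}

def replay(events, state="closed"):
    """Replay observed SCADA events; return (final_state, alerts)."""
    alerts = []
    for ev in events:
        nxt = VALVE_FSM.get((state, ev))
        if nxt is None:
            alerts.append(f"illegal event '{ev}' in state '{state}'")
        else:
            state = nxt
    return state, alerts

# The stray 'closed_ack' (no close_cmd was ever issued) is flagged as anomalous.
state, alerts = replay(["open_cmd", "opened_ack", "closed_ack"])
print(state, alerts)
```

Because the legal workflow is deterministic, any alert is either an attack (e.g., an injected actuation) or a reliability fault, which is what makes this layer complementary to a traditional network IDS.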
- An Efficient Knapsack-Based Approach for Calculating the Worst-Case Demand of AVR Tasks
  Bijinemula, Sandeep Kumar (Virginia Tech, 2019-02-01)
  Engine-triggered tasks are real-time tasks released when the crankshaft arrives at certain positions in its path of rotation. This makes the release rate of these jobs a function of the crankshaft's angular speed and acceleration. In addition, several properties of engine-triggered tasks, such as execution times and deadlines, depend on the speed profile of the crankshaft. Such tasks are referred to as adaptive variable-rate (AVR) tasks. Existing methods to calculate the worst-case demand of AVR tasks are either inaccurate or computationally intractable. We propose a method to efficiently calculate the worst-case demand of AVR tasks by transforming the problem into a variant of the knapsack problem. We then propose a framework to systematically narrow down the search space associated with finding the worst-case demand of AVR tasks. Experimental results show that our approach is at least 10 times faster, with an average runtime improvement of 146 times for randomly generated task sets, when compared to the state-of-the-art technique.
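The flavor of the knapsack transformation can be illustrated with a standard 0/1 knapsack dynamic program (a hedged sketch, not the paper's exact variant or its search-space-narrowing framework): each candidate job release contributes execution time (the "value") and consumes part of the analysis window (the "weight"), and the worst-case demand is the maximum total execution time that fits in the window. The job parameters below are hypothetical.

```python
# Illustrative knapsack view of worst-case demand (not the paper's exact variant).
def max_demand(jobs, window):
    """jobs: list of (exec_time, min_separation); window: discretized time budget.
    Classic 0/1 knapsack DP: dp[t] = max demand achievable within budget t."""
    dp = [0.0] * (window + 1)
    for exec_time, sep in jobs:
        for t in range(window, sep - 1, -1):  # reverse scan: each job used once
            dp[t] = max(dp[t], dp[t - sep] + exec_time)
    return dp[window]

# Three hypothetical AVR job types (WCET, minimum inter-arrival), window of 10:
print(max_demand([(3, 4), (2, 3), (5, 6)], 10))  # -> 8.0 (picks jobs 1 and 3)
```

In the real AVR setting the "weights" (inter-arrival times) themselves depend on the crankshaft speed profile, which is what makes the problem a knapsack *variant* rather than the textbook formulation sketched here.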
- An Empirical Method of Ascertaining the Null Points from a Dedicated Short-Range Communication (DSRC) Roadside Unit (RSU) at a Highway On/Off-Ramp
  Walker, Jonathan Bearnarr (Virginia Tech, 2018-09-26)
  The deployment of dedicated short-range communication (DSRC) roadside units (RSUs) allows a connected or automated vehicle to acquire information from the surrounding environment using vehicle-to-infrastructure (V2I) communication. However, wireless communication using DSRC has been shown to exhibit null points at repeatable distances: significant, unexpected losses in wireless signal strength along the pathway of the V2I communication. If the wireless connection is poor or non-existent, a V2I safety application will not obtain sufficient data to perform its operational services. In other words, a poor wireless connection between a vehicle and infrastructure (e.g., an RSU) could hamper the performance of a safety application. For example, a designer of a V2I safety application may require a minimum data rate (or packet count) over 1,000 meters to effectively implement a Reduced Speed/Work Zone Warning (RSZW) application. The RSZW safety application is aimed at alerting or warning drivers, in a Cooperative Adaptive Cruise Control (CACC) platoon, who are approaching a work zone. Therefore, the packet count and/or signal strength threshold criterion must be determined by the developer of the V2I safety application. Thus, we selected an arbitrary criterion to develop an empirical method of ascertaining the null points from a DSRC RSU. The research motivation focuses on developing an empirical method of calculating the null points of a DSRC RSU for V2I communication at a highway on/off-ramp. The intent is to improve safety, mobility, and environmental applications, since a map of the null points can be plotted against the distance between the DSRC RSU and a vehicle's onboard unit (OBU).
  The main research question asks: "What is a more robust empirical method, compared to the horizontal and vertical laws of reflection formula, for determining the null points from a DSRC RSU on a highway on/off-ramp?" The research objectives are as follows:
  1. Explain where and why null points occur from a DSRC RSU (Chapter 2)
  2. Apply the existing horizontal and vertical polarization model and discuss its limitations in a real-world scenario for a DSRC RSU on a highway on/off-ramp (Chapter 3 and Appendix A)
  3. Introduce an extended horizontal and vertical polarization null point model using empirical data (Chapter 4)
  4. Discuss the conclusion, limitations of the work, and future research (Chapter 5)
  The simplest way to understand where and why null points occur is to depict two sinusoidal waves: a direct wave and a reflected wave (i.e., a two-ray model). The null points for a DSRC RSU occur because the direct and reflected waves produce destructive interference (i.e., a decrease in signal strength) when they collide. Moreover, the null points can be located by applying the Pythagorean theorem to the direct and reflected wave paths. Two existing models were leveraged to analyze null points: 1) signal strength loss (i.e., a free-space path loss model, or FSPL, in Appendix A) and 2) the existing horizontal and vertical polarization null points from a DSRC RSU. Using empirical data from two different field tests, the existing horizontal and vertical polarization null point model was shown to have limitations at short distances from the DSRC RSU. Moreover, the existing horizontal and vertical polarization model for null points was extremely challenging to replicate across over 15 DSRC RSU data sets.
  After calculating the null points for several DSRC RSU heights, the paper noted a limitation of the existing horizontal and vertical polarization null point model across over 15 DSRC RSU data sets (i.e., the model does not account for null points along the full length of the FSPL model). An extended horizontal and vertical polarization model is proposed that calculates the null points from a DSRC RSU. There are 18 model comparisons of the packet counts and signal strengths at various thresholds as prospective extended horizontal and vertical polarization models. This paper compares the predictive ability of the 18 models and measures their fit. Finally, a prediction graph is depicted with the neural network's probability profile for packet counts = 1 when greater than or equal to 377. Likewise, a Python script for the extended horizontal and vertical polarization model is provided in Appendix C. Consequently, the neural network model was applied to 10 different DSRC RSU data sets at 10 unique locations around a circular test track, with packet counts ranging from 0 to 11. Neural network models were generated for 10 DSRC RSUs using three thresholds, with the objective of comparing the predictive ability of each model and measuring its fit. Based on 30 models at 10 unique locations, the highest misclassification rate was 0.1248, while the lowest was 0.000. Six RSUs were mounted at 3.048 meters (or 10 feet) from the ground, with misclassification rates ranging from 0.1248 to 0.0553. Out of 18 models, seven had a misclassification rate greater than 0.110, while the remaining misclassification rates were less than 0.0993. Four RSUs were mounted at 6.096 meters (or 20 feet) from the ground, with misclassification rates ranging from 0.919 to 0.000. Out of 12 models, four had a misclassification rate greater than 0.0590, while the remaining misclassification rates were less than 0.0412.
  Finally, there are two major limitations in the research: 1) the most effective key parameter is the packet count, which often requires expensive data acquisition equipment to obtain, and 2) the categorical model type (i.e., decision tree, logistic regression, or neural network) will vary based on the packet count or signal strength threshold dictated by the threshold criterion. There are at least two future research areas corresponding to this body of work: 1) leveraging the extended horizontal and vertical polarization null point model on multiple DSRC RSUs along a highway on/off-ramp, and 2) applying and validating different electric and magnetic (or propagation) models.
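The two-ray geometry behind these null points can be sketched numerically. This is a hedged illustration, not the dissertation's extended model: the RSU and OBU antenna heights are hypothetical, and it assumes the idealized ~180-degree phase shift on ground reflection, so a null falls where the reflected path exceeds the direct path by an integer number of wavelengths.

```python
import math

# Two-ray null-point sketch for DSRC at 5.9 GHz (illustrative geometry).
# Direct and ground-reflected path lengths come from the Pythagorean theorem;
# with the ~180-degree reflection phase shift, destructive interference occurs
# where the path-length difference equals n wavelengths.
C = 299_792_458.0           # speed of light [m/s]
WAVELENGTH = C / 5.9e9      # ~0.0508 m at the DSRC carrier

def path_difference(d, h_tx, h_rx):
    direct = math.hypot(d, h_tx - h_rx)
    reflected = math.hypot(d, h_tx + h_rx)
    return reflected - direct

def nth_null_distance(n, h_tx, h_rx, lo=1.0, hi=10_000.0):
    """Bisection: path_difference decreases monotonically with distance d."""
    target = n * WAVELENGTH
    for _ in range(200):
        mid = (lo + hi) / 2
        if path_difference(mid, h_tx, h_rx) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical geometry: RSU antenna at 6.096 m (20 ft), OBU antenna at 1.5 m.
# n = 1 gives the farthest null; larger n gives nulls closer to the RSU.
print(round(nth_null_distance(1, 6.096, 1.5), 1))
```

The far-field approximation path_difference ≈ 2·h_tx·h_rx/d shows why raising the RSU pushes the nulls outward, consistent with the height-dependent behavior reported above.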
- HetMigrate: Secure and Efficient Cross-architecture Process Live MigrationBapat, Abhishek Mandar (Virginia Tech, 2023-01-31)The slowdown of Moore's Law opened a new era of computer research and development. Researchers started exploring alternatives to the traditional CPU design, and a constant increase in consumer demands led to the development of CMPs, GPUs, and FPGAs. Recent research proposed heterogeneous-ISA systems and implemented the systems software necessary to make such systems functional. Evaluations have shown that heterogeneous-ISA systems can offer better throughput and energy efficiency than homogeneous-ISA systems. Due to their low cost, ARM servers are now being adopted in data centers (e.g., AWS Graviton). While prior work provided the infrastructure necessary to run applications on heterogeneous-ISA systems, its dependency on a specialized kernel and a custom compiler increases deployment and maintenance costs. This thesis presents HetMigrate, a framework to live-migrate Linux processes over heterogeneous-ISA systems. HetMigrate integrates with CRIU, a Linux mechanism for process migration, and runs on stock Linux operating systems, which improves its deployability. Furthermore, HetMigrate transforms the process's state externally, without instrumenting state-transformation code into the process binaries, which has security benefits and also improves deployability. Our evaluations on the Redis server and the NAS Parallel Benchmarks show that HetMigrate takes an average of 720 ms to fully migrate a process across ISAs while maintaining its state. Moreover, live-migrating with HetMigrate reduces the attack surface of a process by up to 72.8% compared to prior work, and HetMigrate is easier to deploy in real-world systems. To demonstrate this deployability, we ran HetMigrate in a variety of environments: cloud instances (e.g., CloudLab), local setups virtualized with QEMU/KVM, and a server-embedded board pair.
Similar to prior work, we also evaluated the energy and throughput benefits that heterogeneous-ISA systems can offer by connecting a Xeon server to three embedded boards over the network. We observed that selectively offloading compute-intensive workloads to the embedded boards can increase energy efficiency by up to 39% and throughput by up to 52%, while increasing cost by just 10%.
- An Integrated Real-Time and Security Scheduling Framework for CPSKansal, Kriti (Virginia Tech, 2023-05-18)In the world of real-time systems (RTS), security has often been overlooked in the design process. However, with the emergence of the Internet of Things and Cyber-Physical Systems, RTS are now frequently used in interconnected applications where data is shared regularly. Unfortunately, this increased connectivity has also led to a larger attack surface. As a result, it is crucial to redesign RTS not only to meet real-time requirements but also to be resilient to threats. To address this issue, we propose a new real-time security co-design task model and an accompanying scheduling framework, where schedulability indicates whether both real-time and security requirements are met. Our algorithm is designed to be flexible, allowing different security mechanisms to be used alongside real-time tasks. Specifically, we augment the frame-based task model by introducing an n-dimensional security matrix, which serves as a powerful tool to enable our approach: by storing the worst-case execution times of tasks, the matrix indicates which defense mechanisms are available for each task in the system. We then transform the problem of maximizing security, subject to schedulability, into a variant of the knapsack problem. To make this approach more practical, we implement a fully polynomial-time approximation scheme (FPTAS) that reduces the complexity of solving the knapsack problem from pseudo-polynomial to fully polynomial time. We also experiment with a greedy-heuristic approach and compare the results of both algorithms.
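The knapsack transformation described above can be illustrated with the textbook FPTAS for 0/1 knapsack: profits are scaled down by a factor that depends on the error bound, then an exact dynamic program runs on the scaled instance in fully polynomial time. The mapping of (task, defense) pairs to items is an assumption for illustration, not the framework's actual implementation:

```python
def knapsack_fptas(values, weights, capacity, eps=0.1):
    """Textbook 0/1-knapsack FPTAS: scale profits, then run an exact DP
    over scaled value. Returns (approx_total_value, chosen_indices).

    Illustrative mapping only: in the scheduling framework an 'item' could
    be a (task, defense-mechanism) pair, its weight the extra worst-case
    execution time, and the capacity the schedulable slack.
    """
    n = len(values)
    vmax = max(values)
    K = eps * vmax / n                       # scaling factor
    sv = [int(v // K) for v in values]       # scaled (rounded-down) profits
    total = sum(sv)
    INF = float("inf")
    # min_w[g] = minimum total weight achieving scaled value exactly g
    min_w = [0.0] + [INF] * total
    choice = [set() for _ in range(total + 1)]
    for i in range(n):
        for g in range(total, sv[i] - 1, -1):    # descending: 0/1 semantics
            cand = min_w[g - sv[i]] + weights[i]
            if cand < min_w[g]:
                min_w[g] = cand
                choice[g] = choice[g - sv[i]] | {i}
    best = max(g for g in range(total + 1) if min_w[g] <= capacity)
    picked = choice[best]
    return sum(values[i] for i in picked), sorted(picked)

print(knapsack_fptas([60, 100, 120], [10, 20, 30], 50, eps=0.1))
# -> (220, [1, 2])
```

Because the DP state space is bounded by the sum of scaled profits (polynomial in n and 1/eps), the pseudo-polynomial dependence on the raw profit magnitudes disappears, which is exactly the FPTAS trade-off the abstract refers to.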
- Machine Learning Approaches for Modeling and Correction of Confounding Effects in Complex Biological DataWu, Chiung Ting (Virginia Tech, 2021-06-09)With the huge volume of biological data generated by new technologies and the booming of new machine learning based analytical tools, we expect to advance life science and human health at an unprecedented pace. Unfortunately, there is a significant gap between the complex raw biological data from real life and the data required by mathematical and statistical tools. This gap arises from two fundamental and universal problems in biological data, both related to confounding effects. The first is the intrinsic complexity of the data: an observed sample may be a mixture of multiple underlying sources, of which we may be interested in only one or a few. The second type of complexity comes from the acquisition process: different samples may be gathered at different times and/or from different locations, so each sample carries a specific distortion that must be carefully addressed. These confounding effects obscure the signals of interest in the acquired data. Specifically, this dissertation addresses the two major challenges in removing confounding effects: alignment and deconvolution. Liquid chromatography–mass spectrometry (LC-MS) is a standard method for proteomics and metabolomics analysis of biological samples. Unfortunately, it suffers from various changes in the retention time (RT) of the same compound in different samples, and these must be subsequently corrected (aligned) during data processing. Classic alignment methods, such as those in the popular XCMS package, often assume a single time-warping function for each sample. Thus, the potentially varying RT drift for compounds with different masses in a sample is neglected, and the systematic change in RT drift across run order is often not considered by alignment algorithms.
Therefore, these methods cannot effectively correct all misalignments. To utilize this information, we develop an integrated reference-free profile alignment method, neighbor-wise compound-specific Graphical Time Warping (ncGTW), that can detect misaligned features and align profiles by leveraging expected RT drift structures and compound-specific warping functions. Specifically, ncGTW uses individualized warping functions for different compounds and assigns constraint edges on the warping functions of neighboring samples. We applied ncGTW to two large-scale metabolomics LC-MS datasets, identifying many misaligned features and successfully realigning them; these features would otherwise be discarded or left uncorrected by existing methods. When the desired signal is buried in a mixture, deconvolution is needed to recover the pure sources, since many biological questions can be better addressed when the data is in the form of individual sources instead of mixtures. Though there are some promising supervised deconvolution methods, unsupervised deconvolution is still needed when no a priori information is available. Among current unsupervised methods, Convex Analysis of Mixtures (CAM) is the most theoretically solid and best-performing one. However, it has some major limitations. Most importantly, its overall time complexity can be very high, especially when analyzing a large dataset or a dataset with many sources. Also, because some of its steps are stochastic and heuristic, the deconvolution result is not sufficiently accurate. To address these problems, we redesigned the modules of CAM. In the feature clustering step, we propose radius-fixed clustering, which not only bounds the spatial size of each cluster but also identifies outliers simultaneously, thereby avoiding the disadvantages of K-means clustering, such as instability and the need to specify the number of clusters.
Moreover, when identifying the convex hull, we replace Quickhull with linear programming, which decreases the computation time significantly. To avoid the heuristic and approximate step in optimal simplex identification, we propose a greedy search strategy instead. The experimental results demonstrate a substantial improvement in computation time, and the deconvolution accuracy is also shown to be higher than that of the original CAM.
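One plausible reading of the radius-fixed clustering idea can be sketched in a few lines. The seeding rule and the `min_size` outlier threshold below are assumed details for illustration, not the CAM redesign itself: clusters are capped to a fixed radius, the number of clusters emerges from the data, and points that fail to form a large enough cluster are flagged as outliers:

```python
import numpy as np

def radius_fixed_clustering(X, radius, min_size=3):
    """Hypothetical sketch of radius-fixed clustering.

    Every cluster is capped to a fixed radius around its seed, the number
    of clusters is not specified in advance, and points landing in clusters
    smaller than `min_size` are flagged as outliers. Illustration only.
    """
    X = np.asarray(X, dtype=float)
    unassigned = set(range(len(X)))
    clusters, outliers = [], []
    while unassigned:
        seed = min(unassigned)  # deterministic seed choice (an assumption)
        # gather every unassigned point within `radius` of the seed
        members = sorted(i for i in unassigned
                         if np.linalg.norm(X[i] - X[seed]) <= radius)
        unassigned -= set(members)
        if len(members) >= min_size:
            clusters.append(members)
        else:
            outliers.extend(members)
    return clusters, outliers

# Two tight blobs plus one isolated point: the isolated point is an outlier.
X = [[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [5, 5.1], [20, 20]]
print(radius_fixed_clustering(X, 1.0))
# -> ([[0, 1, 2], [3, 4, 5]], [6])
```

Note how the two K-means pain points named above disappear: no cluster count is supplied, and the isolated point is reported as an outlier rather than being absorbed into the nearest cluster.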
- Machine learning enabled bioinformatics tools for analysis of biologically diverse samplesLu, Yingzhou (Virginia Tech, 2023-08-25)Advanced molecular profiling technologies, utilizing the entire human genome, have opened new avenues to study biological systems. In recent decades, vast volumes of multi-omics data spanning a broad range of phenotypes have been generated, making the development of advanced bioinformatics tools to extract informative biomarkers from these data increasingly important, especially for understanding the biological pathways responsible for disease development. The identification of signature genes and the analysis of differentially networked genes are two fundamental and critically important tasks. However, many current methodologies employ test statistics that do not fulfill the exact definition of a marker gene, leaving them inherently susceptible to deriving imprecise signatures. The problem is further compounded when attempting to identify marker genes across biologically diverse samples, especially when comparing more than two biological conditions. Additionally, traditional differential group analysis or co-expression analysis under a single condition often falls short in certain scenarios. For instance, the subtle expression levels of transcription factors (TFs) make their detection daunting, despite their pivotal role in guiding gene expression. Pinpointing the network landscape of complex diseases and isolating core genes for subsequent analysis are likewise challenging tasks, yet these marker genes are instrumental in identifying potentially pivotal pathways.
Multi-omics data, with its inherent complexity and diversity, presents unique challenges that traditional methods struggle to address effectively. In response, we developed the Cosine-based One-sample Test (COT), a method designed for the analysis of biologically diverse samples. Tailored to discern marker genes across a spectrum of subtypes using their expression profiles, COT employs a one-sample test framework: its test statistic is the cosine similarity between a molecule's expression profile across subtypes and the exact mathematical representation of an ideal marker gene. To ensure ease of application and accessibility, we encapsulated the COT workflow in a Python package. We evaluated COT's marker-gene detection against contemporary methods on realistic simulation data and found it adept at handling both gene expression data and proteomics data sourced from enriched tissue or cell subtype samples, where it showed superior performance. Applying COT to gene expression and proteomics data from distinct tissue or cell subtypes led to innovative findings and hypotheses in several biomedical case studies. Additionally, we have enhanced the Differential Dependency Network (DDN) framework to detect network rewiring between conditions, where significantly rewired network modules serve as informative biomarkers.
Using cross-condition data and a block-wise Lasso network model, DDN detects significant network rewiring together with a subnetwork of hub molecular entities. In DDN 3.0, we took imbalanced sample sizes into consideration, integrated several acceleration strategies to handle large datasets, and enhanced the network presentation with more informative displays, including a color-coded differential dependency network and a gradient heatmap. We applied DDN 3.0 to simulated and real data to detect critical changes in molecular network topology. The tool stands as a valuable blueprint for the development and validation of mechanistic disease models, aiding coherent interpretation of data, deepening our understanding of disease biology, and sparking new hypotheses ripe for subsequent validation and exploration. Looking ahead, we aim to expand the scope of tools like COT and DDN 3.0 across the vast realm of multi-omics data, including datasets from longitudinal studies and clinical trials, where data complexity scales to new heights. We believe these tools can facilitate a more nuanced and comprehensive understanding of disease development and progression. Furthermore, by integrating these methods with other advanced bioinformatics and machine learning tools, we aim to create a holistic pipeline for the seamless extraction of significant biomarkers and actionable insights from multi-omics data, a promising step toward precision medicine, where individual genomic information can guide personalized treatment strategies.
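The COT test statistic described above, a cosine similarity between a molecule's cross-subtype profile and the ideal one-hot marker pattern, reduces to a few lines. The function name and return shape below are illustrative, not the published package's API:

```python
import numpy as np

def cot_score(profile):
    """Hedged sketch of the COT idea: score a gene by the cosine similarity
    between its expression across K subtypes and the ideal one-hot marker
    pattern for its best subtype. Illustration, not the package API."""
    v = np.asarray(profile, dtype=float)
    k = int(np.argmax(v))                  # candidate subtype for this gene
    e = np.zeros_like(v)
    e[k] = 1.0                             # ideal marker: up in one subtype only
    cos = float(v @ e / (np.linalg.norm(v) * np.linalg.norm(e)))
    return k, cos                          # cos == 1 only for a perfect marker

# A perfect subtype-2 marker scores 1.0; a flat gene scores 1/sqrt(K).
print(cot_score([0.0, 0.0, 9.0, 0.0]))    # -> (2, 1.0)
print(cot_score([1.0, 1.0, 1.0, 1.0]))    # -> (0, 0.5)
```

This makes the appeal of the statistic concrete: it depends only on the shape of the profile (how exclusively one subtype expresses the gene), not on its magnitude, which is what the "exact definition of a marker gene" requires.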
- Multi-omics Data Integration for Identifying Disease Specific Biological PathwaysLu, Yingzhou (Virginia Tech, 2018-06-05)Pathway analysis is an important task for gaining novel insights into the molecular architecture of many complex diseases. With the advancement of new sequencing technologies, a large amount of quantitative gene expression data has been continuously acquired, and emerging omics data sets such as proteomics have facilitated the investigation of disease-relevant pathways. Although much work has previously been done to explore single-omics data, little work has been reported using multi-omics data integration, mainly due to methodological and technological limitations. While single-omics data can provide useful information about the underlying biological processes, multi-omics data integration can be much more comprehensive about the cause-effect processes responsible for diseases and their subtypes. This project investigates the combination of miRNA-seq, proteomics, and RNA-seq data on seven types of muscular dystrophy and a control group. These unique multi-omics data sets provide the opportunity to identify disease-specific and maximally relevant biological pathways. We first perform the t-test and the OVEPUG test separately to define the differentially expressed genes in the protein and mRNA data sets. In multi-omics data sets, miRNA also plays a significant role in muscle development by regulating its target genes in the mRNA dataset. To exploit the relationship between miRNA and gene expression, we consult the commonly used gene library TargetScan to collect all paired miRNA-mRNA and miRNA-protein co-expression pairs. Next, using statistical analysis such as Pearson's correlation coefficient or the t-test, we measure the biologically expected correlation of each gene with its upstream miRNAs and identify the miRNA-mRNA and miRNA-protein pairs showing negative correlation.
Furthermore, we identify and assess the most relevant disease-specific pathways by inputting the differentially expressed genes and negatively correlated genes into gene-set libraries, and we further characterize these prioritized marker subsets using IPA (Ingenuity Pathway Analysis) or KEGG. We then use Fisher's method to combine the p-values derived from the separate gene sets into a joint significance test assessing common pathway relevance. In conclusion, we find all negatively correlated miRNA-mRNA and miRNA-protein pairs and identify several pathophysiological pathways related to muscular dystrophy through gene set enrichment analysis. This novel multi-omics data integration study and subsequent pathway identification shed new light on pathophysiological processes in muscular dystrophies, improve our understanding of the molecular pathophysiology of muscle disorders, and may ultimately help prevent and treat disease.
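The Fisher combination step mentioned above is standard: with k independent p-values, X = -2 * sum(ln p_i) follows a chi-square distribution with 2k degrees of freedom under the null. Since the degrees of freedom are always even here, the chi-square survival function has a closed form, so a sketch needs no external stats library:

```python
import math

def fisher_combine(pvals):
    """Fisher's method for combining k independent p-values.

    X = -2 * sum(ln p_i) ~ chi-square with 2k degrees of freedom under the
    null. For even df = 2k the survival function has the closed form
    sf(x) = exp(-x/2) * sum_{i<k} (x/2)^i / i!, used directly below.
    """
    k = len(pvals)
    x = -2.0 * sum(math.log(p) for p in pvals)
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i)
                                 for i in range(k))

print(round(fisher_combine([0.01, 0.02]), 4))  # two modest p-values combine
```

A sanity check on the formula: a single p-value passes through unchanged (fisher_combine([0.5]) is 0.5), and a set of p-values all equal to 1.0 combines to exactly 1.0.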
- Naturally Generated Decision Trees for Image ClassificationRavi, Sumved Reddy (Virginia Tech, 2021-08-31)Image classification has been a pivotal area of research in deep learning, with a vast body of literature working to tackle the problem and constantly striving for higher accuracies. This push for greater prediction accuracy, however, has further exacerbated the black-box phenomenon inherent in neural networks, and even more so in CNN-style deep architectures. Likewise, it has led to the development of highly tuned methods, suitable only for specific data sets and requiring significant work to adapt to new data. Although these models are capable of producing highly accurate predictions, we have little to no ability to understand the decision process a network takes to reach a conclusion. This poses a difficulty in use cases such as medical diagnostic tools or autonomous vehicles, which require insight into prediction reasoning to validate a conclusion or to debug a system. In essence, modern applications which utilize deep networks are able to learn to produce predictions, but lack interpretability and a deeper understanding of the data. Given this key point, we look to decision trees, opposite in nature to deep networks, with a high level of interpretability but a low capacity for learning. In our work we strive to merge these two techniques, maintaining the capacity for learning while providing insight into the decision process. More importantly, we look to expand the understanding of class relationships through a tree architecture. Our ultimate goal is a technique able to automatically create a visual-feature-based knowledge hierarchy for class relations, applicable broadly to any data set or combination thereof. We maintain these goals in an effort to move away from task-specific systems and instead toward artificial general intelligence (AGI).
AGI requires a deeper understanding over a broad range of information, and more so the ability to learn new information over time. In our work we embed networks of varying sizes and complexity within decision trees on a node level, where each node network is responsible for selecting the next branch path in the tree. Each leaf node represents a single class and all parent and ancestor nodes represent groups of classes. We designed the method such that classes are reasonably grouped by their visual features, where parent and ancestor nodes represent hidden super classes. Our work aims to introduce this method as a small step towards AGI, where class relations are understood through an automatically generated decision tree (representing a class hierarchy), capable of accurate image classification.
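The node-level embedding described above can be sketched abstractly: each internal node owns a routing model that picks the next branch, leaves name single classes, and internal nodes implicitly group classes into visual super classes. The toy routers below are stand-in callables, not the trained networks the work embeds:

```python
# Hypothetical sketch of the node-level idea from the abstract. Each
# internal node holds a small model that selects a child; a leaf names a
# single class; an internal node's subtree is a "super class" grouping.
class Node:
    def __init__(self, children=None, label=None, router=None):
        self.children = children or []   # sub-trees (empty at a leaf)
        self.label = label               # class name at a leaf
        self.router = router             # callable: image -> child index

def classify(node, image):
    """Walk from the root, letting each node's model choose a branch."""
    while node.children:
        node = node.children[node.router(image)]
    return node.label

# Toy hierarchy: the {cat, dog} subtree is a hidden "animal" super class.
leaf = lambda name: Node(label=name)
animals = Node([leaf("cat"), leaf("dog")], router=lambda im: im["ears"])
root = Node([animals, leaf("car")], router=lambda im: im["is_vehicle"])
print(classify(root, {"is_vehicle": 0, "ears": 1}))  # -> dog
```

The path taken through the tree is itself the explanation of the prediction, which is the interpretability benefit the work is after: every ancestor visited names a progressively narrower visual grouping.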
- On Reducing the Trusted Computing Base in Binary VerificationAn, Xiaoxin (Virginia Tech, 2022-06-15)The translation of binary code to higher-level models has wide applications, including decompilation, binary analysis, and binary rewriting. This calls for high reliability of the underlying trusted computing base (TCB) of the translation methodology. A key challenge is to reduce the TCB by validating its soundness. Both the definition of soundness and the validation method heavily depend on the context: what is in the TCB and how to prove it. This dissertation presents three research contributions. The first two contributions include reducing the TCB in binary verification, and the last contribution includes a binary verification process that leverages a reduced TCB. The first contribution targets the validation of OCaml-to-PVS translation -- commonly used to translate instruction-set-architecture (ISA) specifications to PVS -- where the destination language is non-executable. We present a methodology called OPEV to validate the translation between OCaml and PVS, supporting non-executable semantics. The validation includes generating large-scale tests for OCaml implementations, generating test lemmas for PVS, and generating proofs that automatically discharge these lemmas. OPEV incorporates an intermediate type system that captures a large subset of OCaml types, employing a variety of rules to generate test cases for each type. To prove the PVS lemmas, we develop automatic proof strategies and discharge the test lemmas using PVS Proof-Lite, a powerful proof scripting utility of the PVS verification system. We demonstrate our approach in two case studies that include 259 functions selected from the Sail and Lem libraries. For each function, we generate thousands of test lemmas, all of which are automatically discharged. 
The dissertation's second contribution targets the soundness validation of a disassembly process where the source language does not have well-defined semantics. Disassembly is a crucial step in binary security, reverse engineering, and binary verification. Various studies in these fields use disassembly tools and hypothesize that the reconstructed disassembly is correct. However, disassembly is an undecidable problem. State-of-the-art disassemblers suffer from issues ranging from incorrectly recovered instructions to incorrectly assessing which addresses belong to instructions and which to data. We present DSV, a systematic and automated approach to validate whether the output of a disassembler is sound with respect to the input binary. No source code, debugging information, or annotations are required. DSV defines soundness using a transition relation defined over concrete machine states: a binary is sound if, for all addresses in the binary that can be reached from the binary's entry point, the bytes of the (disassembled) instruction located at an address are the same as the actual bytes read from the binary. Since computing this transition relation is undecidable, DSV uses over-approximation by preventing false positives (i.e., the existence of an incorrectly disassembled reachable instruction but deemed unreachable) and allowing, but minimizing, false negatives. We apply DSV to 102 binaries of GNU Coreutils with eight different state-of-the-art disassemblers from academia and industry. DSV is able to find soundness issues in the output of all disassemblers. The dissertation's third contribution is WinCheck: a concolic model checker that detects memory-related properties of closed-source binaries. Bugs related to memory accesses are still a major issue for security vulnerabilities. Even a single buffer overflow or use-after-free in a large program may be the cause of a software crash, a data leak, or a hijacking of the control flow. 
Typical static formal verification tools aim to detect these issues at the source code level. WinCheck is a model-checker that is directly applicable to closed-source and stripped Windows executables. A key characteristic of WinCheck is that it performs its execution as symbolically as possible while leaving any information related to pointers concrete. This produces a model checker tailored to pointer-related properties, such as buffer overflows, use-after-free, null-pointer dereferences, and reading from uninitialized memory. The technique thus provides a novel trade-off between ease of use, accuracy, applicability, and scalability. We apply WinCheck to ten closed-source binaries available in a Windows 10 distribution, as well as the Windows version of the entire Coreutils library. We conclude that the approach taken is precise -- provides only a few false negatives -- but may not explore the entire state space due to unresolved indirect jumps.
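Returning to the dissertation's second contribution, the byte-level clause of DSV's soundness definition is easy to state in code. The sketch below illustrates the definition only; the map from address to instruction bytes and the function name are assumptions, not DSV's implementation:

```python
def instruction_sound(binary_bytes, addr, disasm):
    """One clause of the DSV soundness definition, sketched: the bytes of
    the disassembled instruction at a reachable address must equal the
    actual bytes read from the binary at that address. `disasm` maps
    address -> instruction bytes; names here are illustrative."""
    insn = disasm.get(addr)
    if insn is None:
        return False                     # reachable address missing from output
    return binary_bytes[addr:addr + len(insn)] == insn

# Tiny x86-64 fragment: push rbp; mov rbp,rsp; ret
binary = bytes.fromhex("554889e5c3")
disasm = {0: b"\x55", 1: b"\x48\x89\xe5", 4: b"\xc3"}
print(all(instruction_sound(binary, a, disasm) for a in disasm))  # -> True
```

The hard part of DSV, of course, is not this comparison but deciding which addresses are reachable from the entry point, which is the undecidable transition relation the abstract says must be over-approximated.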
- OneSwitch Data Center ArchitectureSehery, Wile Ali (Virginia Tech, 2018-04-13)In the last two decades, data center networks have evolved to become a key element in improving levels of productivity and competitiveness for different types of organizations. Traditionally, data center networks have been constructed with three layers of switches: Edge, Aggregation, and Core. Although this Three-Tier architecture has worked well in the past, it poses a number of challenges for current and future data centers. Data centers today have evolved to support dynamic resources such as virtual machines and storage volumes at any physical location within the data center, which has led to highly volatile and unpredictable traffic patterns. Also, the emergence of "Big Data" applications that exchange large volumes of information has created large persistent flows that need to coexist with other traffic flows. The Three-Tier architecture and current routing schemes are no longer sufficient for achieving high bandwidth utilization. Data center networks should be built in a way that adequately supports virtualization and cloud computing technologies, and should provide services such as simplified provisioning, workload mobility, dynamic routing and load balancing, and equidistant bandwidth and latency. As data center networks have evolved, the Three-Tier architecture has proven to be a challenge not only in terms of complexity and cost, but also in supporting many new data center applications. In this work we propose OneSwitch: a switch architecture for the data center. OneSwitch is backward compatible with current Ethernet standards and uses an OpenFlow central controller, a Location Database, a DHCP Server, and a Routing Service to build an Ethernet fabric that appears as one switch to end devices.
This allows the data center to use switches in scale-out topologies to support hosts in a plug and play manner as well as provide much needed services such as dynamic load balancing, intelligent routing, seamless mobility, equidistant bandwidth and latency.
- Optimization of an Emergency Response Vehicle's Intra-Link Movement in Urban Transportation Networks Utilizing a Connected Vehicle EnvironmentHannoun, Gaby Joe (Virginia Tech, 2019-07-31)Downstream vehicles detect an emergency response vehicle (ERV) through sirens and/or strobe lights. These traditional warning systems do not give any recommendation about how to react, leaving drivers confused and often adopting unsafe behavior while trying to open a passage for the ERV. In this research, an advanced intra-link emergency assistance system that leverages the emerging technologies of the connected vehicle environment is proposed. The proposed system assumes the presence of a centralized system that gathers/disseminates information from/to connected vehicles via vehicle-to-infrastructure (V2I) communications. The major contribution of this dissertation is the intra-link level support provided to the ERV as well as to non-ERVs. The proposed system provides network-wide assistance, as it also considers the routing of ERVs. The core of the system is a mathematical program - a set of equations and inequalities - that generates, based on location and speed data from connected vehicles downstream of the ERV, the fastest intra-link ERV movement. It specifies for each connected non-ERV a final assigned position that the vehicle can reach comfortably along the link. The system accommodates partial market penetration levels and is applicable on large transportation link segments with signalized intersections. The system consists of three modules: (1) an ERV route generation module, (2) a criticality analysis module, and (3) a sequential optimization module. The first module determines the ERV's route (set of links) from the ERV's origin to the desired destination in the network.
Based on this selected route, the criticality analysis module scans and filters the connected vehicles of interest and determines whether any of them should be provided with a warning/instruction message. As the ERV moves toward its destination, new non-ERVs are notified. When a group of non-ERVs is identified by the criticality analysis module, the sequential optimization module is activated. The proposed system is evaluated in simulation under different combinations of market penetration and congestion levels. At 100% market penetration, compared to the current practice where vehicles move to the nearest edge, the system reduces ERV travel time by an average of 9.09% and reduces vehicular interactions by an average of 35.46% for ERV/non-ERV interactions and 81.38% for non-ERV/non-ERV interactions.