Browsing by Author "Khaghani, Farnaz"
Now showing 1 - 5 of 5
- Collection Management Tweets Project Fall 2017
Khaghani, Farnaz; Zeng, Junkai; Bhuiyan, Momen; Tabassum, Anika; Bandyopadhyay, Payel (Virginia Tech, 2018-01-17)
The report included in this submission documents the work by the Collection Management Tweets (CMT) team, which is part of the larger effort in CS5604 on building a state-of-the-art information retrieval and analysis system for the IDEAL (Integrated Digital Event Archiving and Library) and GETAR (Global Event and Trend Archive Research) projects. The mission of the CMT team had two parts: 1) Cleaning 6.2 million tweets from two 2017 event collections named "Solar Eclipse" and "Las Vegas Shooting", and loading them into HBase, an open source, non-relational, distributed database that runs on the Hadoop distributed file system, in support of further use; and 2) Building and storing a social network for the tweet data using a triple-store. For the first part, our work included: A) Making use of the work done by the previous year's class group, where incremental update was done, to speed up development of data collection and storage; B) Improving the performance of work done by the group from last year. Previously, the cleaning steps (e.g., removing profanity and extracting hashtags and mentions) were implemented in Python, which becomes very slow as the dataset scales up. We introduced parallelization in our tweet cleaning process with the help of Scala and the Hadoop cluster, and made use of different Natural Language Processing libraries for stop word and profanity removal; C) Along with tweet cleaning, we also identified Named-Entity-Recognition (NER) entries and Part-of-speech (POS) tags and stored them with the tweets, which the previous team had not done. The cleaned data in HBase from this task is provided to the Classification team for spam detection and to the Clustering and Topic Analysis team for topic analysis.
The Collection Management Webpage team uses the URLs extracted from the tweets for further processing. Finally, after the data is indexed by the SOLR team, the Front-End team visualizes the tweets for users, and provides access for searching and browsing. In addition to the aforementioned tasks, our responsibilities also included building a network of tweets. This entailed doing research into the types of databases that are appropriate for this graph. For storing the network, we used a triple-store database to record different types of edges and relationships in the graph. We also researched methods for ascribing importance to nodes and edges in our social networks once they were constructed, and analyzed our networks using these techniques.
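
The cleaning step described in this abstract (extracting hashtags, mentions, and URLs, and removing profanity) can be illustrated with a minimal single-machine sketch. This is not the team's Scala/Hadoop pipeline: the function name `clean_tweet`, the regular expressions, and the caller-supplied profanity set are all assumptions made for illustration.

```python
import re

# Hypothetical patterns; the report's actual tokenization rules are not given.
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")
URL = re.compile(r"https?://\S+")


def clean_tweet(text, profanity=frozenset()):
    """Split a raw tweet into cleaned text plus extracted entities."""
    urls = URL.findall(text)
    no_urls = URL.sub(" ", text)          # drop URLs before matching # and @
    hashtags = HASHTAG.findall(no_urls)
    mentions = MENTION.findall(no_urls)
    stripped = MENTION.sub(" ", HASHTAG.sub(" ", no_urls))
    # Keep alphabetic tokens that are not in the (assumed) profanity list.
    words = [w for w in re.findall(r"[A-Za-z']+", stripped)
             if w.lower() not in profanity]
    return {
        "clean_text": " ".join(words),
        "hashtags": hashtags,
        "mentions": mentions,
        "urls": urls,
    }
```

Because each tweet is processed independently, a function of this shape maps naturally onto a parallel job, which is the property the abstract exploits on the Hadoop cluster.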
- CS4984/CS5984: Big Data Text Summarization Team 17 ETDs
Khaghani, Farnaz; Marin Thomas, Ashin; Patnayak, Chinmaya; Sharma, Dhruv; Aromando, John (Virginia Tech, 2018-12-15)
Given the current explosion of information over various media such as electronic and physical texts, concise and relevant data has become key to understanding. Summarization, which essentially is the process of reducing the text to convey only the salient aspects, has emerged as a challenging task in the field of Natural Language Processing. In a scientific context, academia has been generating voluminous amounts of data in the form of theses and dissertations. Obtaining the chapter-wise summary of an electronic thesis or dissertation can be a computationally expensive task, particularly because of its length and the subject to which it pertains. In this course, various summarization techniques, primarily extractive and abstractive summarization, were researched and analyzed. There have been various developments in the field of deep learning to tackle problems related to summarization and produce coherent and meaningful summaries for news articles. In this project, tools that could be used to generate coherent and concise summaries of long electronic theses and dissertations (ETDs) were developed as well. The major concern initially was to get the text from a PDF file of an ETD. GROBID and Scienceparse were used as pre-processing tools to carry out this task; they present the text from a PDF in a structured format such as an XML or JSON file. The outputs from each of the tools were compared qualitatively as well as quantitatively. After this, a transfer learning approach was adopted, wherein a pre-trained model was tweaked to fit the task of summarizing each ETD. Making the model learn the nuances of an ETD proved to be a challenge.
An iterative approach was used to explore various networks, each trying to address the shortcomings of the previous one in its own way. Existing deep learning models including Sequence-2-Sequence, Pointer Generator Networks, and A Hybrid Extractive-Abstractive Reinforce-Selecting Sentence Rewriting Network, were used to generate and test summaries. Further tweaks were made to these deep neural networks to account for much longer and more varied datasets, in this case ETDs, than they were originally designed for. A thorough evaluation of these generated summaries was also done with respect to gold standards for five dissertations and theses created during the span of the course. ROUGE-1, ROUGE-2, and ROUGE-SU4 were used to compare the generated summaries with the gold standards. The average ROUGE scores were 0.1387, 0.1224, and 0.0480 respectively. These low ROUGE scores could be attributed to the varying summary length, and also to the complexity of the task of summarizing an ETD. The scope of improvements and the underlying reasons for the performance have also been analyzed. The conclusion that can be drawn from the project is that any machine learning task is highly biased by the patterns inherently present in the data on which it is being trained. In the context of summarization, there can be different perspectives from which an article can be summarized, and thus the quantitative evaluation measures can vary drastically even when the summary is coherent.
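
To make the ROUGE-1 and ROUGE-2 figures above concrete, the following is a minimal sketch of recall-oriented ROUGE-N computed from clipped n-gram overlap. It assumes whitespace tokenization and lowercasing, and it reports recall only; the abstract does not say which ROUGE variant (recall, precision, or F-score) or toolkit the team used, and ROUGE-SU4 is not covered here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate.

    Counts are clipped, so a repeated n-gram is matched at most as many
    times as it occurs in the candidate.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total
```

For example, with the reference "the cat sat on the mat" and the candidate "the cat lay on the mat", five of the six reference unigrams and three of the five reference bigrams are matched. A fluent summary that paraphrases the gold standard shares few n-grams with it, which is one way a coherent summary can still score low.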
- A Deep Learning Approach to Predict Accident Occurrence Based on Traffic Dynamics
Khaghani, Farnaz (Virginia Tech, 2020-05)
Traffic accidents are of concern for traffic safety; 1.25 million deaths are reported each year. Hence, it is crucial to have access to real-time data and rapidly detect or predict accidents. Predicting the occurrence of a highway car accident accurately any significant length of time into the future is not feasible since the vast majority of crashes occur due to unpredictable human negligence and/or error. However, rapid traffic incident detection could reduce incident-related congestion and secondary crashes, alleviate the waste of vehicles’ fuel and passengers’ time, and provide appropriate information for emergency response and field operation. While the focus of most previously proposed techniques is predicting the number of accidents in a certain region, the problem of predicting accident occurrence or rapidly detecting an accident has been little studied. To address this gap, we propose a deep learning approach and build a deep neural network model based on long short-term memory (LSTM). We apply it to forecast the expected speed values on freeway links and identify the anomalies as potential accident occurrences. Several detailed features such as weather, traffic speed, and traffic flow of upstream and downstream points are extracted from big datasets. We assess the proposed approach on a traffic dataset from Sacramento, California. The experimental results demonstrate the potential of the proposed approach in identifying the anomalies in speed value and matching them with accidents in the same area. We show that this approach can handle a high rate of rapid accident detection and be implemented in real-time travelers’ information or emergency management systems.
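
The forecast-then-flag idea in this abstract can be sketched without the LSTM itself: given forecast speeds from any model, observations that fall far below the forecast are marked as potential incidents. The threshold rule below (mean residual minus `k` standard deviations, estimated on incident-free calibration data) and the function names are assumptions for illustration; the thesis abstract does not specify its anomaly criterion.

```python
from statistics import mean, stdev


def residual_threshold(calib_observed, calib_predicted, k=3.0):
    """Lower bound on (observed - predicted) under normal traffic.

    Estimated from calibration data assumed to contain no incidents;
    requires at least two calibration points.
    """
    residuals = [o - p for o, p in zip(calib_observed, calib_predicted)]
    return mean(residuals) - k * stdev(residuals)


def flag_speed_drops(observed, predicted, threshold):
    """Indices where the observed speed is anomalously below the forecast."""
    return [i for i, (o, p) in enumerate(zip(observed, predicted))
            if o - p < threshold]
```

Estimating the threshold on separate calibration data keeps a single large speed drop from inflating the standard deviation and thereby hiding itself.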
- mD-Resilience: A Multi-Dimensional Approach for Resilience-Based Performance Assessment in Urban Transportation
Khaghani, Farnaz; Jazizadeh, Farrokh (MDPI, 2020-06-15)
As demonstrated for extreme events, the resilience concept is used to evaluate the ability of a transportation system to resist and recover from disturbances. Motivated by the high cumulative impact of recurrent perturbations on transportation systems, we have investigated resilience quantification as a performance assessment method for high-probability low-impact (HPLI) disturbances such as traffic congestions. Resilience-based metrics are supplementary to conventional travel-time-based indices in literature. However, resilience is commonly quantified as a scalar variable despite its multi-dimensional nature. Accordingly, by hypothesizing increased information gain in performance assessment, we have investigated a multi-dimensional approach (mD-Resilience) for resilience quantification. Examining roadways’ resilience to recurrent congestions as a contributor to sustainable mobility, we proposed to measure resilience with several attributes that characterize the degradation stage, the recovery stage, and possible recovery paths. These attributes were integrated into a performance index by using Data Envelopment Analysis (DEA) as a non-parametric method. We demonstrated the increased information gain by quantifying the performance of major freeways in Los Angeles, California using Performance Measurement System (PeMS) data. The comparison of mD-Resilience approach with the method based on area under resilience curves showed its potential in distinguishing the severity of congestions. Furthermore, we showed that mD-Resilience also characterizes performance from the lens of delay and bottleneck severities.
- Resilience-based Operational Analytics of Transportation Infrastructure: A Data-driven Approach for Smart Cities
Khaghani, Farnaz (Virginia Tech, 2020-07-01)
Studying recurrent mobility perturbations, such as traffic congestions, is a major concern of engineers, planners, and authorities as they not only bring about delay and inconvenience but also have consequent negative impacts like greenhouse gas emission, increase in fuel consumption, or safety issues. In this dissertation, we proposed using the resilience concept, which has been commonly used for assessing the impact of extreme events and disturbances on the transportation system, for high-probability low-impact (HPLI) events to (a) provide a performance assessment framework for transportation systems' response to traffic congestions, (b) investigate the role of transit modes in the resilience of urban roadways to congestion, and (c) study the impact of network topology on the resilience of roadways' functional performance. We proposed a multi-dimensional approach to characterize the resilience of urban transportation roadways for recurrent congestions. The resilience concept could provide an effective benchmark for comparative performance and identifying the behavior of the system in the discharging process in congestion. To this end, we used a Data Envelopment Analysis (DEA) approach to integrate multiple resilience-oriented attributes to estimate the efficiency (resilience) of the frontier in roadways. Our results from an empirical study on California highways through the PeMS data have shown the potential of the multi-dimensional approach in increasing information gain and differentiating between the severity of congestion across a transportation network.
Leveraging this resilience-based characterization of recurrent disruptions, in the second study, we investigated the role of multi-modal resourcefulness of urban transportation systems, in terms of diversity and equity, on the resilience of roadways to daily congestion. We looked at the physical infrastructure availability and distribution (i.e., diversity) and accessibility and coverage to capture socio-economic factors (i.e., equity) to more comprehensively understand the role of resourcefulness in resilience. We conducted this investigation by using a GPS dataset of taxi trips in the Washington DC metropolitan area in 2017. Our results demonstrated the strong correlation of trips' resilience with transportation equity and to a lesser extent with transportation diversity. Furthermore, we learned that the impact of equity and diversity can mostly be seen at the recovery stage of resilience. In the third study, we looked at another aspect of transportation supply in urban areas: spatial configuration and topology. The goal of this study was to investigate the role of network topology and configuration on resilience to congestion. We used OSMnx, a toolkit for street network analysis based on the data from OpenStreetMap, to model and analyze the urban roadways network configurations. We further employed a multidimensional visualization strategy using radar charts to compare the topology of street networks on a single graphic. Leveraging the geometric descriptors of radar charts, we used the compactness and Jaccard Index to quantitatively compare the topology profiles. We used the same taxi trips dataset as in the second study to characterize resilience and identify the correlation with network topology. The results indicated a strong correlation between resilience and betweenness centrality, diameter, and PageRank among other features of a transportation network.
We further looked at the capacity of roadways as a common cause for the strong correlation between network features and resilience. We found that the strong correlation of link-related features such as diameter with resilience could be due to their role in capacity, which would make capacity a common cause of both.
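
The geometric descriptors of radar charts mentioned in this abstract can be illustrated with a small sketch that turns a topology profile (one value per equally spaced axis) into a polygon and measures its area and compactness. The compactness definition used here, the isoperimetric quotient 4πA/P², is an assumption; the dissertation may define compactness differently, and the Jaccard Index between two polygons is not implemented.

```python
import math


def radar_polygon(values):
    """Vertices of a radar-chart polygon with equally spaced axes."""
    n = len(values)
    return [(r * math.cos(2 * math.pi * i / n),
             r * math.sin(2 * math.pi * i / n))
            for i, r in enumerate(values)]


def polygon_area(points):
    """Area of a simple polygon via the shoelace formula."""
    pairs = zip(points, points[1:] + points[:1])
    return 0.5 * abs(sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in pairs))


def polygon_perimeter(points):
    """Sum of edge lengths, closing the polygon back to its first vertex."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:] + points[:1]))


def compactness(values):
    """Isoperimetric quotient 4*pi*A / P**2 (assumed definition); 1.0 for a circle."""
    pts = radar_polygon(values)
    area, perimeter = polygon_area(pts), polygon_perimeter(pts)
    return 4 * math.pi * area / perimeter ** 2 if perimeter > 0 else 0.0
```

Under this definition a balanced profile such as `[1, 1, 1, 1]` yields a more compact polygon than a spiky one such as `[1, 0.1, 1, 0.1]`, so the descriptor summarizes how evenly a network scores across the chosen features.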