Browsing by Author "Zhang, Shuaicheng"
- CS 5604: Information Storage and Retrieval - Webpages (WP) Team
  Barry-Straume, Jostein; Vives, Cristian; Fan, Wentao; Tan, Peng; Zhang, Shuaicheng; Hu, Yang; Wilson, Tishauna (Virginia Tech, 2020-12-18)

  The first major goal of this project is to build a state-of-the-art information retrieval engine for searching webpages and for opening up access to existing and new webpage collections resulting from Digital Library Research Laboratory (DLRL) projects relating to eventsarchive.org. The task of the Webpage (WP) team was to make any archived webpage accessible and indexed. Webpages can be obtained either through event-focused crawlers or from collections of data, such as WARC files containing webpages or sets of tweets that contain embedded URLs. Toward completion of the project, the WP team worked on four major tasks: 1) making the contents of WARC files searchable through ElasticSearch; 2) cleaning the contents of WARC files and making them searchable through ElasticSearch; 3) running an event-focused crawler to produce WARC files; and 4) making additional extracted/derived information (e.g., dates, classes) searchable.

  The foundation of the software is a Docker container cluster employing Airflow, a Reasoner, and Kubernetes. The raw information content of the given webpage collections is stored using the Network File System (NFS), while Ceph provides persistent storage for the Docker containers. Retrieval and analysis of the webpage collections are carried out with ElasticSearch, and visualization with Kibana; together they form an Elastic Stack application with which the WP team indexes, maps, and stores the processed data and model outputs for the webpage collections.

  The software was co-designed by seven Virginia Tech graduate students, all members of the same computer science class, CS 5604: Information Storage and Retrieval, taught by Professor Edward A. Fox. Dr. Fox structures the class so that students work in a "mock" business development setting; for all intents and purposes, the academic project submitted by the WP team can be viewed as a microcosm of software development within a corporate structure. This submission focuses on the work of the WP team, which creates and administers Docker containers so that the various services are tested and deployed as a whole. These services pertain solely to the ingestion, cleansing, analysis, extraction, classification, and indexing of webpages and their content.
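As a rough illustration of the ingestion step this abstract describes, the sketch below reads response records from a WARC file and indexes their payloads into ElasticSearch. It assumes the `warcio` and `elasticsearch` Python packages and a locally running ElasticSearch instance; the index name and document fields are illustrative, not the WP team's actual schema.

```python
# Minimal sketch: index the text of WARC response records into ElasticSearch.
# Assumes the `warcio` and `elasticsearch` packages and a local ES instance;
# the index name and field names are illustrative, not the WP team's schema.
from warcio.archiveiterator import ArchiveIterator
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def index_warc(path: str, index: str = "webpages") -> None:
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue  # skip request/metadata records
            doc = {
                "url": record.rec_headers.get_header("WARC-Target-URI"),
                "date": record.rec_headers.get_header("WARC-Date"),
                # raw payload; the team's cleaning task would strip HTML boilerplate here
                "content": record.content_stream().read().decode("utf-8", errors="replace"),
            }
            es.index(index=index, document=doc)

index_warc("collection.warc.gz")
```

A real pipeline would run this as an Airflow task inside the Docker cluster; the sketch only shows the WARC-to-index data path.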
- Product Defect Mining
  Villaflor, Elizabeth M.; Golden, Grant D.; Hall, Jack W. W.; Nguyen, Thomas; Peng, Tianchen; Zhang, Shuaicheng (Virginia Tech, 2017-05-01)

  This project focuses on customer reviews of various product defects. The goal is to train machine learning algorithms on sets of these customer reviews so that the different defect entities within an unseen review can be identified automatically. Identifying these entities will benefit customers, product manufacturers, and governments by shedding light on the most common defects for a given product, as well as common defects across a class of products. It will also bring to light common resolutions for defect symptoms, both correct and incorrect. The project additionally aims to contribute to the opinion mining research community.

  These goals are pursued in three main parts: data collection, data labeling, and classifier training. In the data collection phase, a web crawler pulls customer reviews from forum sites to create new datasets. For data labeling, datasets, both pre-existing and newly created, are split into sentences, and each sentence is assigned a defect entity based on its content; for example, a sentence describing a product defect is labeled as a symptom. Finally, in the classifier training portion, machine learning algorithms classify unlabeled datasets in order to learn which kinds of words indicate a given defect entity. Beyond these three main aspects, other minor phases are necessary, such as designing the database tables used to store the labeled datasets.

  Throughout the semester the following was accomplished: the creation of a web crawler, the completion of five new datasets, the labeling of five datasets, and preliminary training results based on the linear SVC algorithm. The new datasets and labeled datasets were also uploaded into the client's preexisting database. The new datasets were collected from the Apple Community, Samsung, and Dell forum boards and include product defect reports for both hardware and software products. Based on the labeling results and quick scans of the collected data, many defect reports were found to contain contextual information not directly related to the description of either a product defect or its corresponding solution. Additionally, many reports either do not include resolutions or describe resolutions that did not actually solve the defect. The linear SVC classifier accurately predicted the label for a sentence about 80% of the time when training and testing occurred on similar products (e.g., two different car models), but only about 60% at best when used on two completely different products (e.g., cars vs. cellphones). Overall, about 75% of the anticipated work was completed this semester; the completed work should provide a good foundation for continued work in the future.
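The classifier-training phase described above maps naturally onto a TF-IDF plus linear SVC pipeline. A minimal sketch, assuming scikit-learn; the inline sentences and the symptom/resolution/context label names are illustrative stand-ins for the project's labeled datasets.

```python
# Minimal sketch of the sentence-level defect-entity classifier described above,
# assuming scikit-learn; the tiny inline dataset and label names are illustrative.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

sentences = [
    "The battery drains completely within two hours.",  # describes a defect
    "Replacing the charging cable fixed the problem.",  # describes a fix
    "I bought this phone last March.",                  # background context
]
labels = ["symptom", "resolution", "context"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(sentences, labels)

print(clf.predict(["The screen flickers whenever the device gets warm."]))
```

The reported accuracy gap between similar and dissimilar products is consistent with such a lexical model: TF-IDF features learned on car-forum vocabulary transfer poorly to cellphone reviews.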
- TGEditor: Task-Guided Graph Editing for Augmenting Temporal Financial Transaction Networks
  Zhang, Shuaicheng; Zhu, Yada; Zhou, Dawei (ACM, 2023-11-27)

  Recent years have witnessed growing research interest in designing powerful graph mining algorithms to discover and characterize structural patterns of interest in financial transaction networks, motivated by impactful applications including anti-money laundering, identity protection, product promotion, and service promotion. However, state-of-the-art graph mining algorithms often suffer from high generalization errors due to data sparsity, data noisiness, and data dynamics. In the context of mining information from financial transaction networks, these issues become particularly acute, and ensuring accuracy and robustness in such evolving systems is of paramount importance.

  Motivated by these challenges, we propose a fundamental transition from traditional mining to augmentation in the context of financial transaction networks. To navigate this paradigm shift, we introduce TGEditor, a versatile task-guided temporal graph augmentation framework crafted to preserve the temporal and topological distribution of input financial transaction networks while leveraging the label information from pertinent downstream tasks, denoted as T, including crucial tasks such as fraudulent transaction classification. In particular, to efficiently conduct task-specific augmentation, we propose two network editing operators that can be seamlessly optimized via adversarial training while capturing the dynamics of the data: the Add operator aims to recover temporal links missing due to data sparsity, and the Prune operator removes irrelevant or noisy temporal links arising from data noisiness. Extensive results on financial transaction networks demonstrate that TGEditor 1) well preserves the data distribution of the original graph and 2) notably boosts the performance of prediction models on the tasks of vertex classification and fraudulent transaction detection.
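In TGEditor the Add and Prune operators are learned via adversarial training, which cannot be reproduced in a few lines; the toy sketch below only illustrates the shape of such edit operators on a timestamped edge list, with a hypothetical placeholder scorer standing in for the learned model.

```python
# Toy illustration of temporal-graph edit operators in the spirit of TGEditor's
# Add and Prune steps. In the paper these operators are optimized adversarially;
# here the scoring function is a hypothetical placeholder, not the actual method.
from typing import Callable

Edge = tuple[int, int, float]  # (source, target, timestamp)

def add_edges(edges: list[Edge], candidates: list[Edge],
              score: Callable[[Edge], float], threshold: float) -> list[Edge]:
    """Recover plausible missing temporal links (combats data sparsity)."""
    return edges + [e for e in candidates if score(e) >= threshold]

def prune_edges(edges: list[Edge],
                score: Callable[[Edge], float], threshold: float) -> list[Edge]:
    """Drop links judged irrelevant or noisy (combats data noisiness)."""
    return [e for e in edges if score(e) >= threshold]

# Placeholder scorer: in TGEditor this role is played by a learned, task-guided model.
recency_score = lambda e: e[2]

graph = [(0, 1, 0.9), (1, 2, 0.2), (2, 3, 0.8)]
graph = prune_edges(graph, recency_score, threshold=0.5)
graph = add_edges(graph, [(0, 3, 0.95)], recency_score, threshold=0.5)
print(graph)  # edited edge list after one Prune and one Add pass
```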