Browsing by Author "Ma, Yufeng"
Now showing 1 - 5 of 5
Results Per Page
Sort Options
- Collection Management for IDEALMa, Yufeng; Nan, Dong (2016-05-04)The collection management portion of the information retrieval system has three major tasks. The first task is to perform incremental update of the new data flow from the tweet MySQL database to HDFS and then to HBase. Secondly, for the raw tweets coming into HBase, we are supposed to clean them. Duplicated URLs should be discarded. Also important is to conduct noise reduction. Finally, for the cleaned tweets and webpages, we should do Named Entity Recognition (NER), from which we extract out the information like person, organization, and location names. First, based on existing data flow from the tweet MySQL database to HBase in the IDEAL system, we developed a Sqoop script to import new tweets from MySQL to HDFS. Then another Pig script is run to transfer them into HBase. Afterwards, for raw tweets in HBase, we run a noise reduction module to remove non-ASCII characters, extract hashtags, mentions and URLs from tweet text. Similar procedures were also performed for raw webpage records provided by the GRAs for this project. All the cleaned data for the 6 small collections have been uploaded into HBase with pre-defined schemas documented in this report. Then all the other teams like classification and clustering can consume our cleaned data. Besides what has been done so far, it is desirable to do NER, which tries to extract structured information such as person, organization and location from unstructured text. But due to time limitations, this must be relegated to future work. Also needed is automating the webpage crawling and cleaning processes, which are essential after incremental update. That would expand URLs extracted from tweets in HBase first, and then crawl the corresponding webpages after invalid URL removal. Finally, extracted useful information in webpages would be stored into HBase.
- ETDseer Concept PaperMa, Yufeng; Jiang, Tingting; Shrestha, Chandani (Virginia Tech, 2017-05-03)ETDSeer (electronic thesis and dissertation digital library connected with SeerSuite) will build on 15 years of collaboration between teams at Virginia Tech (VT) and Penn State University (PSU), since both have been leaders in the worldwide digital library (DL) community. VT helped launch the national and international efforts for ETDs more than 20 years ago, which have been led by the Networked Digital Library of Theses and Dissertations (NDLTD, directed by PI Fox); its Union Catalog has increased to 5 million records. PSU hosts CiteSeerX, which co-PI Giles launched almost 20 years ago, and which is connected with a wide variety of research results under the SeerSuite family. ETDs, typically in PDF, are a largely untapped international resource. Digital libraries with advanced services can effectively address the broad needs to discover and utilize ETDs of interest. Our research will leverage SeerSuite methods that have been applied mostly to short documents, plus a variety of exploratory studies at VT, and will yield a “web of graduate research”, rich knowledge bases, and a digital library with effective interfaces. References will be analyzed and converted to canonical forms, figures and tables will be recognized and re-represented for flexible searching, small sections (acknowledgments, biographical sketches) will be mined, and aids for researchers will be built especially from literature reviews and discussions of future work. Entity recognition and disambiguation will facilitate flexible use of a large graph of linked open data.
- Going Deeper with Images and Natural LanguageMa, Yufeng (Virginia Tech, 2019-03-29)One aim in the area of artificial intelligence (AI) is to develop a smart agent with high intelligence that is able to perceive and understand the complex visual environment around us. More ambitiously, it should be able to interact with us about its surroundings in natural languages. Thanks to the progress made in deep learning, we've seen huge breakthroughs towards this goal over the last few years. The developments have been extremely rapid in visual recognition, in which machines now can categorize images into multiple classes, and detect various objects within an image, with an ability that is competitive with or even surpasses that of humans. Meanwhile, we also have witnessed similar strides in natural language processing (NLP). It is quite often for us to see that now computers are able to almost perfectly do text classification, machine translation, etc. However, despite much inspiring progress, most of the achievements made are still within one domain, not handling inter-domain situations. The interaction between the visual and textual areas is still quite limited, although there has been progress in image captioning, visual question answering, etc. In this dissertation, we design models and algorithms that enable us to build in-depth connections between images and natural languages, which help us to better understand their inner structures. In particular, first we study how to make machines generate image descriptions that are indistinguishable from ones expressed by humans, which as a result also achieved better quantitative evaluation performance. Second, we devise a novel algorithm for measuring review congruence, which takes an image and review text as input and quantifies the relevance of each sentence to the image. The whole model is trained without any supervised ground truth labels. Finally, we propose a brand new AI task called Image Aspect Mining, to detect visual aspects in images and identify aspect level rating within the review context. On the theoretical side, this research contributes to multiple research areas in Computer Vision (CV), Natural Language Processing (NLP), interactions between CVandNLP, and Deep Learning. Regarding impact, these techniques will benefit related users such as the visually impaired, customers reading reviews, merchants, and AI researchers in general.
- A Preliminary Analysis of Images in Online Hotel ReviewsZach, Florian J.; Ma, Yufeng; Fox, Edward A. (2019)Following the idiom "A picture is worth a thousand words", recent tourism and hospitality research has adopted deep learning techniques to better understand the content and effect of photographs. This exploratory study delves into image content, to learn which sets of images relate to positive and negative reviews. Several image classes appear in both positive and negative reviews. However, the distribution of images across classes differs. Of interest were images that did not represent core hotel services, which suggests that users review the surrounding area, attractions, and activities as well as the hotel property. This finding is relevant for managers to learn which areas (to engage with) could improve guest experiences.
- A Surrogate-based Generic Classifier for Chinese TV Series ReviewsMa, Yufeng; Xia, Long; Shen, Wenqi; Zhou, Mi; Fan, Weiguo (2016-11-21)With the emerging of various online video platforms like Youtube, Youku and LeTV, online TV series' reviews become more and more important both for viewers and producers. Customers rely heavily on these reviews before selecting TV series, while producers use them to improve the quality. As a result, automatically classifying reviews according to different requirements evolves as a popular research topic and is essential in our daily life. In this paper, we focused on reviews of hot TV series in China and successfully trained generic classifiers based on eight predefined categories. The experimental results showed promising performance and effectiveness of its generalization to different TV series.