Browsing by Author "Giles, C. Lee"
- Discipline-Independent Text Information Extraction from Heterogeneous Styled References Using Knowledge from the Web
  Park, Sung Hee (Virginia Tech, 2013-07-11)
  In education and research, references play a key role. They give credit to prior works, and provide support for reviews, discussions, and arguments. The set of references attached to a publication can help describe that publication, can aid with its categorization and retrieval, can support bibliometric studies, and can guide interested readers and researchers. If suitably analyzed, that set can aid with the analysis of the publication itself, especially regarding all its citing passages. However, extracting and parsing references are difficult problems. One concern is that there are many styles of references, and identifying what style was employed is problematic, especially in heterogeneous collections of theses and dissertations, which cover many fields and disciplines, and where different styles may be used even in the same publication. We address these problems by drawing upon suitable knowledge found in the WWW. In particular, we use appropriate lists (e.g., of names, cities, and other types of entities). We use available information about the many reference styles found, in a type of reverse engineering. We use available references to guide machine learning. In particular, we research a two-stage classifier approach, with multi-class classification with respect to reference styles, and partially solve the problem of parsing surface representations of references. We describe empirical evidence for the effectiveness of our approach and plans for improvement of our method.
- A Study of Computational Reproducibility using URLs Linking to Open Access Datasets and Software
  Salsabil, Lamia; Wu, Jian; Choudhury, Muntabir; Ingram, William A.; Fox, Edward A.; Rajtmajer, Sarah; Giles, C. Lee (ACM, 2022-04-25)
  Datasets and software packages are considered important resources that can be used for replicating computational experiments. With the advocacy of Open Science and the growing interest of investigating reproducibility of scientific claims, including URLs linking to publicly available datasets and software packages has become an institutionalized part of research publications. In this preliminary study, we investigated the disciplinary dependency and chronological trends of including open access datasets and software (OADS) in electronic theses and dissertations (ETDs), based on a hybrid classifier called OADSClassifier, consisting of a heuristic and a supervised learning model. The classifier achieves the best F1 of 0.92. We found that the inclusion of OADS-URLs exhibited a strong disciplinary dependence and the fraction of ETDs containing OADS-URLs has been gradually increasing over the past 20 years. We developed and share a ground truth corpus consisting of 500 manually labeled sentences containing URLs from scientific papers. The dataset and source code are available at https://github.com/lamps-lab/oadsclassifier.