Regression analysis of extended vectors to obtain coefficients for use in probabilistic information retrieval systems

Virginia Polytechnic Institute and State University

Previous work by Fox extended the vector space model of information retrieval, and its implementation in the SMART system, so that different types of information about documents can be handled separately as multiple subvectors, one for each concept type. We hypothesized that the relevance of a document could be best predicted if proper coefficients were obtained to reflect the importance of the query-document similarity for each subvector when computing an overall similarity value. Two research collections, CACM and ISI, each split into halves, were used to generate data for the regression studies that produced the coefficients. For the CACM1 collection, most of the variance in relevance could be accounted for by only four of the subvectors (authors, Computing Review descriptors, links, and terms). For the ISI1 collection, two of the subvectors (terms and cocitations) accounted for most of the variance. Log-transformed data and samples of the records gave the best R-squared (RSQ) values; the highest RSQ was .6654 (binary relevance). Using ranked rather than binary relevance did not improve the regression model. The coefficients from the regression runs were then used in subsequent feedback runs in SMART, where they proved to be of limited usefulness: improvements in precision were in the 1-5% range. Although log-transformed data and samples of the records gave the best RSQ values, coefficients from log values of all the data improved precision the most. The findings of this study support the previous work of Fox showing that additional information improves retrieval. Regression coefficients improved precision slightly when used as subvector weights. Log transforming the data values for the concept types modestly helped both the regression analyses and the retrieval in SMART.
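
The general idea of regressing relevance on per-subvector similarities can be illustrated with a minimal Python sketch. It is not the procedure or software used in the study. The toy similarity values, the two-subvector layout (terms and cocitations), the function names, and the choice of log(1 + s) as the log transform are all assumptions made for illustration; the abstract does not state the exact transform or regression setup.

```python
import math

def log_transform(row):
    # Assumption: log(1 + s) keeps zero similarities finite.
    # The report's exact log transform is not stated in the abstract.
    return [math.log1p(s) for s in row]

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_coefficients(similarities, relevance):
    """Ordinary least squares via the normal equations.

    Returns [intercept, w_1, ..., w_k], one weight per subvector.
    """
    X = [[1.0] + log_transform(row) for row in similarities]
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(X, relevance)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def combined_similarity(row, coef):
    """Overall similarity: intercept plus weighted log-transformed subvector similarities."""
    return coef[0] + sum(w * s for w, s in zip(coef[1:], log_transform(row)))

def r_squared(similarities, relevance, coef):
    """Fraction of variance in relevance accounted for by the fitted model."""
    preds = [combined_similarity(row, coef) for row in similarities]
    mean_y = sum(relevance) / len(relevance)
    ss_res = sum((y - p) ** 2 for y, p in zip(relevance, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in relevance)
    return 1.0 - ss_res / ss_tot

# Hypothetical query-document pairs: [terms similarity, cocitations similarity].
similarities = [
    [0.80, 0.60],
    [0.10, 0.00],
    [0.55, 0.40],
    [0.05, 0.10],
    [0.70, 0.20],
    [0.20, 0.05],
]
relevance = [1, 0, 1, 0, 1, 0]  # binary relevance judgments (made up)

coef = fit_coefficients(similarities, relevance)
rsq = r_squared(similarities, relevance, coef)
print("coefficients:", coef)
print("RSQ:", rsq)
```

In a retrieval run, the fitted weights would play the role of the subvector coefficients: each query-document pair gets one similarity per concept type, and `combined_similarity` merges them into the single value used for ranking.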