Semantic Dimensionality and Effective Data Fusion in Information Retrieval.

Funded by the Information and Data Management Program of the National Science Foundation
Grant Number IIS-9812086  
 

PI: Paul Kantor
4 Huntington Street, SCILS, New Brunswick, NJ 08901
Phone: (732) 932-1359; Fax: (732) 932-1504
Email: kantor@scils.rutgers.edu; URL: scils.rutgers.edu/~kantor

Co-PI: Kwong Bor Ng
Graduate School of Library and Information Studies, Queens College, CUNY
65-30 Kissena Blvd. Flushing, New York 11367
Phone: (718) 997-3613; Fax: (718) 997-3797
Email: kbng@qc.edu; URL: qcunix1.qc.edu/~kbng
 

Project Summary

The proposed research will investigate the effectiveness of data fusion schemes for information retrieval.  Data fusion techniques combine the estimates of relevance or usefulness provided by several different schemes, to produce a richer and more refined set of documents for examination by the person seeking information.  This problem will become ever more important as scientists, business people, students, and ordinary citizens must find information in complex networked environments.  It is unlikely that any one system will solve the problem of finding the best and most useful sources and documents.  Data fusion is widely used in image processing and signal detection, and in those settings, where the “noise” is largely random, it has been shown to give substantial performance improvements.  The proposed work will be an empirical study, based on theories developed by the investigators and using a large collection of existing data developed at the Text Retrieval Conferences (TREC) at NIST.  Its aim is to find laws or rules that predict which retrieval schemes should be combined, and how, to provide improved performance.
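
As an illustration of what combining relevance estimates can look like, the following is a minimal sketch of one simple fusion rule: rescale each scheme's scores to a common range, then sum them per document. The rule, the function names, and the sample scores are assumptions made for illustration only; they are not taken from the project and do not represent the investigators' method.

```python
from collections import defaultdict


def min_max_normalize(scores):
    """Rescale one scheme's scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as equally preferred.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse_by_score_sum(runs):
    """Fuse several {document: score} mappings into one ranked list.

    A document missing from a run contributes nothing from that run.
    Ties are broken by document identifier so the output is deterministic.
    """
    fused = defaultdict(float)
    for scores in runs:
        for doc, s in min_max_normalize(scores).items():
            fused[doc] += s
    return sorted(fused, key=lambda d: (-fused[d], d))


# Hypothetical scores from two schemes for a single topic.
run_a = {"d1": 3.0, "d2": 1.0, "d3": 2.0}
run_b = {"d2": 10.0, "d3": 30.0, "d4": 20.0}
fused_ranking = fuse_by_score_sum([run_a, run_b])
```

In this sketch a document that both schemes rate moderately well can outrank one that only a single scheme rates highly, which is the kind of effect the proposed study sets out to predict.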

Goals, Objectives, and Targeted Activities

The raw material consists of ranked lists of documents L(t,s) prepared for each of more than 250 “topics” t, by each of the schemes or systems s participating in TREC in a given year.  For each year, and for each topic t, we will compute, for each pair of systems s and s’, a generalization of Kendall’s tau coefficient that is appropriate for lists which may not contain the same entities.  The resulting measure z(s,s’) can be used as one of the predictive variables.  In addition, we have available individual performance measures w(t,s) for every topic-system combination.  The results of this research will be predictive models for deciding when two schemes (or more than two) can be effectively used in data fusion (DF) to improve information retrieval (IR).  Research will be conducted by applying discriminant analysis, non-linear clustering techniques, and other statistical methods to identify the form of the function f that relates the predictive variables to the effectiveness of fusion, and to identify the most powerful predictive variables.  Extensive use is made of the “Receiver Operating Characteristic” concept to determine whether one model is absolutely, or only conditionally, more powerful than another.
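
The summary does not spell out how Kendall’s tau is generalized to lists that contain different documents. The sketch below shows one possible convention, stated here purely as an assumption: documents missing from a list are treated as tied at a rank just below that list's last entry, and pairs tied in either list count as neither concordant nor discordant. The function names are hypothetical, and this is not claimed to be the measure z(s,s’) used in the project.

```python
from itertools import combinations


def rank_map(ranked_list, universe):
    """Map each document in the universe to its rank in ranked_list.

    Documents absent from ranked_list all share the rank len(ranked_list),
    i.e. they are tied just below the last listed document (an assumption).
    The list is assumed to contain no duplicate documents.
    """
    positions = {doc: i for i, doc in enumerate(ranked_list)}
    bottom = len(ranked_list)
    return {doc: positions.get(doc, bottom) for doc in universe}


def partial_list_tau(list_a, list_b):
    """Tau-like agreement between two ranked lists over the union of their documents.

    Returns (concordant - discordant) / (number of document pairs), a value
    in [-1, 1]; returns 0.0 when fewer than two documents are present.
    """
    universe = sorted(set(list_a) | set(list_b))
    ra, rb = rank_map(list_a, universe), rank_map(list_b, universe)
    concordant = discordant = 0
    for x, y in combinations(universe, 2):
        product = (ra[x] - ra[y]) * (rb[x] - rb[y])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # product == 0 means a tie in at least one list: counted as neither.
    total = len(universe) * (len(universe) - 1) // 2
    return (concordant - discordant) / total if total else 0.0


# Two hypothetical ranked lists that share only one document.
z = partial_list_tau(["a", "b"], ["b", "c"])
```

Under this convention two identical lists score 1, a list and its reversal score -1, and lists with little overlap are pulled toward negative values because each list's unique documents rank at the bottom of the other.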
 
Mid Term Report

Mid Term Report of Research Project APLab/RP-98/01


Grant Report 2000

Grant Report of Research Project APLab/RP-00/01

Technical Reports

1 Kantor, P., Ng, K.B. & Hull, D. (1997) Comparison of Systems Using Pairs-Out-Of-Order.
2 Hull, D., Kantor, P. & Ng, K.B. (1997) Advanced Approaches to the Statistical Analysis of TREC Information Retrieval Experiments.
3 Ng, K.B. & Kantor, P. (1998) An Investigation of Two Predictive Variables for Effective Data Fusion in Information Retrieval: Part 1: Comparison of Three Statistic Analysis Methods.
4 Ng, K.B. & Kantor, P. (1999) An Investigation of Two Predictive Variables for Effective Data Fusion in Information Retrieval: Part 2: Predictive Power of Two Parametric Analysis Methods and One Non-Parametric Analysis Method.
5 Ng, K.B. & Kantor, P. (2000) Predicting the Effectiveness of Naive Data Fusion on the Basis of System Characteristics. Also published in Journal of the American Society for Information Science, vol. 51, no. 13, November 2000.
6 Ulukbek, I., Ng, K.B. & Kantor, P. (2001) Exploration of a Geometric Model of Data Fusion.
7 Ulukbek, I., Ng, K.B. & Kantor, P. (2001) Counter Intuitive Cases in Data Fusion. Also published in the Proceedings of the 2001 ASIS Annual Meeting.

Four Related Writings Available Online:

Kantor, P.B. (1995). Decision level data fusion for routing of documents in TREC3 context: A best case analysis of worst case results. In D. Harman (ed.) Proceedings of the 3rd Text Retrieval Conference. Washington, DC: GPO.

Ng, K.B. and Kantor, P.B. (1996). Two experiments on retrieval with corrupted data and clean queries in TREC 4 adhoc task environment: Data fusion and pattern scanning. In D. Harman (ed.) Proceedings of the 4th Text Retrieval Conference. Washington, DC: GPO.

Ng, K.B., Loewenstern, D., Basu, C., Hirsh, H. & Kantor, P. (1997). Data fusion of machine learning methods for the TREC-5 routing task (and other works). In D. Harman (ed.) Proceedings of the 5th Text Retrieval Conference. Washington, DC: GPO.

Ng, K.B. and Kantor, P.B. (1998). An Investigation of the Conditions for Effective Data Fusion in IR: A Pilot Study. Proceedings of the 61st Annual Meeting of the American Society for Information Science.
 




Last Revision: 4/12/01