Semantic Dimensionality and Effective Data Fusion in Information Retrieval
Principal Investigator: Paul Kantor *
Co-Principal Investigator: Kwong Bor Ng **
* School of Communication, Information and Library Studies, Rutgers University
Contact Information
Paul Kantor
4 Huntington Street, SCILS, New Brunswick, NJ 08901
Kwong Bor Ng
65-30 Kissena Blvd. Flushing, Rosenthal Library Room 254, GSLIS, Queens College, NY 11378
Phone: (718) 997-3613; Fax : (850) 997-3797
Email:
WWW PAGE
scils.rutgers.edu/~kbng/NSF98/SemanticDimensionalityProject.html
List of Supported Students and Staff
Ibraev Ulukbek, Research Assistant
Project Award Information
Award Number: IIS-9812086
Duration: 9/15/1998 - 9/14/2001
Title: Semantic Dimensionality and Effective Data Fusion in Information Retrieval
Keywords
Data Fusion, Effectiveness Prediction, Information Retrieval, Meta-search Engines, Semantic Dimensionality
Project Summary
We investigate the effectiveness of data fusion for information retrieval. This problem will become ever more important as information must be found in complex networked environments. It is unlikely that any one system will solve the problem of finding the best and most useful sources and documents. Data fusion, widely used in image processing and signal detection, has been shown in other settings, where the "noise" is largely random, to give substantial performance improvements. This work builds on theories developed by the investigators and uses a large collection of existing data developed at the Text Retrieval Conferences (TREC) to find laws or rules that predict which retrieval schemes should be combined, and how, to provide improved performance.
Publications and Products
Publications:
(1) Ng, K.B., Kantor, P.B. (1998). An Investigation of the Conditions for Effective Data Fusion in Information Retrieval: A Pilot Study. Proceedings of the 61st Annual Meeting of the American Society for Information Science.
(2) Ulukbek, Ibraev, K.B.Ng, Kantor, P. Exploration of a geometric model of data fusion. Submitted to SIGIR Annual Conference 2000, under review.
(3) K.B.Ng, and Kantor, P. Predicting the Effectiveness of Naive Data Fusion on the Basis of System Characteristics. Submitted to Journal of American Society for Information Science, under review.
APLab Technical Reports: An Investigation of Two Predictive Variables for Effective Data Fusion in Information Retrieval: Part 1: Comparison of Three Statistic Analysis Methods; Part 2: Predictive Power of Two Parametric Analysis Methods and One Non-Parametric Analysis Method.
Project Impact
Goals, Objectives, and Targeted Activities
The results of this research will be predictive models for deciding when two or more schemes can be effectively combined by data fusion (DF) to improve information retrieval (IR). Research will be conducted by applying discriminant analysis, non-linear clustering techniques, and other statistical methods to identify the form of the predictive function f and the most powerful predictive variables. Extensive use is made of the "Receiver Operating Characteristic" concept to determine whether one model is absolutely, or only conditionally, more powerful than another. Based on our theory of Semantic Dimensionality, we have tentatively proposed two conditions for effective data fusion (Ng and Kantor 1998). The results of that pilot study support the theory.
We continue to work on a geometric model of DF. Essentially, the model proposes that for a given problem there is, in some large abstract space, a very best solution. Any particular real system at hand produces a result that can be represented by another point in that abstract space. If the system is quite good, this representative point should be quite close to the ideal point; if the system is bad, it will be quite far away. Given a good system and a bad system, we can then ask whether some point on the line joining them in this abstract space is even closer to the ideal point than the good system. The answer depends on the angle between the good system and the bad system, as seen from the ideal point. If this angle is small, the bad system lies farther away than the good system and close enough to its azimuth that every point on the line joining them is farther from the ideal point than the good system is. On the other hand, if the angle subtended by the two systems, as viewed from the ideal point, is 90 degrees, then there is some point on the line connecting them that is closer to the ideal point than either system. We are now working on locating optimal points on the line for different IR schemes and topics.
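The geometric argument above can be illustrated with a small numerical sketch. It is not the project's implementation: the two-dimensional coordinates, the Euclidean distance, and the function name closest_point_on_segment are assumptions chosen only for illustration. The sketch projects the ideal point onto the segment joining a good system and a bad system and reports how far along the segment the nearest point lies.

```python
from math import sqrt


def closest_point_on_segment(ideal, good, bad):
    """Find the point good + t * (bad - good), with 0 <= t <= 1, nearest to ideal.

    Returns (t, point, distance). A result of t == 0 means that no mixture
    of the two systems is closer to the ideal point than the good system alone.
    All names and the Euclidean geometry are illustrative assumptions.
    """
    direction = [b - g for g, b in zip(good, bad)]
    to_ideal = [i - g for g, i in zip(good, ideal)]
    denom = sum(d * d for d in direction)
    if denom == 0:
        t = 0.0  # both systems sit at the same point
    else:
        # Orthogonal projection of the ideal point onto the line,
        # clamped so that the result stays on the segment.
        t = sum(d * w for d, w in zip(direction, to_ideal)) / denom
        t = max(0.0, min(1.0, t))
    point = [g + t * d for g, d in zip(good, direction)]
    distance = sqrt(sum((p - i) ** 2 for p, i in zip(point, ideal)))
    return t, point, distance


ideal = (0.0, 0.0)

# Case 1: the systems subtend 90 degrees at the ideal point.
t_right, _, dist_right = closest_point_on_segment(ideal, (1.0, 0.0), (0.0, 2.0))

# Case 2: the bad system is farther away and nearly on the same azimuth.
t_small, _, dist_small = closest_point_on_segment(ideal, (1.0, 0.0), (2.0, 0.2))

print(t_right, dist_right)  # a mixture beats the good system (distance below 1)
print(t_small, dist_small)  # t is 0: the good system alone is the nearest point
```

In the first case the nearest point lies strictly inside the segment, so some combination is closer to the ideal point than the good system; in the second case the projection falls outside the segment and fusion along this line gives no gain.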
GPRA Outcome Goals
While the work is significant in its own right, in broadening our understanding of how different schemes for IR relate to one another, it also has theoretical and practical implications reaching beyond the field of IR research. For purposes of theory, this research will provide an extensive critical test of the theory of Semantic Dimensionality, which has been developed by the PI. This theory (Kantor, 1998a, 1998b) posits that problems of machine learning, including the problem of defining an appropriate retrieval scheme for a given problem or purpose, can be thought of as the problem of locating a "best point" in a Euclidean space whose dimension is a characteristic of the problem. We will test these ideas and, should they prove successful, will provide a foundation for their extension to other areas of machine learning and artificial intelligence. With regard to application, the proposed research has immediate application to improving the methods of document discovery and retrieval used in a host of settings, from corporate, to military, to personal. The research will be conducted using test sets from the large (nearly one million documents) TREC collections, and the retrieval schemes studied are thus representative of the kinds of schemes found in commercial software for searching the Internet, intranets, public databases of full text, and private databases of text.
Project References
Kantor P.B., Blankenbecler R, Cherikh M. (1988) Sensor Calculus. Tantalus Technical Report Tantalus/CT-88/3. Available from Tantalus Inc. 362 N. 4th Ave. Highland Pk NJ 08904.
Kantor, P.B. (1994) Information retrieval technique. Annual review of information science and technology. Vol 29, pp. 53-90
Kantor, P.B. (1995) Decision level data fusion for routing of documents in TREC3 context: A Best case analysis of worst case results. In D. Harman (ed.) Proceedings of the 3rd Text Retrieval Conference. Washington, DC: GPO.
Kantor, P.B. (1998a) Semantic dimension: On the effectiveness of naive data fusion methods in certain learning and detection problems. APLab Technical report.
Kantor, P.B. (1998b) Semantic dimension: On the effectiveness of naive data fusion methods in certain learning and detection problems. Presented at the 5th Conference on Applications of Mathematics in Artificial Intelligence. Jane 3-4, 1998.
Ng, K.B. and Kantor, P.B. (1996) Two experiments on retrieval with corrupted data and clean queries in TREC 4 adhoc task environment: Data fusion and pattern scanning. In D. Harman (ed.) Proceedings of the 4th Text Retrieval Conference. Washington, DC: GPO.
Ng, K.B., Loewenstern, D., Basu, C., Hirsh, H. & Kantor, P. (1997) Data fusion of machine learning methods for the TREC-5 routing task (and other works). In D. Harman (ed.) Proceedings of the 5th Text Retrieval Conference. Washington, DC: GPO.
Ng, K.B., Kantor, P.B. (1998) An Investigation of the Conditions for Effective Data Fusion in IR: A Pilot Study. Proceedings of the 61st Annual Meeting of the American Society for Information Science.
Area Background
Data fusion is a relatively new concept. It is an approach that combines data, evidence, or decisions coming from or based on various sources of different natures, all concerning the same set of objects, in order to improve the quality of decision making under uncertainty about those objects. Generally there are three levels of data fusion. On the primary data level, all the information available to the detecting systems is considered together in the fusion process to make an overall estimate. On the attribute level, primary signals detected from the objects by the detecting systems are processed into a set of specific attributes, and decisions about the objects are made according to an optimal decision rule based on all such attributes. On the decision level, each detecting system individually makes its own partial decision about the objects, using its own data and its own criteria, and a final decision is made based on these partial decisions. Data fusion often involves multiple imperfect sensors, each of which contributes its own estimate to the final decision. In IR, data fusion does not necessarily employ different IR systems, but only different IR schemes.
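As a rough illustration of decision-level fusion in IR, the sketch below combines the scored outputs of two retrieval schemes by summing their min-max normalized scores, a simple rule of the kind often called CombSUM in the IR literature. It is offered only as an assumed example of a naive fusion rule, not as the method used in this project; the document identifiers, the scores, and the function names are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale one scheme's scores to [0, 1] so that schemes are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal, so the scheme expresses no preference.
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse_by_score_sum(runs):
    """Rank documents by the sum of their normalized scores over all schemes.

    Each run maps a document id to the score one scheme assigned to it.
    A document that a scheme did not retrieve contributes 0 for that scheme.
    Ties are broken by document id to keep the ranking deterministic.
    """
    fused = {}
    for run in runs:
        for doc, score in min_max_normalize(run).items():
            fused[doc] = fused.get(doc, 0.0) + score
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


# Hypothetical outputs of two schemes for the same topic.
scheme_a = {"d1": 9.0, "d2": 5.0, "d3": 1.0}
scheme_b = {"d2": 0.8, "d3": 0.6, "d4": 0.2}

print(fuse_by_score_sum([scheme_a, scheme_b]))  # ['d2', 'd1', 'd3', 'd4']
```

Here each scheme makes its own partial decision (a scored list), and the final ranking is built only from those partial decisions, which is what distinguishes the decision level from the primary data and attribute levels.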
Area References
Kantor, P.B. (1995) Decision level data fusion for routing of documents in TREC3 context: A Best case analysis of worst case results. In D. Harman (ed.) Proceedings of the 3rd Text Retrieval Conference. Washington, DC: GPO.
Ng, K.B., Kantor, P.B. (1998) An Investigation of the Conditions for Effective Data Fusion in Information Retrieval: A Pilot Study. Proceedings of the 61st Annual Meeting of the American Society for Information Science.