Collaborative Proposal Submitted to the NSF 99-2 – Information and Data Management/IIS/CISE and Computation and Social Systems/IIS/CISE

Project Title: Supporting Effective Access through User- and Topic-Based Language Models

Submitting Institutions:

University of Massachusetts
Department of Computer Science
Amherst, MA 01003

Rutgers University
School of Communication, Information and Library Studies
New Brunswick, NJ 08901-1071

Principal Investigators:

W. Bruce Croft
Department of Computer Science
Lederle Graduate Research Center
University of Massachusetts
Amherst, Massachusetts 01003-4610
413-545-0463
croft@cs.umass.edu
http://ciir.cs.umass.edu

James Allan
Department of Computer Science
Lederle Graduate Research Center
University of Massachusetts
Amherst, Massachusetts 01003-4610
413-545-3240
allan@cs.umass.edu
http://www.cs.umass.edu/~allan

Nicholas J. Belkin
School of Communication, Information and Library Studies
Rutgers University
4 Huntington Street
New Brunswick, NJ 08901-1071
732-932-8585
nick@belkin.rutgers.edu

 

Collaborative Research: Supporting Effective Access through User- and Topic-Based Language Models

A. Project Summary

In today’s networked information environment, tools to support information retrieval and filtering have become common. Despite the general utility and popularity of these tools, in many important respects their performance is mediocre. Text search engines and agent-based filtering systems make mistakes that are obvious and aggravating to users, and relevant documents are usually mixed with many others that are totally unrelated. These problems significantly lower the productivity and effectiveness of people using the tools, whether in education, science, business, or government. We believe that the fundamental issue that underlies all of these problems is the lack of adequate models of the user and the domain. In order to achieve breakthroughs in retrieval and filtering accuracy, the tools need to be able to use more information about the context of the query, better models of the user, and more knowledge about the domain.

User models and models of topics or domains are not new. A number of studies in the past 20 years have examined different approaches and implementations. In general, these studies did not have a significant impact on the design of retrieval and filtering systems, despite the obvious relevance of user modeling to such systems. We believe that some reasons for this lack of impact are that previous studies were unable to specify precisely how such models would be used to affect performance, that there were severe problems with how the data for such models would be elicited, and that there was no well-defined structure within which such models could be implemented.

In this proposal, we describe a new approach to user and domain or topic modeling that has the potential of significantly improving the effectiveness of information access and filtering. This approach is based on recent research on language models for information retrieval. In this approach, it is assumed that associated with every document or group of documents there are one or more probability distributions that model how the text in the document can be generated. This generative model is quite different from the standard probabilistic retrieval models and has a number of advantages. The key advantages for this project are that language models appear to capture the important aspects of user and domain modeling that have been observed in earlier experiments, and that retrieval techniques based on document language models have been shown to be very effective.

The project we propose combines the expertise and experience of one group in the development and testing of information retrieval models and systems, with that of another in user modeling and user studies in interactive systems. These two groups have a history of successful collaboration in related domains, which provides a solid basis for the proposed collaborative project. We describe a number of research issues, potential solutions, and a comprehensive experimental program that will establish the impact of the proposed approach. The evaluation of the new techniques can be done partly using standard collections like TREC, but will also involve a number of user studies in a laboratory setting, and studies of the impact on an operational Web search application with large numbers of users.

B. Table of Contents

Project Description
-Overview and General Plan
-Language Models for Effective Access
-User Models and Language Models
-Evaluation

References

C. Project Description

C.1 Overview and General Plan

In today’s networked information environment, tools to support information retrieval and filtering have become common. Collaborative work environments and software agents are also available in a variety of forms. Despite the general utility and popularity of these tools, in many important respects their performance is mediocre. Some of the specific problems are as follows:

We believe that the fundamental issue that underlies all of these problems is the lack of adequate models of the user and the domain. In order to achieve breakthroughs in retrieval and filtering accuracy, the tools need to be able to use more information about the context of the query, better models of the user, and more knowledge about the domain. In order to be more autonomous, agents need to have more accurate models of users and domains. Collaborative work tools need to support the acquisition, exchange, and comparison of user and domain models to facilitate communication and collaboration.

User models are not new. Neither are models of domains or topics. A number of studies, particularly in the 1980s, examined different approaches and implementations of these ideas (Rich, 1979; Belkin et al, 1982; Daniels et al, 1985; Croft and Thompson, 1987; Brajnik et al, 1987; Fox, 1987; Brooks, 1987; Kobsa and Wahlster, 1989; Croft and Das, 1990; Belkin, 1997). This research, although improving the understanding of how people interact with information systems, did not result in practical approaches to acquiring and using models beyond simple stereotypes (e.g. Rich, 1979; Croft and Thompson, 1987). Relevance feedback and other learning techniques have also been proposed and tested as a means of understanding users and improving the performance of information systems (Salton and McGill, 1983). These techniques were shown to be very effective in early experiments (Salton and Buckley, 1990) but have not scaled effectively to full text systems and are not currently viewed as reliable in large systems such as Web search engines. Collaborative filtering (Resnick et al, 1994) is another technique for acquiring user models that has interesting applications but is too limited to solve the problems mentioned.

In this proposal, we describe a new approach to user and domain or topic modeling that has the potential of significantly improving the effectiveness of information access, as well as providing a basis for more autonomous information agents. This approach is based on Ponte and Croft’s recent work on language models for information retrieval (Ponte and Croft, 1998; Ponte, 1998). In this work, it is assumed that associated with every document or group of documents there are one or more probability distributions that model how the text in the document can be generated. This generative model is quite different from the standard probabilistic classification models described in Van Rijsbergen (1979) and Turtle and Croft (1992), and has the following advantages:

It is this last feature of language models that we will investigate in this proposal. We believe that this approach can provide more rigorous and effective techniques to capture the ideas and concepts described in earlier work on user modeling such as Belkin et al (1982), Daniels et al (1985), Croft and Thompson (1987), Croft et al (1989), and Croft and Das (1990). This earlier work suggested the importance of capturing word and phrase associations to capture a user’s context and to represent domains, but the approaches used were ad-hoc, the testbeds were very limited, and the general information infrastructure at the time was not conducive to user experiments. The combination of the language modeling approach, the large testbeds developed for TREC, and the new Internet infrastructure has opened up new opportunities to provide effective solutions to the critical problem of user and domain modeling.

We propose to address this general problem by leveraging the complementary expertise and experience of our two different research groups. The group at the University of Massachusetts will be primarily concerned with development of the language model approach and its application to the problems of user and domain modeling, and with associated evaluation. The group at Rutgers University will be primarily concerned with the problems of elicitation and construction of user models during interactive information retrieval, with establishing how people do, and could, use topic models in their searching, and with evaluation of the techniques that are developed in experimental settings. Each group will use the results of the other in its own work, and they will work together in designing and conducting a large-scale evaluation of the combined results in an operational environment. Although each group will manage its own parts of the project, the project as a whole will be coordinated by the team of three Co-PIs. This will be accomplished through a regular schedule of visits of Co-PIs and other research staff between the two sites (at least one two-day visit every other month), and by development of an overall plan of coordination of work as is specified in section C.4. The two groups have a history of successful collaboration (e.g. Belkin, Cool, Croft & Callan, 1993; Belkin & Croft, 1987 & 1992; Belkin, et al., 1998), and a history of joint and related work on problems of user modeling in information retrieval (e.g. Belkin, Seeger & Wersig, 1983; Belkin, et al. 1987; Croft & Thompson, 1987).

In the next section, we describe in more detail how language models can be used to improve information retrieval and represent domain models. We then propose how the language model approach can be used to capture user models and how these models would be used in a retrieval system. Section C.4 outlines the experiments and methodology for evaluating the proposed research.

C.2 Language Models for Effective Access

Over the past three decades, probabilistic models of document retrieval have been studied extensively. In general, these approaches can be characterized as methods of estimating the probability of relevance of documents to user queries. One component of a probabilistic retrieval model is the indexing model, i.e., a model of the assignment of indexing terms to documents.

A well-known example of an indexing model is the 2-Poisson model, due to Bookstein and Swanson (1976) and also to Harter (1975). By analogy to manual indexing, the task was to assign a subset of words contained in a document (the "specialty words") as indexing terms. The probability model was intended to indicate the useful indexing terms by means of the differences in their rate of occurrence in documents "elite" for a given term, i.e., a document that would satisfy a user posing that single term as a query, vs. those without the property of eliteness.

The success of the 2-Poisson model has been somewhat limited but it should be noted that Robertson's tf, which has been quite successful, was intended to behave similarly to the 2-Poisson model (Robertson and Walker, 1994). Other researchers have proposed a mixture model of more than two Poisson distributions that has been shown to better fit the observed data. Despite this, the n-Poisson model has not brought about increased retrieval effectiveness. In any event, the semantics of the underlying distributions are less obvious in the n-Poisson case as compared to the 2-Poisson case where they model the concept of eliteness.

Apart from the adequacy of the available indexing models, estimating the parameters of these models is a difficult problem. Rather than making parametric assumptions, as is done in the 2-Poisson model where it is assumed that terms follow a mixture of two Poisson distributions, in the language modeling approach, as Silverman (1985) said, "the data will be allowed to speak for themselves". The language modeling approach also avoids the notion of eliteness. In the 2-Poisson model, it was assumed that a document elite for a given term would satisfy a user if the user posed that single term as a query. Since that time, the prevailing view has come to be that multiple term queries are more realistic. In general, this leads to a combinatorial explosion of elite sets, one for each possible subset of terms in the collection. We take the view that each query needs to be looked at individually and that documents will not necessarily fall cleanly into elite and non-elite sets.

The phrase "language model" is used by the speech recognition community to refer to a probability distribution that captures the statistical regularities of the generation of language (Yamron, 1997). In the context of the retrieval task, we treat the generation of queries as a random process. Generally speaking, language models for speech attempt to predict the probability of the next word in an ordered sequence. For the purposes of document retrieval, one can model occurrences at the document level without regard to sequential effects and obtain good retrieval results. It is also possible to model local predictive effects for features such as phrases (Ponte, 1998). Regarding query generation as a random process, it is not the case that queries really are generated randomly, but it is the case that retrieval systems are not endowed with knowledge of the generation process. Instead, we treat language generation as a random process modeled by a probability distribution and focus on the estimation of probabilities as a means of achieving effective retrieval.

The approach to retrieval described in Ponte and Croft (1998) is to infer a language model for each document and to estimate the probability of generating the query according to each of these models. Documents are then ranked according to these probabilities. By focusing on the query generation probability as opposed to the probability of relevance, this model does not require a set of inferences for indexing and a separate set of inferences for retrieval.

Most retrieval systems use term frequency, document frequency and document length statistics. Typically these are used to compute a tf.idf score with document length normalization (Robertson and Walker, 1994). In the language modeling approach, collection statistics such as term frequency, document length and document frequency are integral parts of the language model and do not have to be included in an ad hoc manner. This, and the absence of indexing probabilities, distinguish the language modeling approach from other probabilistic retrieval models (e.g. Robertson and Sparck Jones, 1977; Fuhr, 1989; Wong and Yao, 1989; Turtle and Croft, 1992).

The score for a document in the simple unigram model used in Ponte and Croft (1998) is given by:

$$P(Q \mid M_d) = \prod_{t \in Q} P(t \mid M_d)$$

where $P(Q \mid M_d)$ is the estimate of the probability that a query will be produced by a language model for a given document, and $\prod_{t \in Q} P(t \mid M_d)$ is the probability of producing the terms in the query.

Much of the power of the model comes from the estimation techniques used for these probabilities, which combine both maximum likelihood estimates and background models. This part of the model benefits directly from the extensive research done on estimation of language models in fields such as speech recognition (Manning and Schutze, 1999). More sophisticated models that make use of bigram and even trigram probabilities are described in Ponte (1998) and are currently being investigated. Even with the simple model, retrieval experiments showed significant effectiveness improvements relative to sophisticated tf.idf systems (Ponte and Croft, 1998).
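As an illustration of ranking by query generation probability, the following minimal sketch scores each document by log P(Q|Md) under a unigram model. It is not the estimator of Ponte and Croft (1998): the linear interpolation of the maximum likelihood estimate with a collection-wide background estimate, the weight `lam`, the whitespace tokenizer, and the toy documents are all assumptions made for the example.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification for illustration)."""
    return text.lower().split()

def query_log_likelihood(query, doc_tokens, coll_counts, coll_size, lam=0.5):
    """log P(Q | M_d) under a unigram model with linear interpolation.

    lam weights the document's maximum likelihood estimate and (1 - lam)
    weights the collection (background) estimate; both are assumptions.
    """
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for t in tokenize(query):
        p_ml = doc_counts[t] / doc_len if doc_len else 0.0
        p_bg = coll_counts[t] / coll_size
        p = lam * p_ml + (1 - lam) * p_bg
        if p == 0.0:  # term never seen anywhere in the collection
            return float("-inf")
        score += math.log(p)
    return score

def rank(query, docs, lam=0.5):
    """Return document ids sorted by decreasing query likelihood."""
    tokens = {d: tokenize(text) for d, text in docs.items()}
    coll_counts = Counter(t for toks in tokens.values() for t in toks)
    coll_size = sum(coll_counts.values())
    scores = {d: query_log_likelihood(query, toks, coll_counts, coll_size, lam)
              for d, toks in tokens.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy collection.
docs = {
    "d1": "star wars missile defense system budget",
    "d2": "star wars film sequel opens in theaters",
    "d3": "hydroelectric dam project cost overruns",
}
ranking = rank("missile defense", docs)
```

Because the background estimate gives non-zero probability to any term seen somewhere in the collection, a document that lacks a query term is penalized rather than eliminated, which is the practical reason for combining the two estimates.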

Topic-Based Language Models

The idea of a language model representing the text written in specific documents leads directly to the possibility of using language models to represent topics in domains and users’ views of domains. We will discuss the user modeling aspect in the next section, and focus here on techniques for acquiring and using language models that represent topics in a domain.

A crucial part of achieving effective retrieval is establishing a context for the query. The query "Star Wars" can be interpreted very differently in the context of missile defense systems rather than Hollywood films. Many approaches have been tried to identify and use context, mostly in the form of query expansion techniques. For example, the Local Context Analysis technique developed at U.Mass. (Xu and Croft, 1996) identifies words and phrases associated with the query context by analyzing retrieved documents. This technique, although one of the most successful in terms of improving retrieval effectiveness, is ad-hoc and cannot distinguish multiple contexts for a given query. We believe that the language model approach provides a more principled way of describing and using context that will lead to substantially more effective retrieval.

We propose that language models for important contexts or topics will be based on groups of similar documents. We call these topic models to distinguish them from models based on individual documents. To generate topic models for a set of documents, the documents would first need to be clustered or grouped, and then a model could be estimated for each group. Note that this represents a different form of the clustering hypothesis (Van Rijsbergen, 1979), which states that closely associated documents tend to be relevant to the same requests. Instead, we are assuming that closely associated documents will have the same underlying language model. A variation of this approach would be to cluster document passages and allow multiple topic models to be associated with a given document.

We do not expect the choice of clustering algorithm used to group documents to have a major impact on the quality of the topic models generated. We have used both K-Means and average-link clustering (Manning and Schutze, 1999) in a recent study on language models for distributed search (Xu and Croft, 1999), and in the proposed project we would compare their performance again. We would also investigate a probabilistic clustering algorithm that makes more direct use of the language modeling approach. Instead of using an ad-hoc similarity function to compare documents, we could compare the underlying language models or probability distributions using the Kullback-Leibler "distance" measure (Manning and Schutze, 1999). The idea would be to group documents whose language models are similar, and then infer a topic model from that group. We also plan to look at a language model version of K-Means.
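To make the idea of comparing language models directly more concrete, here is a minimal sketch of the Kullback-Leibler divergence between two smoothed unigram document models. The additive smoothing constant `alpha`, the symmetrized variant, and the toy documents are assumptions of this example, not choices stated in the proposal.

```python
import math
from collections import Counter

def smoothed_model(tokens, vocab, alpha=0.1):
    """Unigram distribution over vocab with additive smoothing (assumed here)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(p, q):
    """KL(p || q) = sum_w p(w) * log(p(w) / q(w)).

    The measure is asymmetric, which is why it is a "distance" only loosely.
    Smoothing guarantees q(w) > 0 for every w in the shared vocabulary.
    """
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)

def symmetric_kl(p, q):
    """Symmetrized variant, convenient when a clustering routine expects d(a, b) == d(b, a)."""
    return kl_divergence(p, q) + kl_divergence(q, p)

# Hypothetical toy documents.
doc_tokens = {
    "a": "dam river power dam".split(),
    "b": "dam power turbine river".split(),
    "c": "film actor sequel film".split(),
}
vocab = sorted({w for toks in doc_tokens.values() for w in toks})
models = {d: smoothed_model(toks, vocab) for d, toks in doc_tokens.items()}

d_ab = symmetric_kl(models["a"], models["b"])
d_ac = symmetric_kl(models["a"], models["c"])
```

A clustering routine could then use `symmetric_kl` in place of an ad-hoc similarity function; in this toy data the two documents about dams are closer to each other than either is to the film document.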

Once the clusters have been generated, we plan to represent topics using a bigram model. This means that the important unigrams (words) and bigrams (phrases) would be identified and frequency information used to estimate probabilities. We believe that a limited form of bigram model is the most appropriate for information retrieval tasks, in that it should only be necessary to model a limited amount of the sequential nature of the text (simple phrases). This hypothesis, and the form of the estimation and backoff techniques needed for maximum effectiveness, will be tested in the course of this project.
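One possible shape for such a limited bigram model is sketched below. The class name, the interpolation weight `mu`, the additive smoothing of unigrams, and the rule of falling back to the unigram estimate for an unseen context are all assumptions for illustration; the proposal leaves the actual estimation and backoff techniques as a research question.

```python
from collections import Counter

class InterpolatedBigramModel:
    """Bigram estimates interpolated with smoothed unigram estimates (hypothetical sketch)."""

    def __init__(self, sentences, mu=0.7, alpha=1.0):
        self.mu = mu          # weight on the bigram maximum likelihood estimate
        self.alpha = alpha    # additive smoothing for unigrams
        self.unigrams = Counter()
        self.bigrams = Counter()
        self.context = Counter()  # how often each word starts a bigram
        for sent in sentences:
            self.unigrams.update(sent)
            for w1, w2 in zip(sent, sent[1:]):
                self.bigrams[(w1, w2)] += 1
                self.context[w1] += 1
        self.total = sum(self.unigrams.values())
        self.vocab_size = len(self.unigrams) + 1  # one extra slot for unseen words

    def p_unigram(self, w):
        return (self.unigrams[w] + self.alpha) / (self.total + self.alpha * self.vocab_size)

    def p_bigram(self, w2, w1):
        """P(w2 | w1); falls back to the unigram estimate when w1 was never a context."""
        if self.context[w1] == 0:
            return self.p_unigram(w2)
        ml = self.bigrams[(w1, w2)] / self.context[w1]
        return self.mu * ml + (1 - self.mu) * self.p_unigram(w2)

# Hypothetical tokenized text from one cluster.
cluster = [
    ["cost", "overruns", "delay", "dam"],
    ["dam", "cost", "overruns"],
    ["hydroelectric", "dam", "project"],
]
model = InterpolatedBigramModel(cluster)
```

In this toy cluster "overruns" always follows "cost", so `model.p_bigram("overruns", "cost")` is much larger than `model.p_bigram("dam", "cost")`, which is the kind of simple phrase evidence the limited bigram model is meant to capture.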

Given a set of topic models derived from document clusters, the next task would be to select the model or models most appropriate for a given query. This could be done automatically by calculating the probability that a topic model could generate the query, similar to retrieval. This would give a ranked list of topics and one or more could be selected from the top. Alternatively, the users could select topics from descriptions provided by the system. Both of these techniques will be evaluated in this project.
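The automatic selection step can be sketched as follows, using the "Star Wars" ambiguity as a toy case. The function names, the mixing weight `lam`, the pooling of a cluster's tokens into one unigram model, and the example clusters are assumptions of this sketch.

```python
import math
from collections import Counter

def background_model(all_docs):
    """Unigram maximum likelihood model over every document in the collection."""
    counts = Counter(t for doc in all_docs for t in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def topic_model(cluster_docs, background, lam=0.8):
    """Pool a cluster's tokens into one unigram model, mixed with the background model."""
    counts = Counter(t for doc in cluster_docs for t in doc)
    total = sum(counts.values())
    return {w: lam * counts[w] / total + (1 - lam) * background[w] for w in background}

def rank_topics(query_tokens, topics):
    """Sort topic names by log P(query | topic model), highest first.

    Query terms absent from the collection vocabulary are skipped (an assumption).
    """
    def loglik(model):
        return sum(math.log(model[t]) for t in query_tokens if t in model)
    return sorted(topics, key=lambda name: loglik(topics[name]), reverse=True)

# Hypothetical clusters of tokenized documents.
clusters = {
    "missile defense": [
        ["star", "wars", "missile", "defense", "laser"],
        ["missile", "defense", "budget", "star", "wars"],
    ],
    "films": [
        ["star", "wars", "film", "sequel"],
        ["film", "director", "star", "wars", "trilogy"],
    ],
}
background = background_model([d for docs in clusters.values() for d in docs])
topics = {name: topic_model(docs, background) for name, docs in clusters.items()}
ranked = rank_topics(["star", "wars", "missile"], topics)
```

The top of `ranked` could be used directly as the query context, or the ranked list could be shown to the user for manual selection.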

After one or more topic models have been chosen as the context for the query, the issue is how to use them to improve the retrieval process. The topic models should not replace the query but instead should augment it. This is similar to typical query expansion algorithms that produce a new query containing the original query and a downweighted set of expansion terms (Xu and Croft, 1996). In the project, we will investigate probabilistic approaches to retrieval using a context topic model. One possibility would be to first compare document language models to the topic models and only calculate the query generation probabilities for document models that match the context. Another possibility would be to use the context to add terms to the query, similar to a local feedback or expansion algorithm (Xu and Croft, 1996). Ponte (1998) showed that this approach can be very effective.

In Ponte’s method for local or relevance feedback, terms in the identified set of documents (either the top ranked set or some initial relevant documents) were ranked according to the sum of the log ratios of the probabilities derived from the document language models and the collection or background model. This is:

$$\sum_{d \in T_n} \log \frac{P(t \mid M_d)}{cf_t / cs}$$

where $T_n$ is the set of top ranked documents (or the relevant documents), $P(t \mid M_d)$ is the probability of term $t$ given the document language model, $cf_t$ is the raw count of term $t$ in the collection, and $cs$ is the raw collection size. In addition, to model the co-occurrence component of the Local Context Analysis technique, a co-occurrence model was added to the computation of the ranking of the terms. The final ranking was calculated using:

where is the probability of t occurring given an occurrence of a query term q, estimated as follows:

In the research proposed for this project, we will look at extensions to Ponte’s approach for topic models. Instead of estimating probabilities and log ratios from individual document models, we can use topic models and compare retrieval results.
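The log-ratio ranking of candidate terms described above (without the co-occurrence component) might look like the following sketch. The smoothing used for P(t|Md), the weight `lam`, and the toy collection are assumptions; the sketch also assumes the top ranked documents belong to the collection, so every candidate term has a non-zero collection count.

```python
import math
from collections import Counter

def doc_model_prob(term, doc_counts, doc_len, cf, cs, lam=0.5):
    """P(t | M_d): maximum likelihood estimate mixed with cf_t / cs (assumed smoothing)."""
    return lam * doc_counts[term] / doc_len + (1 - lam) * cf[term] / cs

def rank_expansion_terms(top_docs, cf, cs, lam=0.5):
    """Rank terms by the sum over top documents of log(P(t | M_d) / (cf_t / cs))."""
    scores = Counter()
    candidates = {t for doc in top_docs for t in doc}
    for doc in top_docs:
        counts = Counter(doc)
        for t in candidates:
            p = doc_model_prob(t, counts, len(doc), cf, cs, lam)
            scores[t] += math.log(p / (cf[t] / cs))
    return [t for t, _ in scores.most_common()]

# Hypothetical tokenized collection; the first two documents play the role of T_n.
collection = [
    "the dam project the cost overruns".split(),
    "the dam corruption the cost".split(),
    "the film the sequel the actor".split(),
    "the market the economy the stocks".split(),
]
cf = Counter(t for doc in collection for t in doc)  # raw collection counts cf_t
cs = sum(cf.values())                               # raw collection size cs
ranked_terms = rank_expansion_terms(collection[:2], cf, cs)
```

Terms that are frequent in the top documents but rare in the collection rise to the top, while a word such as "the", which is common everywhere, receives a negative score and falls to the bottom. Replacing the per-document models with a single topic model would be one way to explore the extension proposed here.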

To summarize, the basic steps in the proposed process for acquiring and using topic models in retrieval are:

  1. Cluster an initial set of documents. This can be either a corpus, a subset of a corpus (such as documents identified as a class in a classification hierarchy), or a set of documents retrieved by a query.
  2. Form topic models based on the document clusters.
  3. Identify appropriate topic models for a query.
  4. Augment the retrieval process using those topic models.

Each of these steps will require significant research, as described. Some of the main issues to be studied will be:

The resulting techniques will bear some resemblance to query expansion techniques. We expect the language model techniques to perform significantly better, because of the new underlying probabilistic framework and the success of initial experiments reported in Ponte (1998). Another major difference is that our aim is to build topic models that are valid over time, can be used for many queries, and can be said to represent domain knowledge. Query expansion techniques do not, in general, attempt to build this type of representation. Note that when topic models are used interactively with users, this opens up another major research issue:

Given that a language model is a probability distribution of unigrams and bigrams, explaining what this "means" to a user could be a challenge. Both visual and textual presentation techniques may be appropriate. An example of a textual approach would be to summarize the language model by the sentences that have the highest probability of being generated by the model, in other words, the most "typical" sentences. Although sentence-based summaries have been studied extensively (e.g. Tombros and Sanderson, 1998), the language model approach is a new perspective that may produce better results. Visualization techniques, such as those described in Leouski and Allan (1997), could present a graphical representation of the important words and phrases in the model. This issue, which involves the user’s perception of language models, brings us to the discussion of user models.
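The textual approach of picking the most "typical" sentences could be sketched as below. Scoring by average per-word log probability (so that short sentences are not favored automatically), the `floor` probability for unseen words, and the toy unigram model are assumptions of this example.

```python
import math

def typical_sentences(sentences, model, k=2, floor=1e-6):
    """Return the k sentences with the highest average per-word log probability.

    model maps words to unigram probabilities; words missing from the model
    receive the small floor probability (an assumption of this sketch).
    """
    def avg_logprob(sentence):
        words = sentence.lower().split()
        if not words:
            return float("-inf")
        return sum(math.log(model.get(w, floor)) for w in words) / len(words)
    return sorted(sentences, key=avg_logprob, reverse=True)[:k]

# Hypothetical unigram topic model and candidate sentences.
topic = {"dam": 0.3, "hydroelectric": 0.2, "project": 0.2,
         "the": 0.1, "cost": 0.1, "river": 0.1}
candidates = [
    "the dam cost",
    "weather was pleasant yesterday",
    "the hydroelectric dam project",
]
summary = typical_sentences(candidates, topic, k=2)
```

With a bigram topic model, the same ranking could use conditional word probabilities instead, which would reward sentences that contain the model's characteristic phrases.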

C.3 User Models and Language Models

User models have an enormous potential for impacting the effectiveness of information systems. We have already mentioned the importance of understanding the context of a query, and we can expect that the context will, to some extent, depend on the user who asked the query, rather than being determined solely by the corpus. Belkin, Brooks & Daniels (1987) have described the various phases in intermediary dialogues and shown that modeling the user’s goals, information needs, and domain knowledge is an essential part of an effective interaction. Belkin (1980) suggested that a model of the state of knowledge of the user was even more important than a query in identifying text that would satisfy the information need. His ASK (Anomalous State of Knowledge) model was the basis of a system that represented user models as a graph of connected words and phrases, where a connection indicated a strong relationship derived from text provided by the user (Belkin, Oddy & Brooks, 1982; Belkin & Kwasnik, 1986). This was, in fact, a simple language model used to represent a user’s state of knowledge on a topic.

The I3R system (Croft and Thompson, 1987) used "stereotypes", domain models, and request models in a process designed to support the major aspects of an intermediary interaction. The concept of stereotypes was based on the work of Rich (1979) who used a variety of profiles to recommend books in a library setting. In the information retrieval environment, stereotypes were found to be of limited utility, except to configure the interface and system parameters for level of expertise. Brajnik, Guida & Tasso (1990) discussed a system which modeled various aspects of the user in an information retrieval system, by assigning the user, both through direct elicitation and through inference, to various values on characteristics such as domain expertise, familiarity with information retrieval, and goals. This work went somewhat beyond stereotypes, but still depended upon assigning users to categories. In user experiments based on the capabilities of the I3R system, Croft and Das (1990) showed that user domain models could have a significant impact on effectiveness. In contrast to stereotypes, these models were designed simply to acquire the main words, phrases and associations that a person made in response to a specific topic. This, once again, is a simple form of language model that, in this case, was derived directly by interaction with the user.

Agent-based systems have been proposed for filtering information or finding information on specific topics. One of the supposed advantages of these systems is the ability to act autonomously. To do this, however, the agent must have an effective model of the user and how they would respond to the information that is found. Most current systems rely on relatively simple user profiles based on Boolean combinations of features or weighted combinations derived from user feedback. This type of model is difficult to generalize to other queries and therefore little autonomy is developed over time in terms of responding to new situations.

To summarize this research on user models, we can roughly characterize the types of user models that have been suggested for information retrieval or information filtering purposes according to the type of user characteristic they are modeling (e.g. user goals, user experience with system, user knowledge of the domain, domain of user problem, and so on), according to whether they are long-term or short-term, according to how the data are elicited (direct or inferred), and according to how the model is structured (stereotype or individual description). The types of user models that have had the most impact on the effectiveness of retrieval systems, and that appear to be most robust, have been relatively simple, individually constructed models of the user’s view of a domain, or of the user’s state of knowledge with respect to the domain of interest. Such models, in general, are based upon the words and phrases that a user associates with a topic, regardless of whether they are long- or short-term or how they are elicited. We believe that language and topic models, as previously described, are the right basis for generating user domain models of this sort, since they provide a probabilistic framework for generating words and phrases in response to a topic. But rather than acquiring the models solely from the corpus, as was done in the last section, we need to acquire models that reflect individual views of domains and topics.

We propose to investigate four approaches to acquiring user language models. We expect to acquire these user models over time and that the acquisition will be driven by the queries submitted to the system. For a given user, then, there will be a number of models representing different topics. We therefore refer to these models as user-based topic models. We can characterize them as long-term, individually constructed models of the domain of the user’s interest(s). Our research foci will be on methods for elicitation of the data for constructing such models, and on specific techniques for their structure, representation, and use for retrieval purposes. We will also consider the possibility of using language model techniques to represent short-term, single searching episode user topic models.

The simplest elicitation method is to ask the user for more words about the domain. For instance, Croft and Das (1990) prompted users for words and phrases that they considered to be associated with a topic. In this approach, models must be accumulated over time as a person uses a system. This means that for many new queries, the user model will not be able to provide context. The evidence from Croft and Das, however, suggests that the "association dialogue" with the system can provide significant effectiveness benefits even in the absence of prior associations in the user model. A major challenge for this approach will be to make the association dialogue acceptable to users. Direct elicitation dialogues offer the advantage of being dependent solely upon user associations, but they suffer from the well-known problem that searchers have great difficulties in making such unprompted associations (Belkin, Marchetti & Cool, 1993). This will be a major research issue in our project.

As an example of an association dialogue, consider the query "hydroelectric projects in overseas countries". Immediately after the query had been entered, the system would ask the user to provide additional words and phrases related to their interest. The user might respond with "dams", "cost overruns", "corruption" and "South-East Asia", which would indicate a very different context than a response of "dams", "ecological damage", "Quebec", "wildlife". Both contexts are equally valid and documents about ecology and corruption could be in the same corpus. This indicates the potential advantage of user-based topic models, which can be very specific, instead of corpus-based topic models, which will tend to be more diverse and general. This example also indicates one of the challenges of user models, which is how to build representations that are useful over time. The user interested in corruption and cost overruns may later generate a query "South-East Asian economy and stock markets". Simply adding the terms "dam" and "corruption" to the query because they were previously associated is unlikely to be an effective strategy. Instead, we need to generate a context for the new query that includes topic models from the corpus modified by what we know about the user. For example, the interest in "corruption" and "cost overruns" may indicate a likely preference for a topic model that represents documents dealing with government interventions in the South-East Asian economies and stock market, rather than a model that represented documents about the impact of the South-East Asian economy on the U.S. The issue of how to incorporate user models in the retrieval process is discussed later.
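One hypothetical way to let earlier user associations influence the choice of a corpus topic model for a new query is sketched below. The additive combination of the two log likelihoods, the weight `beta`, the `floor` probability, and the toy topic models are all assumptions for illustration and are not the method proposed in this project.

```python
import math

def log_likelihood(terms, model, floor=1e-6):
    """Sum of log probabilities of terms under a unigram model, with a floor for unseen terms."""
    return sum(math.log(model.get(t, floor)) for t in terms)

def choose_topic(query_terms, user_terms, topics, beta=0.5):
    """Pick the topic model that best explains the query plus, with weight beta, the user's associations."""
    def score(name):
        m = topics[name]
        return log_likelihood(query_terms, m) + beta * log_likelihood(user_terms, m)
    return max(topics, key=score)

# Hypothetical corpus topic models for the query on the South-East Asian economy.
topics = {
    "government intervention": {"economy": 0.2, "asian": 0.2, "stock": 0.15,
                                "corruption": 0.15, "government": 0.2, "cost": 0.1},
    "impact on U.S.": {"economy": 0.25, "asian": 0.2, "stock": 0.2,
                       "exports": 0.15, "u.s.": 0.2},
}
query = ["asian", "economy", "stock"]
associations = ["corruption", "cost"]  # terms the user supplied for an earlier query

without_user = choose_topic(query, associations, topics, beta=0.0)
with_user = choose_topic(query, associations, topics)
```

In this toy setting the query alone slightly favors the "impact on U.S." model, while taking the stored associations into account shifts the choice to the "government intervention" model, without adding the old terms to the query itself.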

A second approach to acquiring user models will use topic models to suggest associations to the user and then accept changes and additions. The advantage of this approach is the ability to construct models based on the corpus that will apply to many potential queries, rather than requiring an association dialogue for every new query. This approach once again raises the issue of how to present language and topic models to users. Simply listing words and phrases that are characteristic of the model may not be adequate. For example, here is a list of words and phrases produced by the Local Context Analysis technique (Xu and Croft, 1996) for the query "hydroelectric projects in overseas countries":

This list, although it contains useful phrases that are related to the query topic, may be difficult for a person to absorb quickly because it contains a large variety of words that cover many aspects of the topic and are not connected in any coherent manner. On the other hand, it is a convenient form for the user to delete phrases that do not match their view of the topic and add new phrases. For example, the person interested in projects in South-East Asia could rapidly select "Laos" and "Vinh Son" and delete "Hungary" and "Rio Arriba County". Work by Koenemann (1996) and by Belkin, et al. (1996, 1998) investigating relevance feedback as a term suggestion device indicates that users can choose effective search terms from such lists, and that such prompting, and especially related reading of retrieved texts, leads to the addition of new, non-suggested terms to the query. We will investigate combining the list-of-phrases presentation with richer structures to give more coherence to the presentation. We also plan to look at whether useful hierarchical summaries could be generated by showing typical sentences and lists of phrases from subclusters of the cluster used to form the topic model. To the extent that we can usefully display, and users can successfully choose, increasingly complex topic models, this method addresses the first method's lack of immediate context.

The third approach to acquiring user models will be similar to collaborative filtering. Based on a query or an initial associative dialogue, similar models generated by other users of the system could be suggested and modified. This is similar to the second approach, except that the models would come from other users rather than the corpus. One of the goals of this project will be to see if users are willing to generate and modify topic models or whether they would rely mostly on system-generated models. An obvious problem with this method, which will be a focus of investigation, is measuring similarity between models.
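As an illustration of what measuring similarity between models could involve, the following is a minimal sketch that treats each user model as a unigram distribution over its words and phrases and compares two models with the Jensen-Shannon divergence. The choice of divergence, the uniform weighting of phrases, and the function names are assumptions made for illustration; they are not a measure the project has settled on. The two models reuse the example responses from the association dialogue above.

```python
import math
from collections import Counter

def unigram_model(phrases):
    """Turn a list of words/phrases into a maximum-likelihood distribution."""
    counts = Counter(phrases)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two term distributions.

    0.0 means the models are identical; 1.0 means they share no terms.
    """
    vocab = set(p) | set(q)
    # Midpoint distribution; every term in p or q has non-zero mass here.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl(a):
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

# Illustrative user models taken from the two example dialogues.
corruption = unigram_model(["dams", "cost overruns", "corruption", "South-East Asia"])
ecology = unigram_model(["dams", "ecological damage", "Quebec", "wildlife"])

print(js_divergence(corruption, corruption))
print(js_divergence(corruption, ecology))
```

A symmetric, bounded measure is convenient here because a model suggested from another user can then be accepted or rejected against a fixed threshold, although any such threshold would have to be set empirically.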

The final approach that we will examine is using the documents that are viewed during searching to construct models. This approach assumes that the user is more interested in the documents that are viewed than the other documents found during a search, and that the user’s state of knowledge is affected by what is read. By keeping a record of which documents are read, clusters of these documents can periodically be constructed and user-based topic models inferred from the clusters. This is similar to the construction of topic models from a corpus, but is based on specific subsets of documents chosen by the user to read. For example, if a person enters a series of queries on construction projects (such as the hydroelectric projects query mentioned above), but never reads any of the Federal Register documents retrieved by the search system, the user model that is generated will not contain words that are representative of this type of government document. A topic model constructed solely from the corpus would, on the other hand, tend to favor these documents since they are typically much longer than other types of documents. A potential advantage of this method is that it might be possible to generate useful user-based topic models for single search episodes, rather than only long-term models. The results of such interaction might be equivalent to an interactively constructed Anomalous State of Knowledge (Belkin, Oddy & Brooks, 1982), and are related to Oddy’s (1977) idea of directing search through interaction with documents.
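The Federal Register example can be made concrete with a small sketch. It assumes that a search episode is logged as a list of viewed document identifiers and builds a single unigram model from those documents only; the clustering step described above is omitted, and the document texts, identifiers, and function name are hypothetical.

```python
from collections import Counter

def model_from_viewed(documents, viewed_ids):
    """Build a unigram user model from only the documents the user read."""
    counts = Counter()
    for doc_id in viewed_ids:
        counts.update(documents[doc_id].lower().split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

# Hypothetical retrieved set: two news stories and one Federal Register notice.
documents = {
    "news-1": "dam construction cost overruns corruption",
    "news-2": "dam project corruption inquiry",
    "fr-1": "federal register notice of dam licensing regulation",
}

# The user opened the news stories but never read the Federal Register notice.
viewed = ["news-1", "news-2"]
user_model = model_from_viewed(documents, viewed)

print(sorted(user_model.items(), key=lambda item: -item[1])[:3])
print("register" in user_model)
```

Because the unread notice contributes no counts, its characteristic vocabulary is absent from the user model, regardless of how long that document is.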

Given user models constructed using one or more of these approaches, the next issue is how they would be incorporated into the retrieval process. Our current view is that user models should be combined with topic models to select appropriate contexts for queries. As in the case of using topic models alone, however, we do not yet know the best method for using the models and a substantial part of the research will be devoted to this topic. User models could be used to augment queries, select topic models, or augment topic models. Some of the user-based topic models for a given user could be combined to form a model that is representative of a larger topic area.

The first step in any approach would be to compare user models and topic models to the query. Given that user models will tend to be much smaller in terms of the number of words and phrases that are explicitly mentioned, they will typically be ranked lower than topic models. On the other hand, because they are acquired from the user, they are more important. A user model that ranks highly should probably be used to directly augment the query. A lower ranked user model could be used to augment or select between topic models. The selected topic model(s) would then be used to identify candidate documents as described in the last section. This general approach can be viewed as a language model version of query expansion and cluster-based retrieval (described in Van Rijsbergen, 1979). Figure 1 gives an overview of the whole process.

 


Figure 1: Overview of retrieval process with user- and topic-based language models
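The first step of this process, comparing user models and topic models to the query, might be sketched as follows. The sketch assumes each model is a unigram distribution and scores it by smoothed query likelihood; the smoothing weight, the background probabilities, the toy models, and the `top_k` policy for deciding how a user model is used are all illustrative assumptions rather than the method the project will adopt.

```python
import math

def query_log_likelihood(query_terms, model, background, lam=0.5):
    """log P(query | model), smoothing each term with a background model."""
    score = 0.0
    for term in query_terms:
        p = lam * model.get(term, 0.0) + (1 - lam) * background.get(term, 1e-6)
        score += math.log(p)
    return score

def rank_models(query_terms, models, background):
    """Return model names ordered from best to worst match with the query."""
    return sorted(
        models,
        key=lambda name: query_log_likelihood(query_terms, models[name], background),
        reverse=True,
    )

def role_of_user_model(ranking, user_model_name, top_k=1):
    """Hypothetical policy: a top-ranked user model expands the query directly;
    a lower-ranked one only helps choose among or adjust topic models."""
    if ranking.index(user_model_name) < top_k:
        return "augment query"
    return "select or augment topic models"

# Invented probabilities, for illustration only.
background = {"asian": 0.01, "economy": 0.01, "stock": 0.01, "markets": 0.01}
models = {
    "user:corruption": {"corruption": 0.4, "dams": 0.3, "asian": 0.3},
    "topic:government-intervention": {
        "asian": 0.2, "economy": 0.3, "stock": 0.2, "markets": 0.2, "government": 0.1,
    },
    "topic:us-impact": {"asian": 0.1, "economy": 0.2, "us": 0.5, "trade": 0.2},
}

query = ["asian", "economy", "stock", "markets"]
ranking = rank_models(query, models, background)
print(ranking)
print(role_of_user_model(ranking, "user:corruption"))
```

In this toy setting the small user model shares few terms with the query and therefore ranks below both topic models, which is the situation in which it would be used to select between topic models rather than to expand the query.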

C.4 Evaluation

In this proposal, we have described a number of research issues that will be addressed by a combination of development of the underlying probabilistic framework based on language models, development of techniques for eliciting, representing and using topic and user models, and experimentation. In this section, we outline the general structure of the project, the experiments that will be carried out and how they are related to one another, and the resources that will be needed.

We propose to combine the expertise of the University of Massachusetts group in information retrieval systems and algorithms, in evaluation of information retrieval techniques, and in language modeling for information retrieval, with that of the Rutgers University group in developing user models, in conducting exploratory user studies, and in experimental evaluation of interactive information retrieval systems. Work at the two institutions will begin in parallel, with the initial studies at the University of Massachusetts being concerned with development of the language model framework, and using it to generate corpus-based topic models, and those at Rutgers being concerned with investigating various techniques for elicitation of data for user models. The results of this initial series of studies will be integrated, and used as the basis of the second series of experiments and related studies. At the University of Massachusetts, these will be concerned with testing the "goodness" of topic models, and at Rutgers, they will be concerned with evaluating the presentation and use of topic and user models in interactive information retrieval. The results of these studies will then be again integrated, to be the basis for evaluation of topic and user models in a large-scale operational environment at the University of Massachusetts, and for experimental evaluation of the effectiveness of the representation and comparison techniques at Rutgers.

The first series of experiments involves the topic model research described in section C.2. To reiterate, the major issues here are:

We plan to study these issues with experiments based on the TREC corpora, queries, and relevance judgements. Except for the presentation techniques, all of these issues can be resolved using the standard recall-precision type of experiment to measure the relative effectiveness of different algorithms and language model representations. As baselines for these experiments, we have the INQUERY retrieval system, which is a very effective implementation of an older probabilistic framework (Turtle and Croft, 1992), the Local Context Analysis query expansion technique (Xu and Croft, 1996), and the document-based language model approach implemented by Ponte (1998). We plan to implement a new language model-based system to incorporate document-, topic-, and user-based language models and to facilitate experiments comparing alternative approaches to retrieval.
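For readers less familiar with recall-precision experiments, the sketch below computes the recall-precision points and the uninterpolated average precision for a single ranked list. The document identifiers and relevance judgements are hypothetical, and a real experiment would average such figures over a full TREC query set with standard evaluation tools rather than this simplified code.

```python
def precision_recall_points(ranking, relevant):
    """Return (recall, precision) at each rank where a relevant document appears."""
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points

def average_precision(ranking, relevant):
    """Uninterpolated average precision; unretrieved relevant documents count as 0."""
    if not relevant:
        return 0.0
    points = precision_recall_points(ranking, relevant)
    return sum(precision for _, precision in points) / len(relevant)

# Hypothetical system output and judgements for one query.
ranking = ["d3", "d1", "d7", "d2", "d5"]
relevant = {"d1", "d2", "d9"}

print(precision_recall_points(ranking, relevant))
print(average_precision(ranking, relevant))
```

Comparing two algorithms or language model representations then amounts to comparing these figures, averaged over the same queries, for the rankings each one produces.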

These experiments do interact and a number of iterations will be required before final conclusions can be reached. For example, consider the issue of the relative effectiveness of clustering algorithms for generating topic models. Although a number of statistics about number of clusters, cluster size, etc. can easily be compared, the quality of topic models generated by K-means, average-link or a probabilistic algorithm cannot be measured directly. Instead, we need to measure the impact of the topic models produced on the overall effectiveness of the system. To do this, we need to decide on a particular form of topic model, the algorithm to select topic models, and the method of incorporating topic models in the retrieval process. In the early stages of the experimental "loop", we will have to make initial choices about a clustering algorithm that produces reasonable models in order to study the other issues. After initial choices have been made of topic model selection and retrieval algorithms, the issue of clustering algorithms can be revisited and studied using overall system effectiveness. This will probably result in changes in the clustering algorithm that will in turn mean that other experimental results will need to be revalidated. This iterative process, although quite complex, is similar in nature to many previous experiments we have carried out.

Studying the effectiveness of presentation and summarization techniques is very different from comparing retrieval algorithms. Although some studies have been done in the TIPSTER program (e.g. Tombros and Sanderson, 1998), the task was quite different from summarizing language models. Initial studies can be carried out simply by comparing the perceived quality of the output, as judged by the researchers. Definitive results, however, will require user studies to determine whether the presentations are indeed useful. A critical part of such user studies is defining the task that the presentation or summarization is supporting. In our case, we have defined the tasks as selecting and/or modifying topic models that are related to a query. Given this task, we can measure the success of the outcome by testing the impact of the topic model choice on retrieval effectiveness. We can also preconfigure the choice of topic models so that some will be known to be correct for a query and compare user choices on that basis.

The research issues that involve user modeling will require more extensive user studies. The main issues in this area are:

The second of these topics can be evaluated using TREC collections. Once appropriate user models have been identified, alternative techniques for augmenting the query, selecting topic models, or augmenting topic models can be compared using standard recall-precision tests. The four techniques discussed for user model acquisition, however, will rely heavily on studies where users will be involved in acquisition dialogues or in modifying previous models. There are three ways to carry out these studies. One involves the definition of tasks and controlled user experiments, as in the TREC Interactive Track (Over, 1998), to test each aspect of the proposed acquisition process. Our experimental studies of user model elicitation will follow the general pattern of the TREC Interactive Track, in which the effectiveness and usability of each method of elicitation and use of user models will be tested in a controlled environment using the standard measures that have been developed by the Rutgers group during their TREC and related studies (cf. Belkin, 1998).

This approach to user studies, however, cannot be the only type that is used to test our ideas. It is, in general, very difficult, in an experimental framework, to have enough users involved who have real information problems, and who will be available over the long periods of time and multiple information seeking episodes that we require. This type of problem results in a lack of critical mass, a lack of feedback, and inconclusive results. Some of the techniques proposed for user model acquisition assume the existence of a number of other user models, and the utility of user models and topic models will be determined by what happens over longer periods of time. In other words, a major issue is whether the knowledge represented in these models can be reused successfully. Some of this can be simulated in experiments, but it is very difficult to include a realistic time factor in laboratory settings.

To augment experimental studies which will be carried out at both U.Mass. and Rutgers, we propose to pursue two other directions. One is exploratory observational studies at Rutgers, of two types. In one, subjects will be recruited to conduct at least part of their ordinary information seeking during the course of a semester using specially constructed interfaces which use the different elicitation techniques, and their experiences will be recorded and analyzed as in previous work of this group (e.g. Koenemann, et al., 1995). These studies will be augmented by more controlled observations of subjects using the various methods with assigned, rather than personally-motivated information problems. Such studies will be used in all three cycles of the project.

The other direction we will pursue will be to exploit the popularity of Web-based systems to attract large numbers of users to a prototype, but useful, implementation of a search engine that would incorporate aspects of our user- and topic-modeling research. The Center for Intelligent Information Retrieval has implemented or supported a number of such systems and we have considerable experience in this area. One candidate system for this research is GovBot, which provides access to government web sites, is heavily used, and is supported entirely at U.Mass. As we learn more about topic models and user models, we plan to incorporate some of the most effective techniques into the GovBot system, monitor the use of these techniques, and provide mechanisms for user feedback. Although this approach does not provide the same type of results as controlled experiments, the large number of users provides a testbed in which there is the possibility of studying the impact of user models over time. This work will be based on the collaborative results from both groups.

In summary, we have proposed a comprehensive approach to evaluating the impact of user- and topic-models on information access. By combining standard recall-precision experiments using TREC data, experimental and observational user studies in laboratory settings, and studies of the impact in operational environments with many users, we expect to significantly advance our understanding of this important area.

D. References

N.J. Belkin, (1980) Anomalous states of knowledge as a basis for information retrieval. Canadian Journal of Information Science, v. 5: 133-143.

N.J. Belkin, (1997) User modeling in information retrieval. Tutorial presented at UM 97, Sixth International Conference on User Modelling. http://www.scils.rutgers.edu/~belkin/belkin.html/

N.J. Belkin, (1998) An overview of results from Rutgers’ investigations of interactive information retrieval. In P.A. Cochrane and E.H. Johnson, eds. Visualizing subject access for 21st century information resources. Champaign-Urbana IL: Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, 45-62.

N.J. Belkin, C.L. Borgman, H.M. Brooks, T. Bylander, W.B. Croft, (and eight others) (1987) Distributed expert-based information systems: An interdisciplinary approach. Information Processing and Management, v. 23: 395-410.

N.J. Belkin, H.M. Brooks, and P.J. Daniels, (1987) Knowledge elicitation using discourse analysis. International Journal of Man-Machine Studies, v. 27: 127-144.

N.J. Belkin, C. Cool, W.B. Croft, J.P. Callan, 1993. "The effect of multiple query representations on information retrieval system performance," Proceedings of SIGIR 93, p. 339-346.

N.J. Belkin and W.B. Croft, (1987) Retrieval techniques. Annual Review of Information Science and Technology, v. 22: 109-146.

N.J. Belkin and W.B. Croft, (1992) Information filtering and information retrieval: Two sides of the same coin? Communications of the ACM, v. 35, no. 12: 29-38.

N.J. Belkin, & B.H. Kwasnik, (1986) Using structural representations of anomalous states of knowledge for choosing document retrieval strategies. In Proceedings of the 1986 ACM Conference on Research and Development in Information Retrieval. Pisa: The Conference, 11-22.

N.J. Belkin, P.G. Marchetti, & C. Cool, (1993) BRAQUE: Design of an interface to support user interaction in information retrieval. Information Processing and Management, v. 29: 325-344.

N.J. Belkin, R.N. Oddy, H.M. Brooks, 1982. "ASK for information retrieval: Part I: Background and theory; Part II: Results of a design study." Journal of Documentation, 38(2-3), p. 61-71; 145-164.

N.J. Belkin, J. Perez Carballo, C. Cool, S. Lin, S.Y. Park, S.Y. Rieh, P. Savage, C. Sikora, H. Xie, and J. Allan, (1998) Rutgers TREC-6 interactive track experience. In E. Voorhees and D. Harman, eds. The Sixth Text Retrieval Conference (TREC-6). Washington, D.C.: GPO, 597-610.

N.J. Belkin, T. Seeger, and G. Wersig, (1983) Distributed expert problem treatment as a model for information system analysis and design. Journal of Information Science, v. 5: 153-167.

N.J. Belkin, et al. (1996) Using relevance feedback and ranking in interactive information retrieval. In E. Voorhees and D. Harman, eds. The Fourth Text Retrieval Conference (TREC-4). Washington, D.C.: GPO, 181-210.

G. Brajnik, G. Guida, C. Tasso, 1987. "User modeling in intelligent information retrieval", Information Processing and Management, 23(4), p. 305-320.

G. Brajnik, G. Guida, & C. Tasso, (1990) User modelling in expert man-machine interfaces: a case study in intelligent information retrieval. IEEE Transactions on Systems, Man and Cybernetics, v. 20: 166-185.

A. Bookstein, D. Swanson, 1976. "Probabilistic models for automatic indexing." Journal of the American Society for Information Science, 25(5), p. 312-318.

H.M. Brooks, 1987. "Expert systems and intelligent information retrieval." Information Processing and Management, 23(4), p. 367-382.

W.B. Croft, R.H. Thompson, 1987. "I3R: A new approach to the design of document retrieval systems." Journal of the American Society for Information Science, 38(6), p. 389-404.

W.B. Croft, T.J. Lucia, J. Cringean, P. Willett, 1989. "Retrieving documents by plausible inference: An experimental study," Information Processing and Management, 25, p. 599-614.

W.B. Croft and R. Das, 1990. "Experiments with query acquisition and use in document retrieval systems." Proceedings of ACM SIGIR ’90, p. 349-365.

W.B. Croft, H. Turtle, D. Lewis, 1991. "The use of phrases and structured queries in information retrieval," Proceedings of SIGIR 91, p. 32-45.

P.J. Daniels, H.M. Brooks, N.J. Belkin, 1985. "Using problem structures for driving human-computer dialogues". Proceedings of RIAO ’85, p. 131-149.

E.A. Fox, 1987. "Development of the CODER system: A testbed for artificial intelligence methods in information retrieval." Information Processing and Management, 23(4), p. 341-366.

N. Fuhr, 1989. "Models for retrieval with probabilistic indexing." Information Processing and Management, 25(1).

S.P. Harter, 1975. "A probabilistic approach to automatic keyword indexing." Journal of the American Society for Information Science, 24.

A. Kobsa & W. Wahlster, eds. (1989) User models in dialog systems. Berlin: Springer Verlag.

J. Koenemann, (1996) Relevance feedback: Usage, usability, utility. Ph.D. Dissertation, Department of Psychology, Rutgers University, New Brunswick, NJ.

J. Koenemann, R. Quatrain, C. Cool, & N.J. Belkin, (1995) New tools and old habits: The interactive searching behavior of expert online searchers using INQUERY. In D. Harman, ed. The Third Text Retrieval Conference (TREC-3). Washington, D.C.: GPO, 145-177.

A. Leouski, J. Allan, 1997. "Evaluating a visual navigation system for a digital library," Proceedings of the Second European Conference on Research and Technology for Digital Libraries.

C. Manning, H. Schutze, 1999. Foundations of Statistical Natural Language Processing. MIT Press.

R.N. Oddy, (1977) Information retrieval through man-machine dialogue. Journal of Documentation, v. 33: 1-14.

P. Over, (1998) TREC-6 interactive report. In E. Voorhees and D. Harman, eds. The Sixth Text Retrieval Conference (TREC-6). Washington, D.C.: GPO, 73-82.

J. Ponte, W.B. Croft, 1998. "A Language Modeling Approach to Information Retrieval." Proceedings of the 21st International Conference on Research and Development in Information Retrieval, p. 275-281.

J. Ponte, 1998. A Language Modeling Approach to Information Retrieval. Ph.D. thesis, Computer Science Department, University of Massachusetts.

P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, J. Riedl. 1994. "GroupLens: an open architecture for collaborative filtering of netnews." Proceedings of CSCW, p. 175-186.

E. Rich (1979) User modelling via stereotypes. Cognitive Science, v. 3: 329-354.

E. Rich, 1979. Building and Exploiting User Models. Ph.D. Thesis, Carnegie-Mellon University Technical Report No. CMU-CS-79-119.

S.E. Robertson, K. Sparck Jones, 1977. "Relevance weighting of search terms." Journal of the American Society of Information Science, 27.

S.E. Robertson, S. Walker, 1994. "Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval." Proceedings of ACM SIGIR ’94, p. 232-241.

G. Salton, M. McGill, 1983. Introduction to Modern Information Retrieval. McGraw-Hill.

G. Salton, C. Buckley, 1990. "Improving retrieval performance by relevance feedback." Journal of the American Society for Information Science, 41(4), p. 288-297.

B.W. Silverman, 1985. Density Estimation for Statistics and Data Analysis. John Wiley and Sons.

A. Tombros, M. Sanderson, 1998. "Advantages of query-biased summaries in information retrieval." Proceedings of ACM SIGIR ’98, p. 2-10.

H.R. Turtle, W.B. Croft, 1992. "A comparison of text retrieval models." Computer Journal, 35(3), p. 279-290.

C.J. Van Rijsbergen, 1979. Information Retrieval, Second edition, Butterworths, London.

S.K.M. Wong, Y. Yao, 1989. "A probability distribution model for information retrieval." Information Processing and Management, 25(1), p. 39-53.

J. Xu, W.B. Croft, 1996. "Query expansion using local and global document analysis," Proceedings of ACM SIGIR '96, p. 4-11.

J. Xu, W.B. Croft, 1999. "Cluster-based language models for distributed retrieval." Submitted to ACM SIGIR ’99.

J. Yamron, 1997. "Topic detection and tracking segmentation task." Proceedings of the DARPA Topic Detection and Tracking Workshop.
