Rutgers' TREC-8 Interactive Track Experience


Authors, Affiliations


Abstract


  1. Introduction


Continuing our program of studying different methods of query expansion in interactive information retrieval (IR), this year our group investigated the effects of varying methods of term suggestion for user-controlled query expansion. The two methods that we compared were user control over suggested terms, implemented as positive relevance feedback (RF), versus magical term suggestion, implemented as a form of Local Context Analysis (LCA). We chose these two since they exemplify the two major methods of interactive query expansion. The effects that we were most interested in were in terms of user preference, usability (as indicated by effort), and effectiveness in task performance.


Previous investigations by us (e.g. Koenemann, 1996) and others (e.g. ) have indicated that users of IR and similar systems generally prefer to have some measure of control over what the system does for them. This has often been in conjunction with an expressed desire to understand how the system has come to its suggestions/actions. These kinds of results led us to conclude that in interactive IR relevance feedback (RF) is best implemented as a term-suggestion device, rather than as an automatic query expansion device. In TREC-8, we decided to investigate the issues of control and understanding of system operation in more detail, by comparing a system in which users could control (and therefore presumably understand) where system-suggested terms came from (using positive RF), with one in which suggested terms appeared as if by magic. Based on the previous work in this area, we hypothesized that user-controlled term suggestion would be preferred to system-controlled term suggestion.


As do others, we believe that a more usable system is a better system, and further, that a good indicator of usability is the amount of effort (physical, cognitive) that a person has to expend in order to complete a given task. We hypothesized that system-controlled term suggestion would require less effort on the part of the user than user-controlled term suggestion, which asks the user to make relevance judgments in order to get suggested terms. Such a difference is indicated by total time taken to perform the task, by the number of documents that a person looks at or reads, by the amount of use of various system features, and by the extent to which system-suggested terms are incorporated into the queries.


The TREC-8 Interactive Track task of instance identification is one which asks users to identify a number of topically different documents. Since RF is based on the idea of constructing an ever better (e.g. more specific) query, and since RF in interactive IR is typically based on a relatively small number of documents, it seems that RF term suggestion based only on positive relevance judgments is not well suited to this task. We can call such term suggestion directed. However, the terms identified by LCA for query expansion are based on a system-defined set of documents, as well as characteristics of the terms in the collection as a whole. Compared to RF, such term suggestion can be characterized as diffuse. We hypothesized that for the instance recall task, diffuse term suggestion would be more effective than directed term suggestion, and therefore that users would perform better in the LCA system than the RF system. The standard measure of performance in the TREC-8 Interactive Track task is instance recall, defined as the proportion of the instances of a topic identified by the TREC judges that were also identified by the searcher (as indicated by the documents the searcher has saved). Since the task set for the searchers was to identify and save all of the instances of a topic, and since we are interested in developing evaluation measures for interactive IR that do not depend upon external relevance (and related) judgments, we also measured performance according to the number of documents saved, and the number of instances identified.
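
As an illustration of the instance recall measure defined above, the following minimal sketch computes it for a single search. The function name, the data layout (a mapping from document identifiers to the judged instances each document covers), and the example topic are all hypothetical; they are not taken from the TREC evaluation software.

```python
def instance_recall(judged_instances, saved_docs, doc_instances):
    """Proportion of judge-identified instances covered by the saved documents.

    judged_instances: instances of the topic identified by the judges
    saved_docs: identifiers of the documents the searcher saved
    doc_instances: hypothetical mapping, document id -> instances it covers
    """
    judged = set(judged_instances)
    if not judged:
        raise ValueError("topic has no judged instances")
    found = set()
    for doc in saved_docs:
        # Only instances recognised by the judges count towards recall.
        found |= set(doc_instances.get(doc, ())) & judged
    return len(found) / len(judged)


# Hypothetical topic with four judged instances; the searcher saved d1 and d2.
doc_instances = {"d1": {"a", "b"}, "d2": {"b"}, "d3": {"c"}}
recall = instance_recall({"a", "b", "c", "d"}, ["d1", "d2"], doc_instances)
print(recall)  # two of the four instances are covered
```

Because one document can cover several instances, the number of instances found can exceed the number of documents saved, while repeated coverage of the same instance does not raise recall.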


Thus, we suggest that although a term-suggestion feature based on RF might be preferred by users to one which is based on LCA, for reasons of control and understanding, the magical method will require less effort, and will lead to better performance in the instance identification task.

  2. System Descriptions


There were two experimental IR systems used in this study. Both systems used Inquery 3.1p1 with its default values for indexing and retrieval. The sole difference between the two systems lies in the implementation of the term suggestion feature (this leads also to a minor difference in the interfaces).


The first system, called INQ-RF, allowed users to make positive relevance judgments on documents. Inquery's RF function was modified so that it displayed a list of terms for positively judged documents, rather than automatically expanding the query. As users made RF judgments about documents, the top n terms were presented in a term suggestion window. The number of terms displayed was determined by the formula:
n = 5i + 5
in which i is the number of judged documents, and n is no greater than 25. The term ranking algorithm was rdfidf (Haines and Croft, 1993), where rdf is the number of relevant documents in which the term appears, and idf is normalized inverse document frequency as used by Allen (1995) (cf. Belkin, et al., 1999).
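
The term-count rule and the rdf-times-idf ranking described above can be sketched as follows. This is an illustration only, not the modified Inquery code: the function names and data layout are hypothetical, and the idf normalization used here, log(N/df) / log(N), is an assumption, since the exact form used by Allen (1995) is not given in this paper.

```python
import math
from collections import Counter

MAX_TERMS = 25  # cap on the number of displayed terms


def num_suggested_terms(i):
    """n = 5i + 5 for i judged documents, with n no greater than 25."""
    return min(5 * i + 5, MAX_TERMS)


def rank_terms(relevant_docs, doc_freq, num_docs):
    """Rank terms from positively judged documents by rdf * idf.

    relevant_docs: list of term lists, one per positively judged document
    doc_freq: term -> number of collection documents containing the term
    num_docs: collection size N (must be greater than 1)
    """
    # rdf: number of relevant documents in which the term appears.
    rdf = Counter()
    for terms in relevant_docs:
        rdf.update(set(terms))

    def idf(term):
        # ASSUMED normalization, so that idf lies between 0 and 1.
        return math.log(num_docs / doc_freq[term]) / math.log(num_docs)

    scores = {term: rdf[term] * idf(term) for term in rdf}
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:num_suggested_terms(len(relevant_docs))]


# Hypothetical data: two judged documents in a 1000-document collection.
docs = [["tax", "fraud", "the"], ["fraud", "the", "audit"]]
freqs = {"tax": 50, "fraud": 10, "the": 1000, "audit": 20}
print(rank_terms(docs, freqs, 1000))
```

With the cap in place, the list stops growing once four documents have been judged (5 * 4 + 5 = 25).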


The second system, INQ-LCA, employed a slight modification of the technique called Local Context Analysis (LCA) (Xu and Croft, 1996) for term suggestion. LCA combines collection-wide identification of concepts, normally nouns and noun phrases, with co-occurrence analysis of those concepts with query terms in the top n passages retrieved by a query. The concepts are ranked according to a function of the frequencies of occurrence of the query terms and co-occurring concepts in the retrieved passages, and the inverse passage frequencies for the entire collection of those terms and concepts. The top m ranked concepts are then used for query expansion. In our version of LCA, these m (m=25, to match the RF condition) concepts were displayed in a term suggestion window, after each new query. Based on an experiment using the TREC-7 ad hoc task in which we compared performance of automatic LCA query expansion using different values of n and different definitions of passages (with m constant at 25), passage in our study was defined as the whole document, and n was set to 10 (see footnote 1).
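
To make the ranking idea concrete, the following sketch scores candidate concepts by their co-occurrence with query terms in the top n passages, weighted by inverse passage frequency. It is loosely modeled on the description above and should not be read as Xu and Croft's exact function or as the INQ-LCA code: the function name, the damping constant delta, the idf normalization, and the treatment of every token as a candidate concept (rather than nouns and noun phrases) are all assumptions made for illustration.

```python
import math


def lca_rank(query_terms, passages, passage_freq, num_passages, m=25, delta=0.1):
    """Return up to m candidate concepts, best first.

    passages: token lists for the top-n retrieved passages (n >= 2)
    passage_freq: term or concept -> number of collection passages containing it
    num_passages: number of passages in the whole collection
    """
    n = len(passages)
    if n < 2:
        raise ValueError("need at least two retrieved passages")
    # ASSUMPTION: every non-query token is treated as a candidate concept.
    candidates = {tok for p in passages for tok in p} - set(query_terms)

    def idf(x):
        # ASSUMED normalization: scaled log inverse passage frequency, capped at 1.
        return min(1.0, math.log10(num_passages / passage_freq[x]) / 5.0)

    scores = {}
    for concept in candidates:
        score = 1.0
        for term in query_terms:
            # Co-occurrence: product of within-passage frequencies, summed.
            co = sum(p.count(term) * p.count(concept) for p in passages)
            co_degree = math.log(co + 1) * idf(concept) / math.log(n)
            score *= (delta + co_degree) ** idf(term)
        scores[concept] = score
    return sorted(scores, key=lambda c: (-scores[c], c))[:m]


# Hypothetical example: three retrieved passages for the query "tax".
top_passages = [["tax", "fraud", "audit"], ["tax", "fraud"], ["weather", "rain"]]
pf = {"tax": 100, "fraud": 10, "audit": 10, "weather": 10, "rain": 10}
print(lca_rank(["tax"], top_passages, pf, num_passages=100000))
```

In the configuration reported here, a passage would be a whole document, n would be 10, and m would be 25.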


Both systems used the same basic interface, developed at Rutgers, which offers the functions and features described below. Appendix A is a screen dump of the INQ-RF interface. The INQ-LCA interface was identical, except that there were no check boxes to indicate positively judged documents, and no Clear Good Docs button. In both systems, suggested terms could be added to the existing query at the user's discretion.


Both systems ran on a SUN Ultra 140 with 64 MB memory and 9 GB disk under Solaris 2.5.1 with a 17" color monitor.



  3. Description of Study




  4. Results: Descriptive Statistics


The interaction between the searcher and the system was similar for the two systems. The mean number of documents retrieved during a search topic was almost identical for LCA (M = 653.59) and RF (M = 655.49). The number of iterations (queries) in a search was roughly the same for the two systems (LCA M=5.93, RF M=5.76). Consistent with the data for documents retrieved, mean number of unique titles displayed for a search topic was also equivalent for LCA and RF (123.53 and 128.81, respectively). The searchers viewed the full text of almost 20% of the unique document titles displayed in each system (LCA M = 21.88 and RF M = 25.41). The similarity between the systems, in terms of interaction, was demonstrated by the number of instances identified and documents saved. The mean number of instances identified by a searcher for a particular topic was 9.36 for LCA and 9.75 for RF. The mean cumulative number of documents saved was 8.66 for LCA and 8.69 for RF. Given that multiple instances could be found in one document, more instances were identified than documents saved in both systems. Although the total number of suggested terms was similar for LCA and RF (M = 126.61 and M = 113.03, respectively), the number of unique suggested terms provided by LCA and RF differed substantially (M = 62.73 and M = 22.81, respectively). System errors generally did not occur in either system (LCA M = .01 and RF M = .01).


Feature use is another aspect of system interaction. Feature use was fairly comparable in the two systems with the exception of the additional actions required to obtain suggested terms in the RF system. In the RF system, the average number of documents that were identified as relevant and used to generate suggested terms was 2.85 per search topic. Searchers generally did not use the feature to clear the suggested terms list or uncheck a document as relevant (M = .21 and M = .74, respectively). On average for both systems, searchers cleared the query window no more than one time per search topic (LCA M = .95 and RF M = .87). Searchers generally did not change their mind and ‘unsave’ a document in either system (LCA M = .17 and RF M = .18). Across the two systems, searchers used similar amounts of paging-style scrolling (LCA M = 22.26 and RF M = 25.44) and dragging-style scrolling (LCA M = 1.36 and RF M = 1.80). Although the total number of unique terms used in the query by the searcher was similar for LCA and RF (M = 10.0 and M = 8.91, respectively), the average number of suggested terms that the user selected to use in their query was very different (LCA M = 4.41 and RF M = 1.87). Overall, these descriptive statistics demonstrate similar interactions for the two systems.



  5. Results: Preference, Effort, Performance


Hypothesis 1: User-control (RF) will be preferred to system-control (LCA).


System preference was measured by subjective response to the following question:
“Which of the systems did you like best overall?” System preference was distributed roughly evenly across the RF (39%), LCA (31%) and no difference (31%) categories.

There were no significant effects of order in which the systems were used or of performance in the task on this result. Hypothesis 1 is thus rejected.


Hypothesis 2: LCA will require less effort than RF.


There was little difference in subjective responses to questions intended to measure effort on the two systems. When asked which system they found easier to learn to use, seventy-five percent of subjects indicated that there was ‘no difference’ between the two systems. The remainder of the subjects were closely split between preferences for the two systems (LCA = 14% and RF = 11%). When asked which system was simply easier to use, fifty percent of subjects expressed no preference for one system over the other. The other fifty percent were again closely divided in their preferences between the two systems (LCA = 22% and RF = 28%). When the question focused on the ease of using the systems' term suggestion feature, only twenty-five percent indicated no clear preference. Of those searchers who had a preference, LCA's term suggestion feature was indicated as preferred more often (LCA = 42% and RF = 33%). There was no system order effect on these results.


The effort associated with interacting with the two systems was similar based on the use of features, number of iterations (queries), and the viewing of items. Neither paging-style scrolling nor dragging-style scrolling yielded significant differences between the two systems [t(214) = -1.26, ns and t(214) = -1.06, ns, respectively]. The number of iterations (queries) in a search was roughly the same for the two systems (LCA M=5.93, RF M=5.76). The difference between the two systems was also not significant for total number of documents viewed, total number of unique documents viewed, total number of titles displayed and total number of unique titles displayed [t(214) = -1.71, ns; t(214) = -1.68, ns; t(214) = -1.14, ns; t(214) = -.59, ns; respectively].
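
For readers unfamiliar with the t(df) notation used above, the sketch below shows how a two-sample t statistic and its degrees of freedom are obtained. The text does not state which test was used; 214 degrees of freedom would be consistent with an independent-samples test with pooled variance over two groups of 108 searches each, and that reading is assumed here. The function name and the data are hypothetical and are not the study's data.

```python
import math
from statistics import mean, variance


def pooled_t(sample_a, sample_b):
    """Independent-samples t statistic with pooled variance.

    Returns (t, df), where df = len(sample_a) + len(sample_b) - 2.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * variance(sample_a)
                  + (n_b - 1) * variance(sample_b)) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / std_err, df


# Hypothetical per-search counts for two small groups of searches.
t, df = pooled_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(f"t({df}) = {t:.2f}")
```

A negative t indicates that the first group's mean is lower than the second's; whether the difference is significant depends on comparing t against the t distribution with df degrees of freedom.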


The total number of query terms used in a single query was roughly equivalent regardless of the system the user was using (LCA M = 10.0, RF M = 8.91). However, the way in which the terms were acquired for use in the query did vary across systems. The number of suggested query terms selected by the user was significantly higher when using the LCA system compared to the same users searching on the RF system, t(214) = 4.50, p < .001. The number of terms entered into the query by the user, those not selected from the suggested terms list, was significantly higher for RF than LCA, t(214) = 2.04, p < .05. This suggests that in the RF system users spent more effort generating terms themselves, while in the LCA system users spent less effort thinking of terms and selected more terms from those provided.


Based on the measure of effort defined as having to think of good query terms, hypothesis 2 is accepted.


Hypothesis 3: LCA (diffuse term suggestion) will be more effective than RF (directed term suggestion).


Performance was measured by instance recall, number of instances identified and number of documents saved. The total instance recall for the two systems was close (LCA M=.24, RF M=.26), as was the number of instances (LCA M=9.36, RF M=9.75). For number of documents saved, subjects' performance was almost identical (LCA M=8.48, RF M=8.49). None of these differences was significant [t(214) = -.69, ns; t(214) = .34, ns; t(214) = -.06, ns; respectively], which suggests that the effectiveness of the two systems is similar.


There was little difference in the subjective response to a question intended to measure effectiveness of the two systems. When asked which of the systems' terms they found more effective, forty-two percent of subjects indicated that RF suggested more helpful terms, thirty-three percent of subjects indicated that LCA suggested more helpful terms and twenty-five percent of subjects indicated that there was no difference in the helpfulness of the terms suggested by the two systems. As might be expected, there was a significant correlation between subjects' performance in a system and their perception of effectiveness.


Hypothesis 3 is thus rejected.



  6. Discussion and Conclusions


Two out of three of our hypotheses were rejected, and the third was supported based on only one measure out of several. What should we make of these results?




Footnote 1: David Harper has pointed out to us that there is an inconsistency in our using the ad hoc task in these experiments, since that task is quite different from the instance recall task, especially in ways that might be relevant to choice of number of passages to be examined.