White Paper
Paul B. Kantor
Rutgers
This note addresses some complexities which cannot be avoided, and suggests a way to approach them with some logical coherence. It is not intended to oppose the "poor but simple" thrust, but to help us focus attention on what aspects of the evaluation problem should be respected on the road to honest poverty.
Scenarios.
For an evaluator, scenarios play the same role that the "market basket" plays for the economist. Our basket of scenarios needs to contain some examples of each of the "economic necessities" of users of digital libraries. This raises two problems: (1) identifying the necessities, as abstract classes, and (2) deciding which specific instances should be used to represent each of the important classes. Whatever method of scoring performance is eventually selected, it must be applicable to elements from each class independently. The problem of producing a single overall measure can then be "evaded directly" by encouraging each user of the evaluation process to select a rule of combination (for example, a weighted sum of the scores on several types of tasks) with a form (the specific weights) which reflects the importance of the corresponding benefits and costs to that particular user of the evaluation. The MWG may of course suggest some typical baskets and weights representing, say, an elementary school teacher, an industrial spy, or a disaster relief planner. I am working with Linda Hill of UCSB to develop a classification scheme for GIS tasks, which already has some six distinct dimensions.
It is important, however, that the suite of tasks retain its essential "vectorial" character, so that the basic elements of score are available to those who would reweight them in a different way. This ensures that the effort of measurement, which will be substantial, is not lost in some simple condensation of all that is known.
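The idea of keeping the score vector and letting each user supply the weights can be sketched in a few lines of Python. This is only an illustration: the task classes, the scores, and the two weight profiles are hypothetical values invented for the example, and a normalized weighted sum is just one possible rule of combination.

```python
# Hypothetical per-class scores for one system; class names and numbers
# are invented for illustration and are not taken from any real evaluation.
scores = {"known_item_search": 0.8, "browsing": 0.5, "map_overlay": 0.6}

# Hypothetical weight profiles for two kinds of users of the evaluation.
profiles = {
    "teacher": {"known_item_search": 0.2, "browsing": 0.6, "map_overlay": 0.2},
    "relief_planner": {"known_item_search": 0.3, "browsing": 0.1, "map_overlay": 0.6},
}


def combined_score(score_vector, weights):
    """Weighted sum of the per-class scores, normalized by the total weight."""
    missing = set(weights) - set(score_vector)
    if missing:
        raise KeyError(f"no score recorded for: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[k] * score_vector[k] for k in weights) / total_weight


if __name__ == "__main__":
    # The same score vector yields a different overall figure for each profile.
    for name, weights in profiles.items():
        print(name, round(combined_score(scores, weights), 3))
```

Because the vector of scores is stored separately from the weights, a new user can reweight the same measurements without repeating any of the measurement effort.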
It is important to seek (if there are any) the ceteris paribus monotonic contributors to the overall "goodness" of a system. In the language of Orr, these are parts which, when done better, can only make the overall system do "more good". A natural candidate would be improvements in the speed of algorithms which accomplish the same calculation on the same processor. Perhaps, from the human side, improvements in monitor resolution, at no added cost, would fall into the same category. On the other hand, squeezing more information onto the screen might or might not be an improvement from the user’s point of view.
Measurement and observation.
Much current discussion of digital libraries, in their working environments, emphasizes the importance of "observation". Observation is quite open-ended, and can lead wherever the skills and insights of the investigator are able to take it [Belkin and collaborators at Rutgers]. It is the essence of qualitative method, when done well. Observation is also an essential step in measurement. In fact it occurs twice: once in deciding what to measure, and once in making the observations that provide that measure.
A central issue in developing measures for digital libraries is that "performance", viewed even as an abstraction, is not a "quality" but a "relation". For example, the acceleration of a sports car is given in terms of the number of seconds it takes to reach 60 mph, but this is simply one point on the curve of speed versus time (or distance). The reality is more complex. Operating manuals for airplanes give several key speeds for climbing, corresponding to maximum rise versus distance, maximum rise versus time, and maximum rise versus fuel expended. And an airplane (at least a Cessna 172) is a good deal simpler than a digital library.
At a minimum, I suggest that we approach each dimension by looking for a pair of variables whose relation, over the entire range of likely use, characterizes the performance of a library, or of a component of that library. Generically, we can call these "Effect versus Effort" curves. If the system is "efficient" in a certain natural sense, these curves will be concave, as shown in the example below. That is, the greatest benefit comes early, and further benefit is available for those who wish to persist into regions of diminishing returns.
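One way to make the concavity condition concrete is to check that the slope between successive measured points never increases as effort grows. The sketch below assumes the curve is available as a list of (effort, effect) pairs with distinct effort values; the function names and the sample numbers are hypothetical.

```python
def segment_slopes(points):
    """Slopes of the straight segments joining successive (effort, effect) points."""
    pts = sorted(points)
    efforts = [x for x, _ in pts]
    if len(set(efforts)) != len(efforts):
        raise ValueError("effort values must be distinct")
    return [(y2 - y1) / (x2 - x1) for (x1, y1), (x2, y2) in zip(pts, pts[1:])]


def is_concave(points, tol=1e-12):
    """True if each segment is no steeper than the one before it,
    i.e. the marginal effect per unit of effort never increases."""
    slopes = segment_slopes(points)
    return all(later <= earlier + tol for earlier, later in zip(slopes, slopes[1:]))


if __name__ == "__main__":
    # Hypothetical measurements: large early gains, then diminishing returns.
    diminishing = [(0, 0), (1, 5), (2, 8), (4, 10)]
    # Hypothetical measurements: gains that accelerate with effort.
    accelerating = [(0, 0), (1, 1), (2, 4)]
    print(is_concave(diminishing))
    print(is_concave(accelerating))
```

With real data, measurement noise will produce small violations, so a tolerance larger than the default, or a fitted smooth curve, would probably be needed.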
The Human in the Loop
For measuring how a system interacts with the human in the loop, this concept specializes to selecting a particular measurable definition of the effect, and of the effort. If the users are "sufficiently homogeneous" then the time they must expend is probably a good surrogate for effort. For effect we hope to go beyond a naive counting of the number of retrieved items that might be judged, by some experts, to suit the stated purpose or problem. Ideal would be surrogates for an end-to-end approach, in which there is some task to be performed, and the measure of effect corresponds to seeing how much the digital library advances that task, or how well the task is performed.
It is also possible to devise inverse schemes (as were used in the TREC5 Confusion Track) in which the task is to find a known (in that setting, presumably "vaguely remembered") document. The measure of effort (for a basket of tasks) is the total effort required of the user. The measure of effect is the total number of target documents found. [See Voorhees and Kantor, TREC5 proceedings]. With a body of representative users, it is possible to convert the end-to-end approach into a group of known item tests. That is, we can develop estimates of how well the task is done when item I is found, for each of several useful items I, and then track the effort required to unearth each of them. We must note that items will, in most real cases, be not single documents, but sets of documents which together support a better solution to the task problem.
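A known item test of this kind can be turned into points on an effect versus effort curve. The sketch below is one possible reading, not the procedure used in the TREC5 track: it assumes that each task records the effort at which the target was found (or None if it never was), and that the user abandons a task once a per-task effort budget is exhausted. The function name, the budget rule, and the sample numbers are all hypothetical.

```python
def known_item_curve(efforts_to_find, budgets):
    """Return (total_effort, targets_found) pairs, one per per-task budget.

    efforts_to_find: effort spent before each target was found, or None
                     if the target was never found.
    budgets:         per-task effort limits at which the user gives up.
    """
    curve = []
    for budget in sorted(budgets):
        total_effort = 0
        found = 0
        for effort in efforts_to_find:
            if effort is not None and effort <= budget:
                total_effort += effort   # stopped as soon as the target appeared
                found += 1
            else:
                total_effort += budget   # spent the whole budget without success
        curve.append((total_effort, found))
    return curve


if __name__ == "__main__":
    # Hypothetical basket of four known-item tasks, effort in minutes.
    efforts = [2, 5, None, 1]
    for total, found in known_item_curve(efforts, budgets=[1, 3, 6]):
        print(f"total effort {total:>2} min -> {found} targets found")
```

To move toward the end-to-end approach, the count of targets found could be replaced by a sum of estimated task values for the items found, provided such estimates are available.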
When it becomes necessary to compare several different systems, each of which has an "operating curve" as shown in the figure, there are many issues to be decided. Among them are: micro-averaging versus macro-averaging; counting of cases (which has a tractable statistical basis) versus summative averaging (which doesn’t), as well as the issues about weighting mentioned above. These are troublesome, but they are manageable in simple spreadsheet-type layouts, which permit various users of the evaluation to select among schemes for comparison. An interesting wrinkle is to consider the relation among the three lines shown in the graph. The lowest represents the performance of a "mindless" system. The highest represents an ideal system. The middle one represents the performance of this system. The ratio of the length of the short arrow to the long one is a clean statement of how far the system carries us from mindlessness to excellence. In the spirit of Data Envelopment Analysis, one might change the slope of the lines to make each system look as good as possible. Some systems will thus emerge as "non-dominated" or "Pareto Optimal" while others simply are not best in any situation.
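The arrow ratio and the notion of a non-dominated system can both be sketched briefly. The code assumes that the short arrow runs from the mindless line to the system's line and the long arrow from the mindless line to the ideal line, both read at the same level of effort. The second function is a plain Pareto dominance filter over ratios taken at several effort levels; it is a simpler stand-in for Data Envelopment Analysis, not an implementation of it. All names and numbers are hypothetical.

```python
def progress_ratio(system, mindless, ideal):
    """Fraction of the distance from mindless to ideal performance
    that the system covers, all three read at the same effort level."""
    if ideal <= mindless:
        raise ValueError("ideal performance must exceed mindless performance")
    return (system - mindless) / (ideal - mindless)


def non_dominated(systems):
    """Names of systems that no other system matches or beats at every
    effort level while beating it strictly at one level or more."""
    def dominates(a, b):
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    return [
        name
        for name, ratios in systems.items()
        if not any(
            dominates(other_ratios, ratios)
            for other, other_ratios in systems.items()
            if other != name
        )
    ]


if __name__ == "__main__":
    print(progress_ratio(system=6, mindless=2, ideal=10))

    # Hypothetical progress ratios at a low and a high effort level.
    systems = {"A": (0.5, 0.9), "B": (0.7, 0.6), "C": (0.4, 0.5)}
    print(non_dominated(systems))
```

In this made-up example, A is better for persistent users and B for impatient ones, so both survive, while C is never the best choice.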
Distributed Systems
The curve of effect versus effort can also be defined, perhaps more clearly, for assessing the effectiveness of distributed systems. In this case there is a natural "gold standard", which is the fully integrated system. So the "effect" axis can be measured as "fraction of the integrated system performance which is achieved". The effort axis is still complicated, as it must combine the cost, processor requirements, bandwidth, management, stability, etc. of the distributed scheme. An example of this kind of measure can be applied to the distributed indexing schemes discussed by Dolan et al. at UCSB.