In basing a theory of evaluation on the theory of measurement, is it possible to devise a measure of effectiveness not starting with precision and recall but simply with the set of relevant documents and the set of retrieved documents? If so, can we generalise such a measure to take account of degree of relevance? An alternative derivation of an E-type measure could be done in terms of recall and fallout.
Is there any advantage to doing this?
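To make the first question concrete, here is a minimal sketch of one candidate measure defined directly on the two sets: the symmetric difference between the relevant and retrieved sets, normalised by the sum of their sizes. When precision and recall are weighted equally, the E measure reduces to this form, so the sketch illustrates the idea rather than proposing a new measure. The function names, the convention for empty sets, and the `collection_size` parameter are assumptions made for illustration. Recall and fallout are computed alongside, since an alternative derivation would start from that pair.

```python
def set_based_e(relevant, retrieved):
    """Normalised symmetric difference between the relevant and retrieved sets.

    0.0 means the two sets coincide; 1.0 means they are disjoint.
    """
    relevant, retrieved = set(relevant), set(retrieved)
    total = len(relevant) + len(retrieved)
    if total == 0:
        # Assumed convention: nothing relevant and nothing retrieved is no error.
        return 0.0
    return len(relevant ^ retrieved) / total


def recall_fallout(relevant, retrieved, collection_size):
    """Return (recall, fallout) for a collection of the given size."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)
    non_relevant = collection_size - len(relevant)
    recall = hits / len(relevant) if relevant else 0.0
    fallout = (len(retrieved) - hits) / non_relevant if non_relevant else 0.0
    return recall, fallout


if __name__ == "__main__":
    relevant = {1, 2, 3, 4}
    retrieved = {3, 4, 5, 6, 7, 8}
    print(set_based_e(relevant, retrieved))
    print(recall_fallout(relevant, retrieved, collection_size=104))
```

Because the definition uses only set membership, a generalisation to degrees of relevance would have to replace the set sizes by sums of relevance weights; that step is not attempted here.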
Up to now the measurement of effectiveness has proved fairly intractable to statistical analysis.
This has been mainly because no reasonable underlying statistical model can be found; that is not to say, however, that one does not exist!*
There may be 'laws' of retrieval, such as the well-known trade-off between precision and recall, that are worth establishing either empirically or by theoretical argument.
It has been shown that the trade-off does in fact follow from more basic assumptions about the retrieval model.
Similar arguments are needed to establish the upper bounds to retrieval under certain models.
There is a need for more intensive research into the problems of what to use to represent the content of documents in a computer.
Information retrieval systems, both operational and experimental, have been keyword based.
Some have become quite sophisticated in their use of keywords; for example, they may include a form of normalisation and some sort of weighting.
Some use distributional information to measure the strength of relationships between keywords or between the keyword descriptions of documents.
The limit of our ingenuity with keywords seemed to have been reached when a few semantic relationships between words were defined and exploited.
The major reason for this rather simple-minded approach to document retrieval is a very good one.
Most of the experimental evidence over the last decade has pointed to the superiority of this approach over the possible alternatives.
Nevertheless there is room for more spectacular improvements.
It seems that at the root of retrieval effectiveness lies the adequacy (or inadequacy) of the computer representation of documents.
No doubt this was recognised to be true in the early days, but attempts at that time to move away from keyword representation met with little success.
Despite this, I would like to see research in IR take another good look at the problem of what should be stored inside the computer.
* I think the Robertson model described in Chapter 7 goes some way towards being a reasonable statistical model.