So far in this book we have made very little use of probability theory in modelling any sub-system in IR.
The reason for this is simply that the bulk of the work in IR is non-probabilistic, and it is only recently that some significant headway has been made with probabilistic methods [1, 2, 3].
The history of the use of probabilistic methods goes back as far as the early sixties, but for some reason the early ideas never took hold.
In this chapter I shall be describing methods of retrieval, i.e. searching and stopping rules, based on probabilistic considerations.
In Chapter 2 I dealt with automatic indexing based on a probabilistic model of the distribution of word tokens within a document (text); here I will be concerned with the distribution of index terms over the set of documents making up a collection or file.
I shall be relying heavily on the familiar assumption that the distribution of index terms throughout the collection, or within some subset of it, will tell us something about the likely relevance of any given document.
Perhaps it is as well to warn the reader that some of the material in this chapter is rather mathematical.
However, I believe that the framework of retrieval discussed in this chapter is both elegant and potentially extremely powerful*.
Although the work on it is of recent origin, and some may therefore feel that it has yet to stand the test of time, I think it probably represents the most important breakthrough in IR in the last few years.
Therefore I unashamedly make this chapter theoretical, since the theory must be thoroughly understood if any further progress is to be made.
There are a number of equivalent ways
* This was recognised by Maron in his 'The Logic Behind a Probabilistic Interpretation' as early as 1964.