There are five commonly used measures of association in information retrieval.
Since in information retrieval documents and requests are most commonly represented by term or keyword lists, I shall simplify matters by assuming that an object is represented by a set of keywords and that the counting measure | . | gives the size of the set.
We can easily generalise to the case where the keywords have been weighted, by simply choosing an appropriate measure (in the measure-theoretic sense).
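To make the generalisation concrete, the following is a minimal sketch of swapping the counting measure for a weighted one. The function names, the example keywords, the weight values, and the default weight of 1.0 for unlisted keywords are all assumptions made for illustration; they do not come from the text.

```python
def counting_measure(keywords):
    """Counting measure |.|: every distinct keyword contributes 1."""
    return len(set(keywords))


def weighted_measure(keywords, weights):
    """Weighted measure: every distinct keyword contributes its weight.

    Keywords missing from `weights` default to 1.0, an assumption of this
    sketch, so an empty weight table reduces to the counting measure.
    """
    return sum(weights.get(k, 1.0) for k in set(keywords))


# Hypothetical document representation and weights (illustrative values only).
doc = {"retrieval", "index", "term"}
weights = {"retrieval": 2.0, "index": 0.5}

print(counting_measure(doc))
print(weighted_measure(doc, weights))
```

Any association measure written in terms of | . | could then be evaluated with the weighted measure in its place, provided the weights are non-negative.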
The simplest of all association measures is
|X ∩ Y|    Simple matching coefficient
which is the number of shared index terms.
This coefficient does not take into account the sizes of X and Y.
The following coefficients, which have been used in document retrieval, take into account the information provided by the sizes of X and Y.
These may all be considered to be normalised versions of the simple matching coefficient.
Failure to normalise leads to counter-intuitive results, as the following example shows:
Let S1(X, Y) = |X ∩ Y| and let S2 be a normalised version of it;
then |X1| = 1, |Y1| = 1, |X1 ∩ Y1| = 1  =>  S1 = 1, S2 = 1
     |X2| = 10, |Y2| = 10, |X2 ∩ Y2| = 1  =>  S1 = 1, S2 = 1/10
Thus S1(X1, Y1) = S1(X2, Y2), which is clearly absurd since X1 and Y1 are identical representatives whereas X2 and Y2 are radically different.
The normalisation for S2 scales it between 0 and 1, maximum similarity being indicated by 1.
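The example can be checked numerically. The sketch below is only an illustration under stated assumptions: the formulas of the normalised coefficients are not reproduced in the text above, so it uses the usual set-based definitions of Dice's, Jaccard's, the cosine, and the overlap coefficient, and it takes Dice's coefficient as a stand-in for S2 (the values 1 and 1/10 are equally consistent with the cosine and overlap coefficients). The keyword sets are hypothetical and chosen only to match the sizes in the example; all sets are assumed non-empty.

```python
from math import sqrt


def simple_matching(x, y):
    """S1: number of shared index terms, |X ∩ Y|."""
    return len(x & y)


def dice(x, y):
    """Dice's coefficient: 2|X ∩ Y| / (|X| + |Y|)."""
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(x, y):
    """Jaccard's coefficient: |X ∩ Y| / |X ∪ Y|."""
    return len(x & y) / len(x | y)


def cosine(x, y):
    """Cosine coefficient: |X ∩ Y| / sqrt(|X| * |Y|)."""
    return len(x & y) / sqrt(len(x) * len(y))


def overlap(x, y):
    """Overlap coefficient: |X ∩ Y| / min(|X|, |Y|)."""
    return len(x & y) / min(len(x), len(y))


# Hypothetical keyword sets with the sizes used in the example.
x1, y1 = {"t0"}, {"t0"}                          # identical, one term each
x2 = {f"a{i}" for i in range(9)} | {"shared"}    # ten terms
y2 = {f"b{i}" for i in range(9)} | {"shared"}    # ten terms, one in common

for name, coeff in [("simple matching", simple_matching), ("Dice", dice),
                    ("Jaccard", jaccard), ("cosine", cosine),
                    ("overlap", overlap)]:
    print(name, coeff(x1, y1), coeff(x2, y2))
```

Under these assumptions the simple matching coefficient cannot tell the two pairs apart, whereas each normalised coefficient gives 1 for the identical pair and a value well below 1 for the pair sharing a single term.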
Doyle hinted at the importance of normalisation in an amusing way: 'One would regard the postulate "All documents are created equal" as being a reasonable foundation for a library description.
Therefore one would like to count either documents or things which