In this chapter I shall attempt to present a coherent account of classification, so that anyone wishing to use classification techniques in IR will understand the principles involved well enough to do so without too much difficulty.
The emphasis will be on their application in document clustering, although many of the ideas are applicable to pattern recognition, automatic medical diagnosis, and keyword clustering.
A formal definition of classification will not be attempted; for our purposes it is sufficient to think of classification as describing the process by which a classificatory system is constructed.
The word 'classification' is also used to describe the result of such a process.
Although indexing is often thought of (wrongly, I think) as 'classification', we specifically exclude this meaning.
A further distinction to be made is between 'classification' and 'diagnosis'.
Everyday language is very ambiguous on this point:
'How would you classify (identify) this?'
'How are these best classified (grouped)?'
The first example refers to diagnosis whereas the second talks about classification proper.
These distinctions have been made before in the literature, by Kendall and by Jardine and Sibson.
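
The distinction between classification and diagnosis can be made concrete with a small sketch. The Python fragment below is illustrative only: the Jaccard overlap measure, the single-pass grouping rule, the threshold of 0.3, the sample keyword sets, and the function names `classify` and `diagnose` are all assumptions chosen for brevity, not methods prescribed in this chapter.

```python
def jaccard(a, b):
    """Overlap between two keyword sets, from 0 (disjoint) to 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(objects, threshold=0.3):
    """Classification proper: construct groups from ungrouped objects.

    Assumed rule: each object joins the first group whose founding member
    it resembles closely enough; otherwise it starts a new group.
    """
    groups = []
    for obj in objects:
        for group in groups:
            if jaccard(obj, group[0]) >= threshold:
                group.append(obj)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([obj])
    return groups


def diagnose(obj, groups):
    """Diagnosis: identify which existing group a new object belongs to.

    Returns the index of the group containing the most similar member.
    """
    def best_match(group):
        return max(jaccard(obj, member) for member in group)

    return max(range(len(groups)), key=lambda i: best_match(groups[i]))


# Hypothetical documents, each represented by a set of keywords.
documents = [
    frozenset({"retrieval", "index", "query"}),
    frozenset({"retrieval", "query", "ranking"}),
    frozenset({"cluster", "hierarchy", "similarity"}),
    frozenset({"cluster", "similarity", "dendrogram"}),
]

# 'How are these best classified (grouped)?'
groups = classify(documents)

# 'How would you classify (identify) this?'
new_document = frozenset({"query", "ranking", "feedback"})
assigned = diagnose(new_document, groups)
```

Here `classify` builds the classificatory system from scratch, whereas `diagnose` takes that system as given and merely places a new object within it.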
In the context of information retrieval, a classification is required for a purpose.
Here I follow Macnaughton-Smith, who states: 'All classifications, even the most general are carried out for some more or less explicit "special purpose" or set of purposes which should