| Implementation: Storing document lengths. A keyword's total document frequency Dk cannot be known until the entire corpus has been processed, which suggests that document lengths, which depend on it, be computed in a second pass. In some implementations, the length of each document is computed and stored in a doclen file, separate from the main kw-inv inverted postings file. (It would also be possible to normalize the frequencies fkd by document length and store this quotient in the postings file, but that would require storing floats rather than small integers.) For the small corpora we consider here, it proves easier simply to compute these values as the inverted index is read in, before the first matching against queries. |
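
Computing lengths while the inverted index is read in can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described above: the in-memory postings layout, the function name `document_lengths`, and the weighting (fkd multiplied by log(N / Dk), with length taken as the Euclidean norm of a document's weights) are all hypothetical choices made for the example.

```python
import math
from collections import defaultdict


def document_lengths(postings, num_docs):
    """Return {doc_id: length} from an in-memory inverted index.

    Assumed layout (hypothetical): postings maps each keyword to a list
    of (doc_id, f_kd) pairs, where f_kd is a small integer count.
    Assumed weighting (hypothetical): w_kd = f_kd * log(num_docs / D_k).
    """
    squared = defaultdict(float)
    for keyword, plist in postings.items():
        # D_k is simply the length of the postings list, which is known
        # only once the whole corpus has been indexed.
        d_k = len(plist)
        idf = math.log(num_docs / d_k)
        for doc_id, f_kd in plist:
            weight = f_kd * idf
            squared[doc_id] += weight * weight
    # Euclidean norm of each document's weight vector.
    return {doc_id: math.sqrt(total) for doc_id, total in squared.items()}


# Toy index over three documents.
postings = {
    "index": [(1, 2), (2, 1)],
    "query": [(1, 1)],
    "corpus": [(2, 3), (3, 1)],
}
lengths = document_lengths(postings, num_docs=3)
```

Because each postings list already carries Dk as its own length, a single sweep over the loaded index is enough, and the postings themselves keep storing small integers. Documents that appear in no postings list are absent from the result.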