|
EXERCISE 1 Find the average word frequency statistic k for words in our noise word negative dictionary. Also compute it for an equal number of the most frequently occurring words used as index terms for the AIT corpus. How well can varying k be used to discriminate functional from content-descriptive terms?
EXERCISE 2 Derive a statistical test for goodness of fit with a Poisson distribution.
EXERCISE 3 Collect word frequency statistics for all (unstemmed) tokens in the AIT corpus, and identify the 100 most frequently occurring words. Then contrast this set with the words in STOP.WRD. Which words are common to both sets? Which words are very common in AIT but not already part of the negative dictionary? Which words are part of the negative dictionary but do not occur frequently in AIT?
EXERCISE 4 Pick k keywords randomly. Plot the distribution of coordination levels for all documents matching at least one of these. Now repeat this experiment 10 times and plot the mean with standard deviation bars. Iterate this exercise for 1 < k < 10.
EXERCISE 5 What fraction of all postings is pruned by prank?
EXERCISE 6 For one long and one short query, vary NAccum from 10 to 100 percent of NDoc (in increments of 10 percent) and analyze the resulting retrieval performance.
EXERCISE 7 As presented, the threshold that prunes some postings from consideration in the prank() algorithm is very sensitive to A*, the best-matching document's score. Replace this dependence with one on the average of the top k documents' scores.
EXERCISE 8 As presented, the insert threshold (used to determine when new accumulators are added) ignores how full the queue already is. So, for example, postings might not initially be added even when the queue is entirely empty. (There is also a near-bug in the code, arising when the smallest element of the queue is popped even though it may be larger than the newscore being added.) Propose, implement, and test a new decision rule that is sensitive to both the fullness of the queue and A°, its smallest element's score.
EXERCISE 9 Describe the sensitivities of the prank algorithm to document length normalization. That is, under which conditions might the document ordering performed under unnormalized posting frequencies be invalid?
|
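As a possible starting point for Exercise 3, the set comparison it asks for can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the function names `top_tokens` and `contrast`, the simple letters-only tokenizer, and the toy text and stop list are all hypothetical stand-ins, not the AIT corpus, the contents of STOP.WRD, or any code from the text.

```python
from collections import Counter
import re


def top_tokens(text, n):
    """Return the set of the n most frequent lowercase word tokens in text.

    Assumption: a token is any maximal run of the letters a-z; no stemming.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return {word for word, _ in Counter(tokens).most_common(n)}


def contrast(frequent, stop_words):
    """Split two word sets into the three groups Exercise 3 asks about."""
    return {
        "common": frequent & stop_words,         # in both sets
        "frequent_only": frequent - stop_words,  # frequent, not in stop list
        "stop_only": stop_words - frequent,      # in stop list, not frequent
    }


# Toy stand-ins (hypothetical): replace with the corpus text and the
# negative dictionary, and use n = 100 as the exercise specifies.
sample_text = "the model of the search and the model of learning"
sample_stop = {"the", "of", "a"}

groups = contrast(top_tokens(sample_text, 3), sample_stop)
for name, words in groups.items():
    print(name, sorted(words))
```

With the real data, the "frequent_only" group holds the candidates for extending the negative dictionary, while "stop_only" holds the noise words that are rare in this particular corpus.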