EXERCISE 1

  1. How much would the probability p6 of a six-character word’s occurrence change if case were not folded?
  2. What would happen if we allowed hyphens?
  3. For extra credit, discuss how the issues of capitalization and hyphenation interact (for example, in the case of politically correct schoolchildren’s names, like SMITH-JONES-HARTLEY-FRANK).
  4. What does this imply about the utility of proper names as potential index terms?

EXERCISE 2  On the whole, Zipf’s law matches word ranking distributions extremely well, but individual words deviate from it somewhat. Derive a measure of how far a word’s actual frequency deviates from the frequency predicted by Zipf’s law, based strictly on the word’s character length; also consider a rank-based measure of deviation. Use it to identify those words that occur more (or less) frequently than we would expect based on their length alone.
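
As a starting point only, the two kinds of deviation might be sketched as below. This is one possible measure, not the intended answer: it assumes the simplest form of Zipf’s law, f(r) = C / r, with C arbitrarily set to the frequency of the top-ranked word, and it uses the mean log frequency of all words of the same length as the length-based expectation. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def zipf_rank_deviation(tokens):
    """Map each word to log(observed / predicted) under f(r) = C / r.

    Assumption: C is the frequency of the most frequent word, so the
    top-ranked word has deviation 0 by construction.
    """
    ranked = Counter(tokens).most_common()  # descending frequency
    c = ranked[0][1]
    deviations = {}
    for rank, (word, freq) in enumerate(ranked, start=1):
        predicted = c / rank
        deviations[word] = math.log(freq / predicted)
    return deviations


def length_deviation(tokens):
    """Map each word to its log frequency minus the mean log frequency
    of all distinct words with the same character length."""
    counts = Counter(tokens)
    logs_by_length = defaultdict(list)
    for word, freq in counts.items():
        logs_by_length[len(word)].append(math.log(freq))
    mean_log = {n: sum(v) / len(v) for n, v in logs_by_length.items()}
    return {w: math.log(f) - mean_log[len(w)] for w, f in counts.items()}
```

Positive values mark words that occur more often than the chosen baseline predicts, negative values those that occur less often. Other baselines (for example, a fitted exponent rather than a fixed 1 / r) are equally defensible.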

EXERCISE 3  The plot shown in Figure 5.5 would seem to involve a great deal of computational effort: for each k, we must compute the SVD using k dimensions. Is there a way we can exploit the computation of the SVD in k dimensions when computing it for k + 1 or k – 1?
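
One way to explore this question numerically is sketched below, assuming NumPy is available and the matrix is small enough for a dense decomposition. It relies on the standard fact that the best rank-k approximation keeps the k largest singular values and their vectors, so a single decomposition can be truncated at every k of interest. The function name is hypothetical, and a large sparse term-document matrix would call for an iterative solver instead.

```python
import numpy as np


def rank_k_approximations(x, ks):
    """Return {k: rank-k approximation of x}, reusing one SVD for all k."""
    # One decomposition; singular values in s are sorted in descending order.
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    # Truncating to the leading k triplets gives the rank-k approximation.
    return {k: (u[:, :k] * s[:k]) @ vt[:k, :] for k in ks}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.random((6, 4))
    for k, approx in rank_k_approximations(x, [1, 2, 3, 4]).items():
        print(k, np.linalg.norm(x - approx))  # Frobenius error shrinks with k
```

Moving from k to k + 1 then amounts to adding one more singular triplet, and moving to k – 1 amounts to dropping the last one.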

EXERCISE 4  How much memory is required to store Uk? What is the complexity of the computation required to compute it? Making reasonable assumptions about a query load, estimate the load this would place on a query server. Can you suggest a parallel architecture that would be especially appropriate for this purpose?
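
A back-of-the-envelope calculation can anchor the memory and query-load estimates. The sketch below assumes Uk is stored densely with one row per indexed item and k columns of fixed-width floating-point values, and that a query with a few nonzero entries is projected by combining the matching rows. The sizes used (100,000 rows, k = 300, 8-byte values, 5 query terms) are illustrative assumptions, not figures from the text, and the function names are hypothetical.

```python
def dense_matrix_bytes(rows, k, bytes_per_value=8):
    """Bytes needed to store a dense rows-by-k matrix."""
    return rows * k * bytes_per_value


def query_projection_multiply_adds(nonzero_query_terms, k):
    """Multiply-adds to project one sparse query onto k dimensions."""
    return nonzero_query_terms * k


if __name__ == "__main__":
    # Illustrative sizes only.
    print(dense_matrix_bytes(100_000, 300))       # 240000000 bytes, about 240 MB
    print(query_projection_multiply_adds(5, 300))  # 1500 multiply-adds per query
```

Multiplying the per-query cost by an assumed number of queries per second gives a first estimate of server load. Comparing the projected query against every stored document vector adds a further cost proportional to k times the number of documents.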

EXERCISE 5  As more and more documents are added, how much drift in vocabulary is likely? How much drift can occur before it significantly degrades performance?

EXERCISE 6  Do you see any way to reconcile the models of Croft, Fuhr, Buckley, et al. with those of Muntz, Ribeiro, Wong, and Yao? Consider, for example, how pseudo-cosine similarity measures (cf. Equation 5.25) fare with respect to this constraint on Bayesian node independence.