EXERCISE 1
- How much would the probability p6 of a six-character word's occurrence change if case were not folded?
- What would happen if we allowed hyphens?
- For extra credit, discuss how the issues of capitalization and hyphenation interact (for example, in the case of politically correct schoolchildren's names, like SMITH-JONES-HARTLEY-FRANK).
- What does this imply about the utility of proper names as potential index terms?
EXERCISE 2 On the whole, Zipf's law matches word ranking distributions extremely well, but individual words deviate from it somewhat. Derive a measure of this deviation of a word's actual frequency from that predicted by Zipf's law, based strictly on character length; also consider rank-based deviation. Use it to identify those words that occur more (or less) frequently than we would expect based on their length alone.
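As a possible starting point only, the sketch below computes two candidate deviation measures on a toy sentence. Everything in it is an assumption made for illustration rather than the intended answer: the function names `length_deviation` and `rank_deviation` are hypothetical, the length-based expectation is taken to be the mean count of all words of the same length, and the rank-based prediction uses the simplest Zipf form C / rank with C set to the count of the most frequent word.

```python
import math
from collections import Counter, defaultdict


def length_deviation(counts):
    """Log-ratio of each word's count to the mean count of words of equal length.

    Assumption: the "expected" frequency for a length is the empirical mean
    over all words of that length. Positive values mean more frequent than
    expected, negative values less frequent.
    """
    by_length = defaultdict(list)
    for word, n in counts.items():
        by_length[len(word)].append(n)
    mean_by_length = {length: sum(ns) / len(ns) for length, ns in by_length.items()}
    return {w: math.log(n / mean_by_length[len(w)]) for w, n in counts.items()}


def rank_deviation(counts):
    """Log-ratio of each word's count to a Zipf prediction C / rank.

    Assumption: exponent 1 and C equal to the top count; ties in count are
    broken alphabetically. Requires a non-empty counts mapping.
    """
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    top = ranked[0][1]
    return {w: math.log(n / (top / r)) for r, (w, n) in enumerate(ranked, start=1)}


# Toy corpus, far too small to say anything about real text.
text = "the cat sat on the mat and the dog sat on the log"
counts = Counter(text.split())

length_dev = length_deviation(counts)
rank_dev = rank_deviation(counts)

# Words sorted from most over-represented to most under-represented for their length.
for word in sorted(length_dev, key=length_dev.get, reverse=True):
    print(f"{word:>4}  length-dev={length_dev[word]:+.3f}  rank-dev={rank_dev[word]:+.3f}")
```

Using a log-ratio makes over- and under-representation symmetric around zero; other choices, such as fitting the Zipf constant and exponent by regression, are equally defensible and worth comparing.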
EXERCISE 3 The plot shown in Figure 5.5 would seem to involve a great deal of computational effort. For each k, compute the SVD using k dimensions. Is there a way we can exploit the computation for SVD in k dimensions for k + 1 or k - 1?
EXERCISE 4 How much memory is required to store U_k? What is the complexity of the computation required to compute it? Making reasonable assumptions about a query load, estimate the load this would place on a query server. Can you suggest a parallel architecture that would be especially appropriate for this purpose?
EXERCISE 5 As more and more documents are added, how much drift in vocabulary is likely? How much drift can occur before it significantly degrades performance?
EXERCISE 6 Do you see any way to reconcile the models of Croft, Fuhr, Buckley, et al. with those of Muntz, Ribeiro, Wong, and Yao? Consider, for example, how pseudo-cosine similarity measures (cf. Equation 5.25) fare with respect to this constraint on Bayesian node independence.