This week's Challenge Question:
John and Tom are enjoying a day off fishing. Tom notices that John brought along an English translation of Tolstoy's epic novel, War and Peace. Tom chuckles and says to John, "We could use that book as an anchor! How many pages is it?" John answers, "About 1400 pages. It's quite interesting." Tom thinks for a moment and then replies, "If you rearranged the words in that book, I bet you'd have more than 100 pages of just the word 'the'." Is Tom correct? Why or why not?
And the Answer is...(ANSWER CORRECTED 8/19/09 11:39 AM)
Tom is incorrect. War and Peace consists of 560,000 words on 1400 pages (in John's paperback). In English, "the" tends to be the most frequent word. The frequency at which a word occurs in a corpus can be estimated with the following formula (Zipf's law):

$$f(k; s, N) = \frac{1/k^{s}}{H_{N,s}} \qquad (1)$$

Here k is the frequency rank of the word in question; in this case k = 1, since "the" is the most common English word. The exponent s characterizes the power-law distribution that defines word frequency in the corpus. Its value is usually close to 1, and for a rough estimate it is fine to take s ≈ 1. H_{N,s} is the Nth generalized harmonic number, which reduces to the ordinary Nth harmonic number when s = 1. The general equation for calculating it is:

$$H_{N,s} = \sum_{n=1}^{N} \frac{1}{n^{s}}$$

N is the number of unique terms found in the book. A safe upper bound for this number is 100,000 unique words. The fewer unique words there are, the higher the percentage of "the", so using so large a value of N gives the minimum number of pages. The real number is probably more like 10,000 to 20,000.
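To see how much the choice of N matters, here is a minimal Python sketch that evaluates the estimated share of "the" for the vocabulary sizes mentioned above. It assumes s = 1 and k = 1 as in the answer; the function names `harmonic` and `top_word_share` are made up for this illustration.

```python
def harmonic(n_terms: int, s: float = 1.0) -> float:
    """Generalized harmonic number H_{N,s}: sum of 1 / n**s for n = 1..N."""
    return sum(1.0 / n**s for n in range(1, n_terms + 1))


def top_word_share(n_terms: int, s: float = 1.0) -> float:
    """Zipf estimate for the rank-1 word (k = 1): 1 / H_{N,s}."""
    return 1.0 / harmonic(n_terms, s)


# Vocabulary sizes from the text: the likely range and the upper bound.
shares = {n: top_word_share(n) for n in (10_000, 20_000, 100_000)}

for n, share in shares.items():
    print(f"N = {n:>7,}: 'the' is about {share:.1%} of all words")
```

The share shrinks as N grows, which is why the largest N gives the most conservative page count.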
Plugging these estimates into equation 1 we get:
H_{N,s} = 12.09; s = 1; k = 1
f ≈ 1/12.09 ≈ 8.3% of the words in the novel would be "the".
Total occurrences of the word "the" in War and Peace = 560,000 × 0.083 = 46,480
Words per page in War and Peace = 560,000 / 1400 = 400 words per page
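The arithmetic above can be checked with a short Python sketch. It sums the harmonic series directly for N = 100,000 and, like the answer, rounds the frequency to three decimals before multiplying; the constant and function names are illustrative only.

```python
TOTAL_WORDS = 560_000   # word count quoted in the answer
TOTAL_PAGES = 1_400     # pages in John's paperback
VOCABULARY = 100_000    # N, the answer's upper-bound guess


def zipf_frequency(k: int, s: float, n_terms: int) -> float:
    """Zipf estimate: f(k; s, N) = (1 / k**s) / H_{N,s}."""
    h = sum(1.0 / n**s for n in range(1, n_terms + 1))
    return (1.0 / k**s) / h


f_the = zipf_frequency(k=1, s=1.0, n_terms=VOCABULARY)

# Assumption: round f to 0.083 first, as the answer does, before multiplying.
the_count = TOTAL_WORDS * round(f_the, 3)
words_per_page = TOTAL_WORDS / TOTAL_PAGES

print(f"f('the') is about {f_the:.1%}")
print(f"occurrences of 'the': about {the_count:,.0f}")
print(f"words per page: {words_per_page:.0f}")
```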
Where Tom goes wrong is that he forgets that "the" is only about 3/5 the length of the average word, so a page containing only "the"s would hold closer to 650 words than 400. Had "the" been the same length as the average word, then he would have been correct based on the equation above (46,480 / 400 ≈ 116 pages).
Therefore,
Total number of pages with the word "the" = 46,480 / 650 ≈ 71.5 pages (Tom is incorrect).
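As a quick check on the two page counts, here is a small Python sketch. The 650 words-per-page figure is the answer's own estimate for a page of nothing but "the"; the names below are just for illustration.

```python
THE_COUNT = 46_480          # occurrences of "the" from the Zipf estimate
AVG_WORDS_PER_PAGE = 400    # ordinary text in John's paperback
THE_WORDS_PER_PAGE = 650    # answer's estimate for a page of only "the"


def pages_needed(word_count: int, words_per_page: int) -> float:
    """Number of pages required to hold word_count words."""
    return word_count / words_per_page


# If "the" were as long as the average word:
pages_if_average_length = pages_needed(THE_COUNT, AVG_WORDS_PER_PAGE)

# Accounting for "the" being shorter than the average word:
pages_of_the = pages_needed(THE_COUNT, THE_WORDS_PER_PAGE)

tom_is_correct = pages_of_the > 100

print(f"ignoring word length: {pages_if_average_length:.1f} pages")
print(f"with shorter 'the':   {pages_of_the:.1f} pages")
print(f"more than 100 pages?  {tom_is_correct}")
```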
It should also be noted that the 8% calculated above is an overestimate; the real percentage of "the" in War and Peace is around 6%. This is because Zipf's law tends to break down for the most frequent and least frequent words in a document (the edges of the curve). Here is a paper that explains the phenomenon.
http://www.metacarta.com/Collateral/Documents/English-US/Zipfs-law-kornai.pdf