
Challenge Questions

Stop in and exercise your brain. Talk about this week's Challenge from CR4 (weekly), Specs & Techs (monthly) or similar puzzles.

So do you have a Challenge Question that could stump the community? Then submit the question with the "correct" answer and we'll post it. If it's really good, we may even roll it up to Specs & Techs. You'll be famous!

Answers to Challenge Questions appear the following Tuesday.


63 comments

Heavy Book: CR4 Challenge (08/11/09)

Posted August 09, 2009 5:01 PM

This week's Challenge Question:

John and Tom are enjoying a day off fishing. Tom notices that John brought along an English translation of the epic Tolstoy novel, War and Peace. Tom chuckles and says to John, "We could use that book as an anchor! How many pages is it?" John answers, "About 1400 pages; it's quite interesting." Tom thinks for a moment and then replies, "If you rearranged the words in that book, I bet you'd have more than 100 pages of just the word 'the'." Is Tom correct? Why or why not?

And the Answer is...(ANSWER CORRECTED 8/19/09 11:39 AM)

Tom is incorrect. War and Peace consists of 560,000 words on 1400 pages (in John's paperback). In the English language, "the" tends to be the most frequent word. The frequency at which it occurs in a corpus can be estimated by the following formula:


$$f(k; s, N) = \frac{1/k^{s}}{H_{N,s}} \qquad (1)$$

Where k is the frequency rank of the word in question, in this case k=1, since "the" is the most common English word. The s is the exponent of the power-law distribution that describes word frequency in the corpus. This value is usually close to 1, and for a rough estimate it's OK to say s~1. H_{N,s} is the Nth generalized harmonic number. The general equation for calculating it is:

$$H_{N,s} = \sum_{n=1}^{N} \frac{1}{n^{s}}$$

N is the number of unique terms found in the book. A good rough guess for this number is less than 100,000 unique words (the less unique words there are, the higher the percentage of "the", so we are calculating the minimum number of pages with so large a number). The real number is probably more like 10,000 to 20,000.

Plugging these estimates into equation 1 we get:

H_{N,s} = 12.09; s=1; k=1

f~1/12.09 ~8.3% of the words in the novel would be "the".
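
As a rough check on the 12.09 and 8.3% figures, here is a minimal Python sketch of the Zipf's-law estimate, assuming the standard form in which the rank-k word has relative frequency (1/k^s) divided by the generalized harmonic number. The function names are illustrative and not part of the original answer.

```python
import math

def harmonic_number(n: int, s: float = 1.0) -> float:
    """Generalized harmonic number: sum of 1 / i**s for i = 1..n."""
    return math.fsum(1.0 / i**s for i in range(1, n + 1))

def zipf_frequency(k: int, n: int, s: float = 1.0) -> float:
    """Zipf's-law relative frequency of the rank-k word among n unique words."""
    return (1.0 / k**s) / harmonic_number(n, s)

# Estimates used in the answer: N = 100,000 unique words, s = 1, k = 1.
h = harmonic_number(100_000)
f_the = zipf_frequency(k=1, n=100_000)
print(f"H = {h:.2f}, f('the') = {f_the:.1%}")  # H = 12.09, f('the') = 8.3%
```

With N closer to 10,000 the same sketch gives a harmonic number of about 9.79 and a frequency of about 10.2%, which is why the larger N yields the lower bound.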

Total occurrences of the word "the" in War and Peace= 560,000 x .083 = 46,480

Words per page in War and Peace = 560,000/1400 = 400 words per pg

Where Tom goes wrong is that he forgets the word "the" is only about 3/5 the length of the average word, so instead of 400 words per page, a page with only "the"s would be closer to 650 words per page. Had "the" been the same length as the average word, then he would have been correct based on the equation above (~116 pages).

Therefore,

Total Number of Pages with the word "the" = 46,480 / 650 = 71 pages (Tom is incorrect).
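
The page arithmetic can be sketched in a few lines of Python. All inputs are the answer's own estimates; the helper name `pages_of_the` is made up for illustration.

```python
TOTAL_WORDS = 560_000   # word count quoted in the answer
TOTAL_PAGES = 1_400     # pages in John's paperback

def pages_of_the(the_fraction: float, words_per_page: float) -> float:
    """Pages needed to hold every 'the' at the given packing density."""
    return TOTAL_WORDS * the_fraction / words_per_page

avg_words_per_page = TOTAL_WORDS / TOTAL_PAGES   # 400 words per page
pages_naive = pages_of_the(0.083, avg_words_per_page)   # about 116 pages
pages_adjusted = pages_of_the(0.083, 650)               # about 71.5 pages
pages_real_share = pages_of_the(0.06, 650)              # about 51.7 pages

print(round(pages_naive, 1), round(pages_adjusted, 1), round(pages_real_share, 1))
```

The last figure uses the roughly 6% share the answer gives as the real frequency, which pushes the result even further below 100 pages.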

It should also be noted that the 8% calculated above is incorrect; the real percentage of "the" in War and Peace is around 6%. This is because Zipf's law tends to break down for the most frequent and least frequent words in a document (the edges of the curve). Here is a paper that explains the phenomenon:

http://www.metacarta.com/Collateral/Documents/English-US/Zipfs-law-kornai.pdf



Power-User
Technical Fields - Technical Writing - New Member

Join Date: Feb 2006
Location: Near Delaware Water Gap
Posts: 369
Good Answers: 8
#1

Re: Heavy Book: CR4 Challenge (08/11/09)

08/09/2009 5:24 PM

It's my understanding that "the" is the most commonly used word in the English language. So, assuming that it occurs 10x on each page of a 1400 page tome, this would give over 14,000 occurrences of the word. A manuscript page is conventionally accepted as 250 words. Obviously this would lend itself to arriving at over 100 pages of nothing but the word "the". Which would make Tolstoy proud, I'm sure.

Guru
United Kingdom - Member - Not a New Member Hobbies - Musician - New Member Hobbies - Fishing - New Member

Join Date: May 2006
Location: Reading, Berkshire, UK, 51º 27' 33.83"N, 1º 0' 21.65"W
Posts: 4064
Good Answers: 106
#3
In reply to #1

Re: Heavy Book: CR4 Challenge (08/11/09)

08/10/2009 4:44 AM

14,000/250 = 56

The calcs also need to take word length into account. This site quotes English as having an average word length of 5.1 letters, but I've no idea whether a translation of Tolstoy would have the same average.

__________________
Wit and sense are but different avatars of the same spirit L. Stephen
Power-User
Technical Fields - Technical Writing - New Member

Join Date: Feb 2006
Location: Near Delaware Water Gap
Posts: 369
Good Answers: 8
#4
In reply to #3

Re: Heavy Book: CR4 Challenge (08/11/09)

08/10/2009 9:00 AM

Well, I used to be able to do math in my head.....

Power-User
Technical Fields - Technical Writing - New Member

Join Date: Feb 2006
Location: Near Delaware Water Gap
Posts: 369
Good Answers: 8
#2

Re: Heavy Book: CR4 Challenge (08/11/09)

08/09/2009 6:56 PM

Ya know, this also begs the question--why would they bring War and Peace along while they were fishing? I think John and Tom need other hobbies.

Guru
Hobbies - Fishing - Old Salt Hobbies - CNC - New Member United States - US - Statue of Liberty - New Member

Join Date: Mar 2007
Location: Rosedale, Maryland USA
Posts: 1853
Good Answers: 54
#6
In reply to #2

Re: Heavy Book: CR4 Challenge (08/11/09)

08/10/2009 1:35 PM

Need a seat cushion.

__________________
Life is not a journey to the grave with the intention of arriving in a pretty, pristine body but rather to come sliding in sideways, all used up and exclaiming, "Wow, what a ride!"
Off Topic (Score 5)
Commentator
United States - Member - Engineering Fields - Mechanical Engineering - Hobbies - Fishing - Hobbies - Musician -

Join Date: Jun 2009
Location: California
Posts: 78
Good Answers: 13
#10
In reply to #2

Re: Heavy Book: CR4 Challenge (08/11/09)

08/10/2009 4:39 PM

So if you throw the 1400 page book overboard, what happens to the level of the lake?

Member

Join Date: Jan 2009
Location: Beamsville, Ontario
Posts: 7
Good Answers: 1
#19
In reply to #2

Re: Heavy Book: CR4 Challenge (08/11/09)

08/11/2009 8:28 AM

War and Peace on a fishing trip? Try NOT getting into a battle over the discussion of the size of the ONE THAT GOT AWAY! The? Duh!

__________________
The beach is a place were a man can feel, He's the only soul in the world that's real. - Pete Townshend... Who?
Power-User
Technical Fields - Technical Writing - New Member

Join Date: Feb 2006
Location: Near Delaware Water Gap
Posts: 369
Good Answers: 8
#21
In reply to #19

Re: Heavy Book: CR4 Challenge (08/11/09)

08/11/2009 8:54 AM

I'm sure Onthebeach would prefer On the Beach, by Nevil Shute, in which some Australians have other fish to fry.

Off Topic (Score 5)
Guest
#20
In reply to #2

Re: Heavy Book: CR4 Challenge (08/11/09)

08/11/2009 8:53 AM

You are right with that one. I read it last winter and found it a 1400 page soap opera, sux.

Commentator
United States - Member - Engineering Fields - Mechanical Engineering - Hobbies - Fishing - Hobbies - Musician -

Join Date: Jun 2009
Location: California
Posts: 78
Good Answers: 13
#5

Re: Heavy Book: CR4 Challenge (08/11/09)

08/10/2009 11:33 AM

The Russian language does not use definite articles such as "the", so its frequency in the translation is somewhat dependent on the competency and discretion of the translator. Since the original text was written without an equivalent of the word "the", I wouldn't expect the standard English frequency rules to apply and I suspect its frequency is actually much lower, since sentences were structured without it. So I would probably take that bet with Tom.

Good Answer (Score 2)
Power-User

Join Date: Feb 2008
Location: Washington State
Posts: 157
Good Answers: 28
#7

Re: Heavy Book: CR4 Challenge (08/11/09)

08/10/2009 2:24 PM

I found an on-line copy of War and Peace at http://www.online-literature.com/tolstoy/war_and_peace/. I copied a few chapters into Word and did a character count for each chapter. Then I found all the "the"s in each chapter. For the samplings I took, I had 50200 characters with spaces and 508 "the"s. Using 5 spaces for each "the", I get 2540 characters with spaces for "the"s. That is a fraction of just over five percent. Five percent of 1400 pages is 70 pages. Even with pages at the front and back of the book not containing "the"s, I don't think you'd have a 100 page "THE" book.

Thanks,

Jim
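
For anyone who wants to repeat this kind of sampling without Word, here is a small Python sketch. It assumes a whole-word, case-insensitive match is what is wanted; the sample sentence is invented for illustration and is not taken from the novel.

```python
import re

def the_share(text: str, chars_per_the: int = 4) -> tuple[int, float]:
    """Return (number of 'the's, fraction of all characters they occupy)."""
    # \b keeps words such as "then" or "other" from being counted.
    count = len(re.findall(r"\bthe\b", text, flags=re.IGNORECASE))
    return count, count * chars_per_the / len(text)

sample = "The prince looked at the ceiling, and then at the door."
count, share = the_share(sample)
print(count, f"{share:.1%}")

# The figures quoted in the post: 508 "the"s at 5 characters each in 50,200 characters.
posted_share = 508 * 5 / 50_200   # just over 5%
```

Counting 4 characters per "the" (three letters plus one space) instead of 5 lowers the posted share to about 4%.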

Good Answer (Score 2)
Commentator
United States - Member - Engineering Fields - Mechanical Engineering - Hobbies - Fishing - Hobbies - Musician -

Join Date: Jun 2009
Location: California
Posts: 78
Good Answers: 13
#8
In reply to #7

Re: Heavy Book: CR4 Challenge (08/11/09)

08/10/2009 4:23 PM

I ran the first 10 chapters of Book One and got the following statistics:

Words: 16729

Characters: 76972

Chars w/Spaces: 93293

"the" occurrences: 815

Percentages come out to:

"the" / Words = 4.9%

"the" / Characters = 3.2%

"the " / Chars w/Spaces = 3.5% (I used "the" with one space instead of two)

Spaces / Chars w/Spaces = 17.5% (that's a lot of empty reading)

Per wiktionary, "the" frequency is about 5.6%, based on frequency counts in a wide range of literature (Project Gutenberg). From my understanding, this is based on word count, so the percentage above of 4.9% is in line with this, as well as with your estimate based on character count, albeit slightly lower. Thus, 5% x 1400 pages x 250 words/page = 17500 "the"s, which at 250 words/page results in 70 pages. However, since "the" is shorter than the average word, all of them should take up fewer pages than the calc based on word count alone.
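
The percentages in this post can be reproduced from the four raw counts. This is a plain arithmetic sketch in Python, using only the figures posted above.

```python
words = 16_729
chars = 76_972
chars_with_spaces = 93_293
the_count = 815

the_per_word = the_count / words                                 # about 4.9%
the_per_char = 3 * the_count / chars                             # about 3.2%
the_per_char_spaces = 4 * the_count / chars_with_spaces          # about 3.5%
space_share = (chars_with_spaces - chars) / chars_with_spaces    # about 17.5%

# Word-count-based page estimate: 5% of a 1400-page book.
pages_by_word_count = 0.05 * 1400                                # about 70 pages

for value in (the_per_word, the_per_char, the_per_char_spaces, space_share):
    print(f"{value:.1%}")
```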

Power-User

Join Date: Feb 2008
Location: Washington State
Posts: 157
Good Answers: 28
#9
In reply to #8

Re: Heavy Book: CR4 Challenge (08/11/09)

08/10/2009 4:29 PM

It seems I am better at higher math than at counting. It continues to amaze me how often I make the most obvious, simple mistakes. I surely could have counted the 3 letters in "the" and added one for a space to get 4 not 5.

Thanks,

Jim

Commentator
United States - Member - Engineering Fields - Mechanical Engineering - Hobbies - Fishing - Hobbies - Musician -

Join Date: Jun 2009
Location: California
Posts: 78
Good Answers: 13
#11
In reply to #9

Re: Heavy Book: CR4 Challenge (08/11/09)

08/10/2009 5:16 PM

Actually, the Project Gutenberg site has the War & Peace ebook available for free download. Statistics from Word are as follows:

Pages: 1,203 (Word formatted)

Words: 564,528

Characters: 2,626,404

Chars w/Spaces: 3,139,101

"the" Occurrences: 34,544

A few calcs based on these statistics:

Words / Page = 469

Characters / Word = 4.65

Chars + Spaces / Word = 5.56

"the" Characters = 34544 x 3 = 103,632

"the " Characters = 34544 x 4 = 138,176

Based on the character count with spaces, the "the"s should take up:

138176 char x (1 / 5.56 char/word) x (1 / 469 word/page) = 53 pages

To adjust the number of pages for "the " shorter word length, multiply by 0.719 (4 char / 5.56 char):

53 x 0.719 = 38 pages

To adjust for the pagination difference, multiply by 1.164 (1400 pages / 1203 pages):

38 x 1.164 = 44.3 pages

So there is nowhere near 100 pages of "the"s.
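
The chain of calculations above can be replayed in Python with unrounded intermediates. This sketch simply follows the post's own steps, including the word-length factor, which a later reply in this thread reconsiders.

```python
pages_word_layout = 1_203      # pages in the Word-formatted ebook
words = 564_528
chars_with_spaces = 3_139_101
the_count = 34_544

words_per_page = words / pages_word_layout          # about 469
chars_per_word = chars_with_spaces / words          # about 5.56, spaces included
the_chars = the_count * 4                           # "the" plus one space

# Pages of "the" in the Word layout, by character count.
pages_by_chars = the_chars / chars_per_word / words_per_page    # about 53

# The post's next two steps: word-length factor, then rescale to 1400 pages.
pages_length_adjusted = pages_by_chars * (4 / chars_per_word)   # about 38
pages_paperback = pages_length_adjusted * (1_400 / pages_word_layout)  # about 44.3

print(round(pages_by_chars), round(pages_length_adjusted), round(pages_paperback, 1))
```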

Good Answer (Score 7)
Commentator
United States - Member - Engineering Fields - Mechanical Engineering - Hobbies - Fishing - Hobbies - Musician -

Join Date: Jun 2009
Location: California
Posts: 78
Good Answers: 13
#12
In reply to #11

Re: Heavy Book: CR4 Challenge (08/11/09)

08/10/2009 5:46 PM

"To adjust the number of pages for "the " shorter word length, multiply by 0.719 (4 char / 5.56 char):

53 x 0.719 = 38 pages"

I just realized the character count already takes into account the shorter word length, so this correction is not necessary. However, the pagination correction is necessary, so the number of "the" pages is:

53 x 1.164 = 61.6 pages

Still well below 100 pages.

Score 1 for Good Answer
Power-User

Join Date: Oct 2007
Posts: 119
Good Answers: 3
#30
In reply to #12

Re: Heavy Book: CR4 Challenge (08/11/09)

08/12/2009 12:22 PM

GA for a very thorough analysis.

Depending upon the version you downloaded, the font kerning might be different from the version in the printed book. Since the word "the" contains three letters which do not require reducing the character spacing to improve the way it looks, the number of pages should be increased. Below are lines with the words "the" and "ill" repeated 20 times.

the the the the the the the the the the the the the the the the the the the the

ill ill ill ill ill ill ill ill ill ill ill ill ill ill ill ill ill ill ill ill

The difference between these two words is about 80/42.

A more detailed analysis is required to determine the occurrence of each letter. Considering this picky fact might increase the number of pages above 38, but the basic answer should still be that the number of pages filled with just "the" will be well under 100.

Score 1 for Good Answer
Commentator
United States - Member - Engineering Fields - Mechanical Engineering - Hobbies - Fishing - Hobbies - Musician -

Join Date: Jun 2009
Location: California
Posts: 78
Good Answers: 13
#32
In reply to #30

Re: Heavy Book: CR4 Challenge (08/11/09)

08/12/2009 5:31 PM

You are absolutely right - I am sure the kerning will be different, especially since my original analysis was based on a non-proportional font with no kerning. After I did the analysis, I tried to play around with the effects of proportional fonts, margins and justification, but I found that the text version I downloaded had carriage returns at the end of each line instead of spaces. This only allowed kerning on a line-by-line basis and would not adjust paragraphs when additional space was created at the end of each line.

I wrote a VBA routine to remove all the inter-paragraph CR's and was then able to play around with different fonts and justification. In fact, just getting rid of all those CR's shortened the book by about 5 pages by utilizing the extra space at the end of lines. I also looked at the impact of the 'word + spaces' length after replacing all the CR's, but there was minimal change and only affected the length of "the" book by about half a page.
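
The VBA routine itself was not posted. As a rough equivalent, here is a Python sketch of the same idea: join the hard-wrapped lines inside each paragraph so the text can reflow. It assumes paragraphs are separated by blank lines, which may not hold for every plain-text edition, and the function name is hypothetical.

```python
import re

def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines within each paragraph; keep paragraph breaks."""
    # Normalize Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: a blank line (possibly containing spaces) separates paragraphs.
    paragraphs = re.split(r"\n\s*\n", text)
    # Collapse every run of whitespace inside a paragraph to a single space.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())

wrapped = "first line\r\nsecond line\r\n\r\nnew paragraph"
print(unwrap_paragraphs(wrapped))
```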

Finally, I used one proportional & one non-proportional font to create two books that were about 1400 pages each by adjusting font size, page size & margins. Using the page setup & font from each one, I created a book with 34544 "the"s (as MPM did in #13 below) and came up with the following:

"the" Book (proportional font) = 42 pages

"the" Book (non-proportional) = 50 pages

Score 1 for Good Answer
Guru

Join Date: Apr 2007
Posts: 3303
Good Answers: 56
#36
In reply to #32

Re: Heavy Book: CR4 Challenge (08/11/09)

08/15/2009 8:17 AM

In order to get an accurate answer, we would need to look at line length and numbers of close-spaced lines we can get on a page, rather than trying to count the average length of the characters.

So we would measure the physical length of a "the" measured between a fixed point on the "e" and an immediately preceding "e", and the maximum length of a line. I'm not certain how to handle "The", as this should either follow a full-stop or be at the beginning of a paragraph.

So we've a trade-off between the width of the characters in " the" (including the space) being somewhat above average, and the reduction in wasted space (due to the use of short words and no paragraph breaks). I suspect this would result in rather less than 60 pages.

But there are still ambiguities - Tom may decide the bet included the paragraph break whenever "The" occurs at the start of a paragraph. Such frequent paragraph breaks could easily extend the text beyond 100 pages

Power-User

Join Date: Sep 2007
Location: Christchurch, New Zealand
Posts: 161
Good Answers: 16
#13
In reply to #11

Re: Heavy Book: CR4 Challenge (08/11/09)

08/11/2009 12:06 AM

I got similar numbers to you in terms of the word count, but I did the calcs slightly differently allowing for the actual 1400 pages in the printed novel. Also my ratio of "the" words worked out to 6.12% (34544/564525) which is quite close to the wiktionary figure of 5.63 :-

Words: 564525
Pages: 1400
Words/page: 403.23
Chars per page (based on 5.1-letter word average): 2056.48
3-letter words per page: 685.49
Number of occurrences of "the": 34544
Total pages with "the" in them: 50.39

I also did another more practical check by replicating the word "the" 34544 times and importing into Word. This fitted into just over 33 Word pages.

I then had to multiply by the factor 1400/1120 as my complete document download fitted into 1120 Word pages. This yields 41 pages of tightly packed "the" words.

Finally factoring in the fact that Tom might just allow for a similarity of half lines and new paragraphs etc which were in the original document I came up with a figure of just over 51 pages ( Close enough for me to the calculated 50.39 ).

This tells me indeed that Tom is not correct with his 100 page estimate.
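
The figures in this post can be checked with a few lines of Python. The sketch follows the post's own method, which compares letter counts only (5.1 letters per average word against 3 for "the") and ignores spaces.

```python
words = 564_525
pages = 1_400
avg_word_length = 5.1    # average letters per word assumed in the post
the_count = 34_544

words_per_page = words / pages                       # about 403.23
chars_per_page = words_per_page * avg_word_length    # about 2056.48
thes_per_page = chars_per_page / 3                   # about 685.49
pages_calculated = the_count / thes_per_page         # about 50.39

# Practical check from the post: 33 Word pages, rescaled from 1120 to 1400 pages.
pages_practical = 33 * 1_400 / 1_120                 # 41.25

print(round(pages_calculated, 2), pages_practical)
```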

Good Answer (Score 2)
Power-User

Join Date: Jun 2009
Location: Bangalore, India
Posts: 276
Good Answers: 13
#14

Re: Heavy Book: CR4 Challenge (08/11/09)

08/11/2009 12:50 AM

Various references say that 'the', 'and' and 'I' have a 10% occurrence in English. Considering this, it seems unlikely that 7% of the book will be 'the' if word length also is taken into account.

May be 40%.

bioramani

__________________
bioramani
Guru

Join Date: Aug 2005
Location: Hemel Hempstead, UK
Posts: 1539
Good Answers: 41
#15

Re: Heavy Book: CR4 Challenge (08/11/09)

08/11/2009 3:57 AM

GAs to dac1267 and Maths_Physics_Maniac. Hmmmm 34544 "the"s is going to make some boring reading.

A bloke owns a pub called the Rose and Crown. He hires a sign writer to make him a new sign. The sign writer shows him a sketch of the proposed sign, and the landlord says, "That looks pretty good, but I'd like more space between Rose and and and and and Crown."

The above sentence has five consecutive identical words, but still makes reasonable sense. I know of one example with eight identical consecutive words: can anyone figure it out, or do better?

__________________
The early bird catches the worm, but, look what happens to the early worm: Alfred E. Neuman
Off Topic (Score 5)
Guru

Join Date: Mar 2007
Location: India
Posts: 2593
Good Answers: 102
#17
In reply to #15

Re: Heavy Book: CR4 Challenge (08/11/09)

08/11/2009 7:46 AM

tell me what is it and I will refer you to a great site

__________________
Fantastic ideas for a Fantastic World, I make the illogical logical.They put me in cars,they put me in yer tv.They put me in stereos and those little radios you stick in your ears.They even put me in watches, they have teeny gremlins for your watches
Off Topic (Score 5)
Power-User

Join Date: May 2009
Posts: 285
Good Answers: 2
#63
In reply to #17

Re: Heavy Book: CR4 Challenge (08/11/09)

08/20/2009 9:55 PM

This is a site somewhat similar - but not.

http://www.youtube.com/user/hotforwords

__________________
Kyzine
Off Topic (Score 5)
Commentator
United States - Member - Engineering Fields - Mechanical Engineering - Hobbies - Fishing - Hobbies - Musician -

Join Date: Jun 2009
Location: California
Posts: 78
Good Answers: 13
#22
In reply to #15

Re: Heavy Book: CR4 Challenge (08/11/09)

08/11/2009 10:48 AM

Have we been had? (11 times)

Off Topic (Score 5)
Guru

Join Date: Aug 2005
Location: Hemel Hempstead, UK
Posts: 1539
Good Answers: 41
#23
In reply to #22

Re: Heavy Book: CR4 Challenge (08/11/09)

08/11/2009 11:20 AM

Yes!

__________________
The early bird catches the worm, but, look what happens to the early worm: Alfred E. Neuman
Off Topic (Score 5)
Power-User

Join Date: May 2008
Posts: 103
Good Answers: 6
#24
In reply to #22

Re: Heavy Book: CR4 Challenge (08/11/09)

08/11/2009 2:00 PM

You must be thinking of the construction which has 11 consecutive "hads". But these occur in two sentences: The first ending with a string of 7 and the second starting with a string of 4.

I'm not sure that this qualifies as better (which was the sub-challenge) than Randall's claim of 8 in a single sentence.

Off Topic (Score 5)
Commentator
United States - Member - Engineering Fields - Mechanical Engineering - Hobbies - Fishing - Hobbies - Musician -

Join Date: Jun 2009
Location: California
Posts: 78
Good Answers: 13
#25
In reply to #24

Re: Heavy Book: CR4 Challenge (08/11/09)

08/11/2009 2:07 PM

Bill, who like I had had "had had", had had "had had" also.

Off Topic (Score 5)
Guru

Join Date: Aug 2005
Location: Hemel Hempstead, UK
Posts: 1539
Good Answers: 41
#27
In reply to #24

Re: Heavy Book: CR4 Challenge (08/11/09)

08/12/2009 4:33 AM

Mine was:-

John had had had, where Tom had had had had: had had had had the examiner's approval.

11 hads altogether; 8 consecutive: but, I guess you would normally split that into two sentences where I've put the colon.

__________________
The early bird catches the worm, but, look what happens to the early worm: Alfred E. Neuman
Off Topic (Score 5)
Power-User

Join Date: May 2008
Posts: 103
Good Answers: 6
#28
In reply to #27

Re: Heavy Book: CR4 Challenge (08/11/09)

08/12/2009 10:02 AM

Jack, where John had had "had", had had "had had". "Had had" had had the examiner's approval.

11 consecutive; 2 sentences.

I first saw the puzzle this way.

Punctuate the following:-

Jack where John had had had had had had had had had had had the examiner's approval.

Off Topic (Score 5)
Commentator
United States - Member - Engineering Fields - Mechanical Engineering - Hobbies - Fishing - Hobbies - Musician -

Join Date: Jun 2009
Location: California
Posts: 78
Good Answers: 13
#29
In reply to #28

Re: Heavy Book: CR4 Challenge (08/11/09)

08/12/2009 11:44 AM

This is the puzzle I saw. A semi-colon is used instead of a period, so technically it is one sentence.

James, while John had had "had", had had "had had"; "had had" had had a better effect on the teacher.

Off Topic (Score 5)
Guru

Join Date: Apr 2007
Posts: 3303
Good Answers: 56
#40
In reply to #29

Re: Heavy Book: CR4 Challenge (08/11/09)

08/18/2009 9:58 AM

The pub-quiz question was: "what is the largest number of sequential word repeats that could meaningfully be placed in a single sentence".

Philip answered that it was 11, citing "James, while John had had "had", had had "had had"; "had had" had had the examiner's approval."

Philippa answered in her turn as follows:
"Had I given Philip's answer, I would have had "had had "had", had had "had had"; "had had" had had", or 12 "had"s in sequence. Moreover this last sentence had had "had "had had "had", had had "had had"; "had had" had had"", a sequence of 14." She averred that she could continue in the same vein as long as desired, and that there was therefore no limit.

The quizmaster had a nervous breakdown.

Off Topic (Score 5)
Power-User

Join Date: Sep 2007
Location: Christchurch, New Zealand
Posts: 161
Good Answers: 16
#26
In reply to #15

Re: Heavy Book: CR4 Challenge (08/11/09)

08/11/2009 5:43 PM

I'd like more space between Rose and and and and and Crown.

Putting in some inverted commas to imply more meaning you get:-

I'd like more space between "Rose" and "and" and "and" and "Crown".

Off Topic (Score 5)
Guru

Join Date: Aug 2005
Location: Hemel Hempstead, UK
Posts: 1539
Good Answers: 41
#31
In reply to #26

Re: Heavy Book: CR4 Challenge (08/11/09)

08/12/2009 12:44 PM

Aha: so you're saying it would have made more sense if I'd put more punctuation between "Rose" and "and" and "and" and ""and"" and ""and"" and "and"............

__________________
The early bird catches the worm, but, look what happens to the early worm: Alfred E. Neuman
Off Topic (Score 5)
Active Contributor

Join Date: Feb 2009
Posts: 13
Good Answers: 1
#16

Re: Heavy Book: CR4 Challenge (08/11/09)

08/11/2009 7:28 AM

Try extracting all the T H & E's in the book & you may find that you get the answer you need.

Very common letters.

__________________
Murphy was an optimist....
Guru

Join Date: Mar 2007
Location: India
Posts: 2593
Good Answers: 102
#18
In reply to #16

Re: Heavy Book: CR4 Challenge (08/11/09)

08/11/2009 7:51 AM

I thought so too, and that would have been close: H, being the least frequent of the three letters, would have contributed approx. 73 pages (= 0.053*1400), and with the correction factors it might have run close. (For T it is 146 and for E 184.)

But he said re-arrange words not re-arrange letters.

__________________
Fantastic ideas for a Fantastic World, I make the illogical logical.They put me in cars,they put me in yer tv.They put me in stereos and those little radios you stick in your ears.They even put me in watches, they have teeny gremlins for your watches
Commentator
United States - Member - Engineering Fields - Mechanical Engineering - Hobbies - Fishing - Hobbies - Musician -

Join Date: Jun 2009
Location: California
Posts: 78
Good Answers: 13
#35
In reply to #16

Re: Heavy Book: CR4 Challenge (08/11/09)

08/14/2009 1:53 PM

Ask and ye shall receive:

T's - 224,530

H's - 166,522

E's - 313,055

So you could make 166,522 "the"s with this letter distribution. Using my estimate of 42 pages for 34,544 "the"s, this would result in about 200 pages. However, since Tom said to rearrange the words, not letters, I don't think this is the correct approach. I think he was trying to make the point of how prevalent the word "the" is in the English language, and how much superficial reading is contained within the book.
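
The letter-based estimate comes down to finding the scarcest required letter. Here is a short Python sketch using the counts posted above; the helper `max_copies` is a name invented for illustration.

```python
from collections import Counter

def max_copies(word: str, available: dict) -> int:
    """How many copies of `word` can be spelled from the available letters."""
    needed = Counter(word.lower())
    return min(available.get(ch, 0) // n for ch, n in needed.items())

letters = {"t": 224_530, "h": 166_522, "e": 313_055}
copies = max_copies("the", letters)      # limited by the H count: 166,522

# Scale the earlier estimate of 42 pages for 34,544 "the"s.
pages = 42 * copies / 34_544             # about 202 pages
print(copies, round(pages))
```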

Guest
#33

Re: Heavy Book: CR4 Challenge (08/11/09)

08/12/2009 5:43 PM

Tom was likely wrong on the utility of the book as an anchor - no surprise that he is also wrong on the density of "the". (On the other hand, if I was floating a bet, I'd probably do something similar to soften up the opposition)

Participant

Join Date: Aug 2009
Posts: 1
#34

Re: Heavy Book: CR4 Challenge (08/11/09)

08/13/2009 10:59 AM

I have to commend the forum members. Whenever I think I'm too much of a Math-Geek to still be called a Mechanical Engineer, I find solace in your Uber-Math-Geekness...

And always such wonderful use of external sources...

... but...

The first thing I thought of when I read, "... rearranged the words...", was rearranging the LETTERS too.

Though it's been shown Tom was wrong using just the 'whole-word' technique, I suspect one could squeeze another 30 - 50 pages of 'the' out of mixed letters. This would fill out Tom's 100 pages.

Thanks.

Power-User

Join Date: Nov 2006
Location: antwerp/belgium/europe
Posts: 162
Good Answers: 5
#37

Re: Heavy Book: CR4 Challenge (08/11/09)

08/18/2009 1:45 AM

Well, well ... it just happened ...

I just read a novel with more than 50,000 words, and not A SINGLE time the word "THE"... I guess this novel pulls down the "the-average" quite a bit ...

check it out !

__________________
44mEurope
Guru

Join Date: Aug 2005
Location: Hemel Hempstead, UK
Posts: 1539
Good Answers: 41
#38
In reply to #37

Re: Heavy Book: CR4 Challenge (08/11/09)

08/18/2009 3:38 AM

Here's a word with no vowels!

WHY?

__________________
The early bird catches the worm, but, look what happens to the early worm: Alfred E. Neuman
Off Topic (Score 5)
Guru

Join Date: Apr 2007
Posts: 3303
Good Answers: 56
#39
In reply to #38

Re: Heavy Book: CR4 Challenge (08/11/09)

08/18/2009 6:51 AM

Tsk, tsk.
Shhh!
Hmm, I thought that Y was a vowel in this context, as is all of the OUGH in thought and the W in vowel.

But what about the rolled r, hissed s and extended j, l, z, v, ch, sh, h and f (some used in the overly-quoted Czech phrases "Strč prst skrz krk" and "Smrž pln skvrn zvlhl z mlh")? If it sounds like a vowel, behaves like a vowel, I would say... (qk, qk).
(I'd accept argument that maybe it needs to be voiced as in the first four in my list to be classed as a vowel)

Off Topic (Score 5)
Power-User

Join Date: Feb 2008
Location: Washington State
Posts: 157
Good Answers: 28
#41

Re: Heavy Book: CR4 Challenge (08/11/09)

08/18/2009 10:11 AM

I believe the answer to the challenge question provides some insight into why we need to look carefully at all of the parameters in what we are analyzing. The analysis assumes all the words are of the same length. "The" is only 3 letters long while the average word length is probably something like 4.5 letters long. Assuming each word has one space, the ratio of "the " length to average word length (plus a space) is about 73%. Taking word length into consideration would drop the official answer from 116 pages to something less than 100.

Thanks,

Jim

Score 1 for Good Answer
Guest
#42
In reply to #41

Re: Heavy Book: CR4 Challenge (08/11/09)

08/18/2009 10:49 AM

Agreed - it's another example of "the answer" missing any reality check.

It starts with an invalid formula: academic measurements of occurrence find "the" at 6.2% on average, including speech. For text it falls to 6%, and for imaginative text to 5.2%. Even without any other corrections, 6.2% would reduce the result of this calculation to about 87 pages. (And, of course, it has been established that "War and Peace" has above-average use of adjectives and (long) names, even for imaginative(?) texts.)

It continues by ignoring the effect of word length. Granted, this depends on the specific font, but it is extremely unlikely to be more than the 86% at which an 8.3% occurrence brings it below 100 pages.

That is of course without considering the effect of paragraph breaks, which would further shorten the length of a bowdlerisation consisting entirely of "the"s.
(On the other hand, asserting that all paragraph breaks have to be retained would be the only way that Tom could avoid losing).
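
The two numbers in this reply (about 87 pages, and the 86% break-even) can be reproduced as follows. This is a sketch using the official answer's 560,000 words and 400 words per page, with no word-length correction applied to the page counts.

```python
TOTAL_WORDS = 560_000
WORDS_PER_PAGE = 400

def naive_pages(the_fraction: float) -> float:
    """Pages of 'the' if each one took as much room as an average word."""
    return TOTAL_WORDS * the_fraction / WORDS_PER_PAGE

pages_measured = naive_pages(0.062)     # about 86.8 pages at the measured 6.2%
pages_zipf = naive_pages(0.083)         # about 116.2 pages at the Zipf 8.3%

# Relative width of "the" versus an average word at which 8.3% lands on 100 pages.
break_even_ratio = 100 / pages_zipf     # about 0.86

print(round(pages_measured, 1), round(pages_zipf, 1), round(break_even_ratio, 2))
```

Any relative width for "the" below that ratio puts even the 8.3% estimate under 100 pages.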

Guru

Join Date: Apr 2007
Posts: 3303
Good Answers: 56
#44
In reply to #42

Re: Heavy Book: CR4 Challenge (08/11/09)

08/18/2009 10:55 AM

That was I - apologies. Also that I forgot to credit the vastly superior contributions of the community - particularly those of jim, dac and mpm

Commentator

Join Date: Apr 2005
Location: Altair 4
Posts: 63
Good Answers: 1
#43

Re: Heavy Book: CR4 Challenge (08/11/09)

08/18/2009 10:55 AM

Regarding the answer:

And the Answer is...

Tom is correct. War and Peace consists of 560,000 words on 1400 pages (in John's paperback). In the English language, "the" tends to be the most frequent. The frequency at which it occurs in a corpus can be estimated by the following formula:


Where k is the frequency rank of the word in question, in this case k=1, since "the" is the most common English word. The s is the exponent for the power law distribution that defines word frequency in the corpus. This value is usually close to 1 and for a rough estimate it's ok to say s~1. HN,s is the Nth Harmonic Number. The general equation for calculating the nth Harmonic Number is:

N is the number of unique terms found in the book. A good rough guess for this number is less than 100,000 unique words (the less unique words there are, the higher the percentage of "the", so we are calculating the minimum number of pages with so large a number. The real number is probably more like 10,000 to 20,000.

Plugging these estimates into equation 1 we get:

HN,s = 12.09; s=1; k=1

f~1/12.09 ~8.3% of the words in the novel would be "the".

Total occurrences of the word "the" in War and Peace= 560,000 x .083 = 46,480

Words per page in War and Peace = 560,000/1400 = 400 words per pg

Total Number of Pages with the word "the" = 46,480 / 400 = 116 pages (Tom is correct)

Thus there would be at least 116 pages of the word "the".
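The quoted arithmetic can be checked with a minimal Python sketch. It assumes the formula missing from the quote is the usual Zipf form, f(k) = (1/k^s) / H(N,s), with H(N,s) the sum of 1/n^s for n = 1 to N. The inputs (N = 100,000, s = 1, k = 1, 560,000 words, 1400 pages) come from the quoted answer; the function names are made up for illustration.

```python
def generalized_harmonic(n_terms: int, s: float = 1.0) -> float:
    """Sum of 1/n**s for n = 1..n_terms (assumed meaning of H_{N,s})."""
    return sum(1.0 / n ** s for n in range(1, n_terms + 1))


def zipf_fraction(rank: int, n_terms: int, s: float = 1.0) -> float:
    """Estimated share of all words taken by the word of the given rank."""
    return (1.0 / rank ** s) / generalized_harmonic(n_terms, s)


def pages_of_word(total_words: int, total_pages: int,
                  rank: int, n_terms: int, s: float = 1.0) -> float:
    """Pages filled by one word, at the book's average words per page."""
    words_per_page = total_words / total_pages
    occurrences = total_words * zipf_fraction(rank, n_terms, s)
    return occurrences / words_per_page


if __name__ == "__main__":
    h = generalized_harmonic(100_000)
    pages = pages_of_word(560_000, 1400, rank=1, n_terms=100_000)
    print(f"H = {h:.2f}, estimated pages of 'the' = {pages:.1f}")
```

This reproduces the estimate of roughly 116 pages. A smaller N gives a smaller harmonic number and so a larger estimate. It says nothing about whether the estimate matches an actual word count.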

Not only is the notation here muddled up (different uses of 'k', and inconsistent use of HN, HN, Hn, HN,s [?]) but the conclusion is based on a statistical estimate which is contradicted by the actual values reported by Project Gutenberg (assuming that DAC1267 did not err in his Reply #11).

/Disappointed in this 'Correct Answer'.

__________________
"Welcome to Altair 4, gentlemen."
Guest
#45

Re: Heavy Book: CR4 Challenge (08/11/09)

08/18/2009 2:18 PM

The solution does not consider word length. If the average word length is 5 then there would be less than 100 pages of "the".
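A rough sketch of that adjustment, assuming a page is filled by character count rather than word count. The 116-page figure is the posted answer's and the 5-letter average is the one suggested above; counting one trailing space per word is an added assumption, and the function name is hypothetical.

```python
def pages_adjusted_for_length(pages_by_word_count: float,
                              word_len: int,
                              avg_word_len: float,
                              separator: int = 1) -> float:
    """Scale a page count based on words to one based on characters.

    Each word is assumed to occupy its letters plus `separator` spaces.
    """
    scale = (word_len + separator) / (avg_word_len + separator)
    return pages_by_word_count * scale


if __name__ == "__main__":
    with_space = pages_adjusted_for_length(116, word_len=3, avg_word_len=5)
    no_space = pages_adjusted_for_length(116, 3, 5, separator=0)
    print(f"with spaces: {with_space:.1f} pages, without: {no_space:.1f} pages")
```

Under either assumption the result falls below 100 pages.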

Guru
United Kingdom - Member - Not a New Member Hobbies - Musician - New Member Hobbies - Fishing - New Member

Join Date: May 2006
Location: Reading, Berkshire, UK, 51º 27' 33.83"N, 1º 0' 21.65"W
Posts: 4064
Good Answers: 106
#46

Re: Heavy Book: CR4 Challenge (08/11/09)

08/18/2009 7:36 PM

And the Answer is...

The answer you've given is bollocks. Many of the replies have given practical solutions - not fancy imaginary statistical clap-trap nonsense speculations - which you claim as "the answer", even though it has clearly been shown to be false.

Think about it.

CR4 Challenge Admins - this has left a bad taste in my mouth, and I don't think I'm alone in this.

__________________
Wit and sense are but different avatars of the same spirit L. Stephen
Guest
#47
In reply to #46

Re: Heavy Book: CR4 Challenge (08/11/09)

08/19/2009 6:52 AM

I don't think that CR4 admin even look at the thread.

If you want to make the point, it's best to write to them directly.

Off Topic (Score 5)
Guru
United Kingdom - Member - Hearts of Oak Popular Science - Paleontology - New Member Engineering Fields - Mechanical Engineering - New Member

Join Date: May 2005
Location: In the Garden
Posts: 1424
Good Answers: 12
#48

Re: Heavy Book: CR4 Challenge (08/11/09)

08/19/2009 8:10 AM

(the less unique words there are, the higher the percentage

Grammar, my dears, grammar!!!!

That should read the fewer unique words

Just because we're engineers doesn't mean we can't talk and write properly!

__________________
Chaos always wins because it's better organised
The Engineer
Engineering Fields - Engineering Physics - Physics... United States - Member - NY Popular Science - Genetics - Organic Chemistry... Popular Science - Cosmology - New Member

Join Date: Feb 2005
Location: Albany, New York
Posts: 3558
Good Answers: 81
#52
In reply to #48

Re: Heavy Book: CR4 Challenge (08/11/09)

08/19/2009 1:00 PM

I think the usage was perfect Americanish. "Fewer" has gone the way of "whom" here in the states. We use the word "fewer" when we want to sound proper and drop it the rest of the time. Stop trying to force your English on us.

http://www.editrix.us/2008/07/less-and-fewer.html

At least we pronounce our H's.

http://www.youtube.com/watch?v=EAYUuspQ6BY&feature=related

__________________
Cause you said the brains I had went to my head
Guru
United Kingdom - Member - Not a New Member Hobbies - Musician - New Member Hobbies - Fishing - New Member

Join Date: May 2006
Location: Reading, Berkshire, UK, 51º 27' 33.83"N, 1º 0' 21.65"W
Posts: 4064
Good Answers: 106
#54
In reply to #52

Re: Heavy Book: CR4 Challenge (08/11/09)

08/19/2009 1:57 PM

... but it's soooo ugly!

"less water - and fewer ice cubes, please!" - good

"less water - and less ice, please!" - good

"less water - and less ice cubes, please!" - yeuch!

Never mind starting on about "fewer water ..."

The unique words are countable - not an analog quantity, dammit!

I've finished now.

__________________
Wit and sense are but different avatars of the same spirit L. Stephen
Off Topic (Score 5)
The Engineer
Engineering Fields - Engineering Physics - Physics... United States - Member - NY Popular Science - Genetics - Organic Chemistry... Popular Science - Cosmology - New Member

Join Date: Feb 2005
Location: Albany, New York
Posts: 3558
Good Answers: 81
#55
In reply to #54

Re: Heavy Book: CR4 Challenge (08/11/09)

08/19/2009 2:00 PM

Language changes. Besides, in my research, I found that in old English less was used all the time. So maybe we're just reverting back to proper English.

__________________
Cause you said the brains I had went to my head
Off Topic (Score 5)
Guest
#58
In reply to #55

Re: Heavy Book: CR4 Challenge (08/11/09)

08/19/2009 3:43 PM

"maybe we're just reverting back to proper English"

"turning back to" or "going back to" but "reverting to"

Off Topic (Score 5)
The Engineer
Engineering Fields - Engineering Physics - Physics... United States - Member - NY Popular Science - Genetics - Organic Chemistry... Popular Science - Cosmology - New Member

Join Date: Feb 2005
Location: Albany, New York
Posts: 3558
Good Answers: 81
#59
In reply to #58

Re: Heavy Book: CR4 Challenge (08/11/09)

08/19/2009 4:02 PM

I hate being corrected by guests, even though I know the guest is right in this case. It's like "I'm not going to bother to register or sign in, but I am willing to take the time to correct you".

Here's a link that discusses whether it's valid or not (personally I think the guest is right, I should have written "reverting to"):

http://forum.wordreference.com/showthread.php?t=97952

__________________
Cause you said the brains I had went to my head
Off Topic (Score 5)
Guest
#60
In reply to #59

Re: Heavy Book: CR4 Challenge (08/11/09)

08/19/2009 4:50 PM

Didn't intend to upset you.

I chose to post anonymously, since we are "off topic" and these postings are generally frivolous.

Signed SlideRuler

Off Topic (Score 5)
The Engineer
Engineering Fields - Engineering Physics - Physics... United States - Member - NY Popular Science - Genetics - Organic Chemistry... Popular Science - Cosmology - New Member

Join Date: Feb 2005
Location: Albany, New York
Posts: 3558
Good Answers: 81
#62
In reply to #60

Re: Heavy Book: CR4 Challenge (08/11/09)

08/19/2009 9:38 PM

Yeah, posts are a difficult forum in which to convey emotion. I wasn't upset, just irrationally annoyed (and sort of making fun of myself for being irrationally annoyed) that it was a guest post that corrected me.

So hopefully I'm being clear that I'm not at all angry, and in truth, any annoyance I felt at being anonymously corrected is gone since you no longer feel anonymous to me.

And anyway, you were right.

Roger

__________________
Cause you said the brains I had went to my head
Off Topic (Score 5)
Guru

Join Date: Aug 2005
Location: Hemel Hempstead, UK
Posts: 1539
Good Answers: 41
#56
In reply to #54

Re: Heavy Book: CR4 Challenge (08/11/09)

08/19/2009 2:27 PM

"No water, no ice, and more scotch please": wonderful.

__________________
The early bird catches the worm, but, look what happens to the early worm: Alfred E. Neuman
Off Topic (Score 5)
Guru
United Kingdom - Member - Not a New Member Hobbies - Musician - New Member Hobbies - Fishing - New Member

Join Date: May 2006
Location: Reading, Berkshire, UK, 51º 27' 33.83"N, 1º 0' 21.65"W
Posts: 4064
Good Answers: 106
#61
In reply to #56

Re: Heavy Book: CR4 Challenge (08/11/09)

08/19/2009 7:13 PM

Well, I was kinda getting round to that ...

Cheers!

__________________
Wit and sense are but different avatars of the same spirit L. Stephen
Off Topic (Score 5)
Guest
#57
In reply to #52

Re: Heavy Book: CR4 Challenge (08/11/09)

08/19/2009 2:36 PM

"Fewer" vs "less" is a minor offender.

Since this is an engineering forum, I'd like to let you know that my pet (engineering) peeve of American usage is the expression, "high rate of speed". Cars are routinely reported to be "travelling at a high rate of speed", (sometimes for a long period of time). Even worse is when they travel at a higher rate of speed than 80mph.

Active Contributor

Join Date: Jul 2007
Posts: 11
#49

Re: Heavy Book: CR4 Challenge (08/11/09)

08/19/2009 10:26 AM

Why the s~1 and k=1?

Guest
#50
In reply to #49

Re: Heavy Book: CR4 Challenge (08/11/09)

08/19/2009 10:50 AM

Given that "the answer" is demonstrably wrong by a larger factor than could possibly be explained by the statistics it assumes, there can be no good reason for any of it.

Why probe further? If you are looking to recognise why "the answer" is wrong, you need look no further than assigning unweighted statistics to situations where systematic effects dominate.

Score 1 for Good Answer
Lead Editor
United States - Member - New Member Technical Fields - Technical Writing - New Member Popular Science - Paleontology - New Member Hobbies - Musician - New Member Hobbies - Fishing - New Member

Join Date: Dec 2004
Location: Upstate New York
Posts: 1918
Good Answers: 31
#51

Re: Heavy Book: CR4 Challenge (08/11/09)

08/19/2009 11:39 AM

Sorry folks - we posted an incorrect answer to the question yesterday. I just got back from a two week vacation and wasn't as focused as I should have been. The correct answer is now available.

Guru

Join Date: Apr 2007
Posts: 3303
Good Answers: 56
#53
In reply to #51

Re: Heavy Book: CR4 Challenge (08/11/09)

08/19/2009 1:33 PM

Hi Chris

a) I couldn't see the link to the paper.
b) I don't see the evidence for this formula even in the average frequency of words - unless you accept very wide tolerances on relative frequency of occurrence. (N.B. the variance on the ratio must surely increase as the words get rarer)
c) According to this formula, any word that occurs with similar frequency in most documents will be more common in the average of individual documents than in the aggregate of all documents, which is clearly nonsense. If the word does not appear with similar frequency in most documents, you would need to know its rank to make the calculation (even if the formula does work in this case). That means that the method (as applied) fails in principle for all words, not just the rarest and the most common. That in turn means that the basic method is never useful for this purpose, as establishing the rank of a word means you would need to count not only the occurrences of the word you wish to evaluate, but also those of many others.
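The last part of point c) can be illustrated with a minimal Python sketch; the sample sentence and function name are made up. Finding a word's rank means tallying every word in the text, by which time the count for the word itself is already known directly.

```python
import re
from collections import Counter


def rank_and_count(text: str, word: str) -> tuple[int, int, int]:
    """Return (rank, occurrences, total words) for `word` in `text`.

    Rank 1 is the most frequent word. Raises ValueError if `word` is absent.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)                      # tallies every word, not just one
    ranked = [w for w, _ in counts.most_common()]
    return ranked.index(word) + 1, counts[word], len(words)


if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat too"
    rank, occurrences, total = rank_and_count(sample, "the")
    print(f"rank {rank}, {occurrences} of {total} words")
```

Once `counts` exists, the exact share of any word is simply its count divided by the total, with no need for a rank-based estimate.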

Regards

Fyz

Users who posted comments:

44mEurope (1), bioramani (1), Chris Leonard (1), dac1267 (10), English Rose (1), Guest (9), JamesClopton (1), jim35848 (3), JohnDG (4), Kinsale (1), KRW43 (1), Kyzine (1), MARQUE (1), Maths_Physics_Maniac (2), Onthebeach (1), ozzb (1), Physicist? (5), Randall (6), Roger Pink (4), sb (2), SlideRuler (2), Snave (1), sue (4)
