About a year ago I was reading the excellent book Thinking, Fast and Slow by Daniel Kahneman. In one chapter the author mentions an experiment in which participants were asked whether the letter “k” was more likely to appear in the first or third position within a word. The point was to show that humans tend to overemphasize information that is readily available to them. So instead of making a genuinely valid statistical prediction, they would simply predict more “k”s at the start because it’s easier to think of words that start with “k” than words that have “k” as the third letter. The book asserted this was not true and that there were in fact about three times as many words with “k” as the third letter.
First Analysis
I tried this on my father and, sure enough, he believed there were more words with “k” at the start. I showed him the book but he was still skeptical, so I decided to write a brief program to prove the point. It would simply scan through the Unix spell-checking dictionary (I was using Ubuntu at the time) and compare the number of words with “k” in the third position to the number with “k” in the first position. To my surprise, the results showed that “k” was actually more common in the first position than the third. I don’t have my original results from that day, but I recently heard the claim that “k” is most common in the first position again in a YouTube video (warning, silliness), so I decided to write a new script to find out. You can see the dictionary I used at [https://github.com/dwyl/english-words] and my code at [https://github.com/WilliamRitson/letter-positon-frequency].
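For readers who want to try the idea without cloning anything, here is a minimal sketch of the counting step. It is not the code from the repository linked above: the function name `position_counts` and the tiny `sample_words` list are made up for illustration, and a real run would load the word list from a file instead.

```python
from collections import Counter


def position_counts(words, letter):
    """Count how often `letter` appears at each 1-indexed position across `words`."""
    target = letter.lower()
    counts = Counter()
    for word in words:
        # enumerate(..., start=1) makes the first letter position 1, not 0,
        # which is the convention used in the tables in this post.
        for index, char in enumerate(word.lower(), start=1):
            if char == target:
                counts[index] += 1
    return counts


# Hypothetical five-word sample, standing in for a full dictionary file.
sample_words = ["kite", "bake", "ask", "knock", "lake"]
counts = position_counts(sample_words, "k")
print("first position:", counts[1])
print("third position:", counts[3])
```

Each word counts once here no matter how common it is in real writing, which is exactly the property of a dictionary-based count.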
Here were my initial results:
Position | K | … | A | B | C | D | E | F | G |
---|---|---|---|---|---|---|---|---|---|
1 | 3551 | … | 23801 | 17747 | 30613 | 18262 | 13618 | 11670 | 10307 |
2 | 934 | … | 44414 | 1995 | 6421 | 2215 | 51017 | 999 | 1118 |
3 | 1756 | … | 27211 | 10218 | 18636 | 11840 | 28407 | 5336 | 8668 |
4 | 5817 | … | 27237 | 7101 | 16992 | 12440 | 39524 | 5138 | 8744 |
5 | 3967 | … | 28577 | 6091 | 13002 | 9879 | 38609 | 4443 | 7040 |
6 | 2074 | … | 26818 | 5185 | 13982 | 9579 | 37590 | 4464 | 7913 |
7 | 2528 | … | 25919 | 3365 | 10221 | 9423 | 37259 | 2493 | 7814 |
8 | 2273 | … | 19899 | 3164 | 9012 | 10072 | 31991 | 1965 | 7167 |
9 | 1588 | … | 15861 | 2556 | 7134 | 8097 | 26989 | 1073 | 6139 |
10 | 672 | … | 11768 | 1957 | 6122 | 6287 | 20509 | 610 | 5052 |
I ran it for the full alphabet but show only K and A through G here to make the table fit better. Regardless, you can see that K is more common in the first position than in the third, but less common in the first than in the fourth. This looked like it might be an off-by-one error to me, but I checked my code several times and could not find one. Feel free to look at my Python code on GitHub to make sure I did not make such an error.
Round Two
Having failed to replicate the expected pattern, I decided to look up the original paper that the book referred to. I found it here eventually. The actual quote from the paper is “In fact, a typical text contains twice as many words in which K is in the third position than words that start with K.” (Note: it’s possible that by “third position” they mean the third position in a zero-indexed system. I think this is unlikely, as it would probably have confused the participants of the psychological study as much as it confused me.) But as you can see, my methodology was wrong. The claim was not that there are more words in the dictionary with “k” as the third letter than the first, but rather that such words show up more often in a typical text.
So I decided to do a new analysis. This time I would take a large corpus of English text and tokenize it into words. Then I would count how many times each word occurred. Finally, I would analyze which position each letter occurred in, weighting each occurrence by the number of times the word appeared in the text.
Here were my results on the Open American National Corpus:
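The three steps above (tokenize, count each word, weight the letter positions by word frequency) can be sketched roughly as follows. This is an illustration under assumptions rather than the script I actually ran: the regex tokenizer that keeps only runs of the letters a to z, the function name `weighted_position_counts`, and the two-sentence `sample_text` are all made up for the example.

```python
import re
from collections import Counter


def weighted_position_counts(text, letter):
    """Count 1-indexed positions of `letter`, weighted by how often each word occurs."""
    target = letter.lower()
    # Assumed tokenizer: lowercase the text and keep runs of a-z only.
    tokens = re.findall(r"[a-z]+", text.lower())
    word_freq = Counter(tokens)

    counts = Counter()
    for word, freq in word_freq.items():
        for index, char in enumerate(word, start=1):
            if char == target:
                # A word seen `freq` times contributes `freq` to this position.
                counts[index] += freq
    return counts


# Hypothetical sample text, standing in for a real corpus.
sample_text = "The king asked the baker to bake a cake. The king liked the cake."
weighted = weighted_position_counts(sample_text, "k")
print("first position:", weighted[1])
print("third position:", weighted[3])
```

The only difference from a dictionary count is the `+= freq` line: a word that appears a thousand times in the text counts a thousand times.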
Position | K | … | A | B | C | D | E | F | G |
---|---|---|---|---|---|---|---|---|---|
1 | 122504 | … | 1634455 | 600167 | 734704 | 473711 | 367456 | 538419 | 280818 |
2 | 15663 | … | 1292515 | 86295 | 135800 | 89789 | 1751601 | 547685 | 49080 |
3 | 95182 | … | 957764 | 103449 | 401525 | 688566 | 1810750 | 159238 | 221103 |
4 | 151683 | … | 426529 | 87633 | 274846 | 312889 | 1314724 | 92561 | 152911 |
5 | 76370 | … | 340149 | 48873 | 203342 | 211056 | 1029680 | 40260 | 161231 |
6 | 11435 | … | 328638 | 51933 | 182568 | 248063 | 596528 | 47194 | 115067 |
7 | 10015 | … | 257082 | 29521 | 118141 | 137173 | 526662 | 22176 | 141796 |
8 | 7675 | … | 131102 | 15757 | 84478 | 150198 | 365712 | 11150 | 89589 |
9 | 5411 | … | 121096 | 10563 | 47664 | 101546 | 225703 | 3853 | 80903 |
10 | 2053 | … | 50816 | 5168 | 27412 | 60018 | 118078 | 2115 | 44215 |
As you can see, not much changed. K is still more common in the first position than in the third (although still less common in the first than in the fourth).
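To put rough numbers on the comparison, the snippet below divides the K counts from the two tables above by the first-position count. The quoted claim from the paper would correspond to a third-to-first ratio of about 2. The fourth-position ratio is included only because position 4 is what “third position” would mean under the zero-indexed reading I mentioned earlier. The dictionary layout and the helper name `ratio_to_first` are just for illustration.

```python
# K counts copied from the two tables in this post (positions are 1-indexed).
k_counts = {
    "dictionary": {1: 3551, 3: 1756, 4: 5817},
    "corpus": {1: 122504, 3: 95182, 4: 151683},
}


def ratio_to_first(counts, position):
    """Return how many times more common K is at `position` than at position 1."""
    return counts[position] / counts[1]


for name, counts in k_counts.items():
    third = ratio_to_first(counts, 3)
    fourth = ratio_to_first(counts, 4)
    print(f"{name}: third/first = {third:.2f}, fourth/first = {fourth:.2f}")
```

In both data sets the third-to-first ratio comes out below 1, and even the fourth-to-first ratio falls short of 2.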
Conclusion
So the result of several hours of programming and research was that I still could not replicate the original claim. I don’t by any means claim this as a debunking or rebuttal. It’s entirely possible, and even likely, that I misunderstood the claim, that my code contains an error, or that the data I used is not statistically valid. Regardless, it was an interesting result and I thought I would share it.