About a year ago I was reading the excellent book Thinking, Fast and Slow by Daniel Kahneman. In one chapter the author mentions an experiment in which participants were asked whether the letter “k” was more likely to appear in the first or third position within a word. The point was to show that humans tend to overemphasize information that is readily available to them. Instead of making a statistically sound estimate, participants would simply predict more “k”s at the start, because it’s easier to think of words that start with “k” than words that have “k” as the third letter. The book asserted that this prediction was wrong and that there were in fact about three times as many words with “k” as the third letter.

First Analysis

I tried this on my father and, sure enough, he believed there were more words with “k” at the start. I showed him the book, but he was still skeptical, so I decided to write a brief program to prove the point. It would simply scan through the Unix spell-checking dictionary (I was using Ubuntu at the time) and compare the number of words with “k” in the third position to the number with “k” in the first position. To my surprise, the results showed that “k” was actually more common in the first position than the third. I don’t have my original results from that day, but I recently heard the claim that “k” is most common in the first position again in a YouTube video (warning, silliness), so I decided to write a new script to find out. You can see the dictionary I used at [https://github.com/dwyl/english-words] and my code at [https://github.com/WilliamRitson/letter-positon-frequency].

Here were my initial results:

Position K A B C D E F G
1 3551 23801 17747 30613 18262 13618 11670 10307
2 934 44414 1995 6421 2215 51017 999 1118
3 1756 27211 10218 18636 11840 28407 5336 8668
4 5817 27237 7101 16992 12440 39524 5138 8744
5 3967 28577 6091 13002 9879 38609 4443 7040
6 2074 26818 5185 13982 9579 37590 4464 7913
7 2528 25919 3365 10221 9423 37259 2493 7814
8 2273 19899 3164 9012 10072 31991 1965 7167
9 1588 15861 2556 7134 8097 26989 1073 6139
10 672 11768 1957 6122 6287 20509 610 5052

I ran it for the full alphabet, but to make the table fit better it shows only K and the letters A through G. Regardless, you can see that K is more common in the first position than the third, but less common in the first than in the fourth. This looked like it might be an off-by-one error to me, but I checked my code several times and could not find one. Feel free to look at my Python code on GitHub to make sure I did not make such an error.
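The dictionary count described above can be sketched roughly as follows. This is a minimal illustration, not the script from the repository: the function name `position_counts`, the `max_position` cutoff, and the tiny sample word list are all assumptions made for the example. Positions are numbered from 1 so that "third position" means the third letter, which is the reading used in the table.

```python
from collections import Counter, defaultdict


def position_counts(words, max_position=10):
    """For each letter, count how many words have it at each 1-indexed position.

    Hypothetical helper for illustration; every word counts once,
    as in a plain dictionary scan.
    """
    counts = defaultdict(Counter)  # counts[letter][position] -> number of words
    for word in words:
        # enumerate(..., start=1) keeps positions 1-indexed, which guards
        # against the off-by-one mistake discussed above.
        for position, letter in enumerate(word.lower()[:max_position], start=1):
            if letter.isalpha():
                counts[letter][position] += 1
    return counts


# Tiny made-up word list standing in for the real dictionary file.
sample_words = ["kite", "king", "lake", "make", "ask", "bike"]
sample_counts = position_counts(sample_words)
print("k first:", sample_counts["k"][1], "k third:", sample_counts["k"][3])
```

To run it on a real word list, one would read the file line by line and pass the stripped lines in as `words`.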

Round Two

Having failed to replicate the expected pattern, I decided to look up the original paper that the book referred to. I found it here eventually. The actual quote from the paper is “In fact, a typical text contains twice as many words in which K is in the third position than words that start with K.” (Note: it’s possible that by “third position” they mean the third position in a zero-indexed system, which would be the fourth letter. I think this is unlikely, as it would probably confuse the participants of the psychological study as much as it confused me.) But as you can see, my methodology was wrong. The claim was not that there are more words in the dictionary with “k” as the third letter than the first, but rather that these words show up more often in a typical text.

So I decided to do a new analysis. This time I would take a large corpus of English text and tokenize it into words. Then I would count how many times each word occurred. Finally, I would analyze what position each letter occurred in, weighting it by the number of times the word appeared in the text.

Here were my results on the Open American National Corpus:

Position K A B C D E F G
1 122504 1634455 600167 734704 473711 367456 538419 280818
2 15663 1292515 86295 135800 89789 1751601 547685 49080
3 95182 957764 103449 401525 688566 1810750 159238 221103
4 151683 426529 87633 274846 312889 1314724 92561 152911
5 76370 340149 48873 203342 211056 1029680 40260 161231
6 11435 328638 51933 182568 248063 596528 47194 115067
7 10015 257082 29521 118141 137173 526662 22176 141796
8 7675 131102 15757 84478 150198 365712 11150 89589
9 5411 121096 10563 47664 101546 225703 3853 80903
10 2053 50816 5168 27412 60018 118078 2115 44215

As you can see, not much changed. K is still more common in the first position than the third (although still less common in the first than in the fourth).
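The frequency-weighted version of the count can be sketched like this. Again, this is an illustration under stated assumptions rather than the code that produced the table: the regex tokenizer (runs of the letters a to z), the function name `weighted_position_counts`, and the sample sentence are all made up for the example. The only change from a plain dictionary scan is that each word contributes its number of occurrences in the text instead of 1.

```python
import re
from collections import Counter, defaultdict


def weighted_position_counts(text, max_position=10):
    """Count letters by 1-indexed position, weighted by word frequency in text.

    Hypothetical helper; tokenization is a simple assumption (lowercase a-z runs).
    """
    # Step 1: tokenize and count how many times each distinct word occurs.
    word_freq = Counter(re.findall(r"[a-z]+", text.lower()))

    # Step 2: add each word's frequency to the tally for every letter position.
    counts = defaultdict(Counter)  # counts[letter][position] -> weighted count
    for word, freq in word_freq.items():
        for position, letter in enumerate(word[:max_position], start=1):
            counts[letter][position] += freq
    return counts


# Made-up sample text standing in for a real corpus.
sample_text = "The king asked the baker to make a cake. The king likes cake."
weighted = weighted_position_counts(sample_text)
print("k first:", weighted["k"][1], "k third:", weighted["k"][3])
```

In this toy sentence "cake" and "king" each appear twice, so they count double, which is exactly the effect a dictionary-only scan cannot capture.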

Conclusion

So the result of several hours of programming and research was that I still could not replicate the original claim. I don’t by any means claim this as a debunking or rebuttal. It’s entirely possible, even likely, that I misunderstood the claim, that my code contains an error, or that the data I used is not statistically valid. Regardless, it was an interesting result and I thought I would share it.