Availability Bias and the Letter K

About a year ago I was reading the excellent book Thinking, Fast and Slow by Daniel Kahneman. In one chapter the author mentions an experiment in which participants were asked whether the letter “k” was more likely to appear in the first or third position within a word. The point was to show that humans tend to overweight information that is readily available to them. So instead of making a statistically valid estimate, they simply predict more “k”s at the start, because it’s easier to think of words that start with “k” than words that have “k” as the third letter. The book asserted that this intuition is wrong and that there are in fact about three times as many words with “k” as the third letter.

First Analysis

I tried this on my father and sure enough he believed there were more words with “k” at the start. I showed him the book, but he was still skeptical. So I decided to write a short program to prove the point. It would simply scan through the Unix spell-checking dictionary (I was using Ubuntu at the time) and compare the number of words with “k” in the third position to the number with “k” in the first position. To my surprise, the results showed that “k” was actually more common in the first position than the third. I don’t have my original results from that day. But I recently heard the claim that “k” is most common in the first position again, this time in a YouTube video (warning, silliness). So I decided to write a new script to find out. You can see the dictionary I used at [https://github.com/dwyl/english-words] and my code at [https://github.com/WilliamRitson/letter-positon-frequency].
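
The repository linked above has the script I actually ran. As a rough illustration of the idea only, a minimal sketch of the counting step might look like the following. The function name and the tiny sample word list are made up for this example; a real run would read the words from the dictionary file instead.

```python
from collections import Counter


def count_letter_positions(words, max_position=10):
    """Count how often each letter appears at each position.

    Returns a list of Counters; index 0 holds the counts for the
    first letter of each word, index 2 the third letter, and so on.
    """
    counts = [Counter() for _ in range(max_position)]
    for word in words:
        word = word.strip().lower()
        for index, letter in enumerate(word[:max_position]):
            if letter.isalpha():
                counts[index][letter] += 1
    return counts


# Hypothetical sample data, not the real dictionary.
sample_words = ["kite", "king", "make", "like", "ask", "bake"]
counts = count_letter_positions(sample_words)

# Positions in the tables are 1-indexed, so position 1 is counts[0]
# and position 3 is counts[2].
first_k = counts[0]["k"]
third_k = counts[2]["k"]
print(f"k in position 1: {first_k}, k in position 3: {third_k}")
```

Each distinct word counts once here, no matter how common it is in everyday writing.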

Here were my initial results:

Position	K	…	A	B	C	D	E	F	G
1	3551	…	23801	17747	30613	18262	13618	11670	10307
2	934	…	44414	1995	6421	2215	51017	999	1118
3	1756	…	27211	10218	18636	11840	28407	5336	8668
4	5817	…	27237	7101	16992	12440	39524	5138	8744
5	3967	…	28577	6091	13002	9879	38609	4443	7040
6	2074	…	26818	5185	13982	9579	37590	4464	7913
7	2528	…	25919	3365	10221	9423	37259	2493	7814
8	2273	…	19899	3164	9012	10072	31991	1965	7167
9	1588	…	15861	2556	7134	8097	26989	1073	6139
10	672	…	11768	1957	6122	6287	20509	610	5052

I did it for the full alphabet, but I omitted most of the other letters to make the table fit better. Regardless, you can see that K is more common in the first position than in the third, but less common than in the fourth. This looked to me like it might be an off-by-one error, but I checked my code several times and could not find one. Feel free to look at my Python code on GitHub to make sure I did not make such an error.

Round Two

Having failed to replicate the expected pattern, I decided to look up the original paper that the book referred to. I found it here eventually. The actual quote from the paper is “In fact, a typical text contains twice as many words in which K is in the third position than words that start with K.” (Note: it’s possible that by “third position” they mean the third position in a zero-indexed system. I think this is unlikely, as it would probably confuse the participants of the psychological study as much as it confused me.) As you can see, my methodology was wrong. The claim was not that the dictionary contains more words with “k” as the third letter than as the first, but rather that such words show up more often in a typical text.

So I decided to do a new analysis. This time I would take a large corpus of English text and tokenize it into words. Then I would count how many times each word occurred. Finally, I would tally the position of each letter in each word, weighting it by the number of times that word appeared in the text.
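
The steps above can be sketched roughly as follows. This is a simplified illustration, not the exact code I ran: the function name, the regex-based tokenizer, and the sample sentence are all assumptions made for the example. In particular, splitting on anything that is not a letter is a crude way to tokenize (it breaks “don’t” into “don” and “t”, for instance).

```python
import re
from collections import Counter


def weighted_letter_positions(text, max_position=10):
    """Count letters by position, weighting each word by its frequency."""
    # Crude tokenizer: lowercase the text and keep runs of a-z only.
    words = re.findall(r"[a-z]+", text.lower())
    word_counts = Counter(words)

    position_counts = [Counter() for _ in range(max_position)]
    for word, occurrences in word_counts.items():
        for index, letter in enumerate(word[:max_position]):
            # A word seen 5 times contributes 5 to each of its letters.
            position_counts[index][letter] += occurrences
    return position_counts


# Hypothetical sample text standing in for a real corpus.
sample_text = (
    "I like to know that the king likes cake. "
    "Make the cake like the king."
)
weighted = weighted_letter_positions(sample_text)
weighted_first_k = weighted[0]["k"]
weighted_third_k = weighted[2]["k"]
print(f"k in position 1: {weighted_first_k}, k in position 3: {weighted_third_k}")
```

Counting the distinct words first and multiplying afterwards gives the same totals as walking over every token, but it only has to look inside each distinct word once.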

Here were my results on the Open American National Corpus:

Position	K	…	A	B	C	D	E	F	G
1	122504	…	1634455	600167	734704	473711	367456	538419	280818
2	15663	…	1292515	86295	135800	89789	1751601	547685	49080
3	95182	…	957764	103449	401525	688566	1810750	159238	221103
4	151683	…	426529	87633	274846	312889	1314724	92561	152911
5	76370	…	340149	48873	203342	211056	1029680	40260	161231
6	11435	…	328638	51933	182568	248063	596528	47194	115067
7	10015	…	257082	29521	118141	137173	526662	22176	141796
8	7675	…	131102	15757	84478	150198	365712	11150	89589
9	5411	…	121096	10563	47664	101546	225703	3853	80903
10	2053	…	50816	5168	27412	60018	118078	2115	44215

As you can see, not much changed. K is still more common in the first position than in the third (and still less common than in the fourth).
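
To put numbers on that, here is the quick arithmetic on the K column of the two tables. The counts are copied from the tables; the variable and function names are just for this example. If the paper’s claim held for my data, the ratio of first-position to third-position counts would be around 0.5.

```python
# K counts by position, copied from the two tables in this post.
dictionary_k = {1: 3551, 3: 1756, 4: 5817}
corpus_k = {1: 122504, 3: 95182, 4: 151683}


def first_to_third_ratio(k_counts):
    """Ratio of first-position K counts to third-position K counts."""
    return k_counts[1] / k_counts[3]


dictionary_ratio = first_to_third_ratio(dictionary_k)
corpus_ratio = first_to_third_ratio(corpus_k)
print(f"dictionary: {dictionary_ratio:.2f}")
print(f"corpus: {corpus_ratio:.2f}")
```

Both ratios come out above 1, roughly 2.0 for the dictionary and 1.3 for the corpus. Weighting by word frequency narrowed the gap but did not reverse it.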

Conclusion

So the result of several hours of programming and research was that I still could not replicate the original claim. I don’t by any means claim this as a debunking or rebuttal. It’s entirely possible, and even likely, that I misunderstood the claim, that my code contains an error, or that the data I used is not statistically valid. Regardless, it was an interesting result and I thought I would share it.