Can we reverse engineer Google’s word correction algorithm given a corpus of misspelled words paired with their corrections?
Because I own the single-word domain name mischievous, and "mischievous" is one of the 100 most misspelled English words, I can analyze some interesting data from Google’s webmaster tools. I pulled out all the misspellings, along with their impressions, that fall within a small Levenshtein distance of the correct spelling. There is a nice academic paper, Learning a Spelling Error Model from Search Query Logs, that I plan to use to explore some of this data in the future.
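As a rough illustration of the filtering step, here is a minimal sketch of the classic dynamic-programming Levenshtein distance. The helper `within_distance` and its `max_distance` threshold are hypothetical names, and the cutoff of 5 is an assumption for the example, not necessarily the one used to build the table.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def within_distance(queries, target, max_distance=5):
    """Hypothetical filter: keep queries close to the target spelling."""
    return [q for q in queries if levenshtein(q, target) <= max_distance]


candidates = ["mischevious", "mischevous", "unrelated search phrase"]
print(within_distance(candidates, "mischievous"))
```

With these inputs the first two queries are kept and the unrelated phrase is dropped.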
Plotting the misspelling data on a log-log chart and fitting a regression shows that the impressions for each misspelling of mischievous, against its rank among all keywords that lead to this blog, follow Zipf’s law. For words with under 10 impressions (ranks >= 83), I estimated the impressions from the fitted line using their rank, because webmaster tools only gives a sampled value when the impressions are greater than 10.
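A fit of this kind can be sketched as an ordinary least-squares line through log10(rank) versus log10(impressions). The sketch below uses only the misspelling rows that have reported impressions (ranks 1 to 58 in the table), so it is an approximation and is not expected to reproduce the fitted values listed for ranks >= 83. The function names `fit_power_law` and `predict` are my own.

```python
import math

# (rank, impressions) pairs copied from the table rows with reported values
observed = [
    (1, 27000), (2, 4500), (3, 700), (6, 500), (7, 500), (13, 170),
    (18, 150), (19, 150), (20, 110), (21, 90), (23, 90), (24, 70),
    (25, 70), (26, 70), (29, 60), (30, 60), (31, 60), (32, 50),
    (33, 35), (35, 35), (47, 16), (48, 16), (53, 12), (54, 12),
    (55, 12), (56, 12), (58, 12),
]


def fit_power_law(points):
    """Least-squares line through (log10 rank, log10 impressions)."""
    xs = [math.log10(rank) for rank, _ in points]
    ys = [math.log10(count) for _, count in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(rank, slope, intercept):
    """Estimated impressions at a given rank under the fitted power law."""
    return 10 ** (intercept + slope * math.log10(rank))


slope, intercept = fit_power_law(observed)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"estimated impressions at rank 83: {predict(83, slope, intercept):.2f}")
```

A negative slope on the log-log scale is what a Zipf-like distribution looks like; the closer the points sit to the line, the better the power-law description.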
Raw Data
You can use this table to gauge your spelling. (I should add the cumulative distribution so you can see what percentile a misspelling places you in.)
rank | query | impressions | levenshtein | similarity |
---|---|---|---|---|
1 | mischievous | 27000.00 | 0 | 1.00 |
2 | mischevious | 4500.00 | 2 | 0.41 |
3 | mischivious | 700.00 | 2 | 0.50 |
6 | michevious | 500.00 | 3 | 0.21 |
7 | mischevous | 500.00 | 1 | 0.64 |
13 | mischiveous | 170.00 | 2 | 0.50 |
18 | mischieveous | 150.00 | 1 | 0.67 |
19 | mischivous | 150.00 | 1 | 0.64 |
20 | michievous | 110.00 | 1 | 0.64 |
21 | mischeivious | 90.00 | 3 | 0.39 |
23 | mischeivous | 90.00 | 2 | 0.50 |
24 | michevous | 70.00 | 2 | 0.38 |
25 | mischievious | 70.00 | 1 | 0.67 |
26 | mischeveous | 70.00 | 2 | 0.41 |
29 | mischeavious | 60.00 | 3 | 0.39 |
30 | mischiefous | 60.00 | 1 | 0.60 |
31 | michivious | 60.00 | 3 | 0.28 |
32 | mischeavous | 50.00 | 2 | 0.50 |
33 | mishevious | 35.00 | 3 | 0.28 |
35 | miscevious | 35.00 | 3 | 0.35 |
47 | mishievous | 16.00 | 1 | 0.64 |
48 | michievious | 16.00 | 2 | 0.41 |
53 | misgevious | 12.00 | 4 | 0.28 |
54 | micheivious | 12.00 | 4 | 0.20 |
55 | mischvious | 12.00 | 3 | 0.44 |
56 | mischiveious | 12.00 | 2 | 0.47 |
58 | mischevios | 12.00 | 3 | 0.28 |
83 | mischevius | 11.15 | 2 | 0.35 |
101 | miscevous | 8.30 | 2 | 0.57 |
113 | micheavous | 7.01 | 3 | 0.28 |
133 | mischeives | 5.48 | 4 | 0.28 |
140 | mischeviuos | 5.08 | 3 | 0.26 |
153 | mischiefious | 4.44 | 2 | 0.56 |
176 | mischeous | 3.60 | 2 | 0.47 |
196 | mechivious | 3.06 | 4 | 0.21 |
218 | miscievious | 2.61 | 2 | 0.41 |
223 | mechevious | 2.52 | 4 | 0.15 |
241 | mischieved | 2.24 | 3 | 0.53 |
262 | myschevious | 1.98 | 3 | 0.20 |
263 | misjevious | 1.96 | 4 | 0.28 |
273 | mischeviouse | 1.86 | 3 | 0.32 |
277 | machivious | 1.82 | 4 | 0.21 |
279 | mischeiveous | 1.80 | 3 | 0.39 |
282 | mischives | 1.77 | 3 | 0.38 |
321 | mischievous? | 1.45 | 1 | 1.00 |
324 | miscchievous | 1.43 | 1 | 0.79 |
333 | mischeifous | 1.38 | 3 | 0.41 |
334 | mistchivious | 1.37 | 3 | 0.32 |
351 | miscievous | 1.27 | 1 | 0.64 |
357 | mischieveious | 1.24 | 2 | 0.63 |
363 | mishcevious | 1.21 | 3 | 0.26 |
371 | mischievous | 1.17 | 2 | 1.00 |
378 | mischievous. | 1.14 | 1 | 1.00 |
408 | micheveous | 1.01 | 3 | 0.21 |
422 | mischevoius | 0.96 | 2 | 0.41 |
430 | mistivious | 0.94 | 4 | 0.28 |
438 | mischievo | 0.91 | 2 | 0.69 |
444 | misgivious | 0.89 | 4 | 0.28 |
483 | michivous | 0.79 | 2 | 0.38 |
510 | mischievous, | 0.72 | 1 | 1.00 |
525 | mystivious | 0.69 | 5 | 0.15 |
528 | myschivious | 0.69 | 3 | 0.26 |
543 | mis chievous | 0.66 | 1 | 0.67 |
603 | meschivious | 0.56 | 3 | 0.26 |
606 | mischievoud | 0.56 | 1 | 0.71 |
626 | mischeviois | 0.53 | 3 | 0.26 |
629 | micheavious | 0.53 | 4 | 0.20 |
635 | mishievious | 0.52 | 2 | 0.41 |
661 | miscivous | 0.49 | 2 | 0.47 |
671 | meschevious | 0.48 | 3 | 0.20 |
676 | miss chivous | 0.47 | 3 | 0.39 |
734 | mischieves | 0.42 | 2 | 0.53 |
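The cumulative distribution mentioned above could be computed along these lines. This is only a sketch: `cumulative_share` is a hypothetical helper, and the sample uses just the first five rows of the table, so the printed shares are relative to that sample rather than to all impressions.

```python
from itertools import accumulate


def cumulative_share(rows):
    """rows: (query, impressions) pairs sorted by impressions, descending.

    Returns (query, cumulative fraction of total impressions) pairs.
    """
    total = sum(count for _, count in rows)
    running = accumulate(count for _, count in rows)
    return [(query, subtotal / total)
            for (query, _), subtotal in zip(rows, running)]


# First five rows of the table, used here only as sample input
sample = [
    ("mischievous", 27000),
    ("mischevious", 4500),
    ("mischivious", 700),
    ("michevious", 500),
    ("mischevous", 500),
]

for query, share in cumulative_share(sample):
    print(f"{query:<14} {share:.1%}")
```

Each printed share is the fraction of sampled impressions accounted for by that spelling and every more common one above it.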