A friend just asked how to do city/state lookup on input strings. I've used metaphones and Levenshtein distance in the past, but that seems like overkill. Using an n-gram index is a nice and easy solution.
easy_install ngram
Build a file with all the city and state names, one per line, and save it as citystate.data:
Redwood City CA
Redwood VA
etc.
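If your city list lives in code or comes from another source, a few lines of Python will write the file. This is only a sketch: the CITY_STATES sample and the write_city_state_file helper are made up for illustration, and the "City ST" layout without a comma is my assumption, chosen to match the entries shown in the search results.

```python
import os
import tempfile

# Hypothetical sample data; substitute your own full city/state list.
CITY_STATES = [("Redwood City", "CA"), ("Redwood", "VA"), ("Redwood", "NY")]

def write_city_state_file(pairs, path):
    """Write one 'City ST' entry per line."""
    with open(path, "w") as handle:
        for city, state in pairs:
            handle.write("%s %s\n" % (city, state))

# Written to a temp directory here so the sketch does not clobber a real file.
path = os.path.join(tempfile.mkdtemp(), "citystate.data")
write_city_state_file(CITY_STATES, path)

# Read it back the same way the lookup code does: one stripped line per item.
with open(path) as handle:
    lines = [line.strip() for line in handle]
print(lines)
```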
Experiment (the 0.2 threshold is a little lax):
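import string
import ngram

# Index every line of the data file by its trigrams. Items and queries are
# both lowercased so that matching is case-insensitive.
cityStateParser = ngram.NGram(
    items=(line.strip() for line in open('citystate.data')),
    N=3,
    iconv=string.lower,
    qconv=string.lower,
    threshold=.2,
)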
Example:
cityStateParser.search('redwood')
[('Redwood VA', 0.5),
 ('Redwood NY', 0.5),
 ('Redwood MS', 0.5),
 ('Redwood City CA', 0.36842105263157893),
 ... ]
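To see where those scores come from, here is a rough pure-Python sketch of trigram matching. It is not the ngram package's code. The "$" padding and the shared-over-total formula are my assumptions, picked because they reproduce the 0.5 and 0.368 scores in the example; the library itself may count repeated n-grams differently. The function names and the sample list are made up for illustration.

```python
def trigrams(text, n=3, pad_char="$"):
    """Return the set of n-grams of text, lowercased and padded on both ends."""
    pad = pad_char * (n - 1)
    padded = pad + text.lower() + pad
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def similarity(query, item):
    """Shared n-grams divided by all distinct n-grams found in either string."""
    q, t = trigrams(query), trigrams(item)
    return len(q & t) / float(len(q | t))

def search(query, items, threshold=0.2):
    """Return (item, score) pairs scoring at least threshold, best first."""
    scored = [(item, similarity(query, item)) for item in items]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: -pair[1])

# Hypothetical sample items standing in for citystate.data.
cities = ["Redwood VA", "Redwood City CA", "Portland OR"]
results = search("redwood", cities)
for item, score in results:
    print(item, round(score, 3))
```

"redwood" and "Redwood VA" share 7 trigrams out of 14 distinct ones, hence 0.5. "Redwood City CA" shares the same 7 but has 19 distinct trigrams in total, so the longer name is penalized.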
Notes: Because these are n-grams, you might get overmatches when the state's letters also form part of an n-gram in the city name, e.g. a search for "washington" would rank "Washington IN" with a better score than "Washington OK".
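The overmatch is easy to reproduce with a hand-rolled trigram score. As before, this is a sketch under my own assumptions ("$" padding, shared divided by total distinct trigrams), not the ngram package's implementation. The city_only_similarity helper is a hypothetical workaround that assumes the state is always the last space-separated token of each item.

```python
def trigrams(text):
    """Set of lowercase trigrams, padded with '$$' on both ends."""
    padded = "$$" + text.lower() + "$$"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams in either string."""
    x, y = trigrams(a), trigrams(b)
    return len(x & y) / float(len(x | y))

def city_only_similarity(query, item):
    """Score the query against the city part only, dropping the state token."""
    # Assumption: the state is the last space-separated token of the item.
    city = item.rsplit(" ", 1)[0]
    return similarity(query, city)

for item in ("Washington IN", "Washington OK"):
    print(item,
          round(similarity("washington", item), 3),
          round(city_only_similarity("washington", item), 3))
```

With the full string, "Washington IN" picks up one extra shared trigram from the state and outscores "Washington OK". Scoring the city part alone makes the two tie, at the cost of ignoring any state the user typed in the query.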
You might also want to read Using Superimposed Coding Of N-Gram Lists For Efficient Inexact Matching (PDF).
If this works for you, consider giving me a vote on StackOverflow.com