I have 2 lists of over a million names with slightly different naming conventions. The goal here is to match the records that are similar, using a 95% confidence threshold.
You have to index, or normalize, the strings to avoid the O(n^2) pairwise comparison. Basically, you map each string to a normal form and build a reverse dictionary that links each normal form to all the words that produce it.
Let's assume that the normal forms of 'world' and 'word' are the same, say 'wrd'. So, first build a reverse dictionary of Normalized -> [word1, word2, word3], e.g. go from:
"world" <-> Normalized('wrd')
"word" <-> Normalized('wrd')
to:
Normalized('wrd') -> ["world", "word"]
There you go: every entry (list) in the Normalized dict that holds more than one value is a group of matched words.
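As a rough illustration of the reverse dictionary, here is a minimal Python sketch. The `normalize` function is purely an assumption for demonstration (lowercase, drop punctuation, sort the tokens), and the sample names are made up; swap in whatever normal form suits your data. Because the question has two lists, the sketch keeps one index per list and reports only the normal forms that occur in both, which is a small variation on the single dict described above.

```python
import re
from collections import defaultdict


def normalize(name):
    # Hypothetical normal form: lowercase, strip punctuation, sort the tokens.
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(tokens))


def build_index(names):
    # Reverse dictionary: normal form -> set of original spellings.
    index = defaultdict(set)
    for name in names:
        index[normalize(name)].add(name)
    return index


def match(list_a, list_b):
    index_a = build_index(list_a)
    index_b = build_index(list_b)
    # Only normal forms present in both indexes give candidate matches.
    return {
        key: (sorted(index_a[key]), sorted(index_b[key]))
        for key in index_a.keys() & index_b.keys()
    }


# Made-up sample data, for illustration only.
list_a = ["John Smith", "Mary-Ann Jones", "Bob Stone"]
list_b = ["smith, john", "JONES Mary Ann", "Alice Wu"]
matches = match(list_a, list_b)
```

Each list is scanned once and every lookup is a dictionary access, so the work grows roughly linearly with the number of names rather than with the number of pairs.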
The normalization algorithm depends on the data, i.e. the words. Consider one of the many: