I need to compare documents stored in a DB and come up with a similarity score between 0 and 1.
The method I need to use has to be very simple. Implementing a vanil
In case you're still interested in this problem, I've done something very similar using Lucene Java and Jython. Here are some snippets from my code.
Lucene preprocesses documents and queries using so-called analyzers. This one uses Lucene's built-in n-gram filter:
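# Imports assume the Lucene 3.x package layout (n-gram filter from the contrib analyzers)
from org.apache.lucene.analysis import Analyzer, ASCIIFoldingFilter, LowerCaseTokenizer
from org.apache.lucene.analysis.ngram import NGramTokenFilter

class NGramAnalyzer(Analyzer):
    '''Analyzer that yields n-grams for minlength <= n <= maxlength'''

    def __init__(self, minlength, maxlength):
        self.minlength = minlength
        self.maxlength = maxlength

    def tokenStream(self, field, reader):
        # Split on non-letters and lowercase, then fold accented characters to ASCII
        lower = ASCIIFoldingFilter(LowerCaseTokenizer(reader))
        return NGramTokenFilter(lower, self.minlength, self.maxlength)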
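If you want to see roughly what that analyzer chain produces without setting up Lucene, here is a pure-Python sketch. It is only an approximation: `fold_ascii` and `char_ngrams` are names I made up for illustration, the accent folding via `unicodedata` covers far fewer characters than Lucene's ASCIIFoldingFilter, and the order in which Lucene emits the n-grams may differ between versions.

```python
import re
import unicodedata

def fold_ascii(text):
    """Rough stand-in for ASCII folding: decompose and drop combining accents."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

def char_ngrams(text, minlength, maxlength):
    """Return character n-grams with minlength <= n <= maxlength, per token."""
    grams = []
    # Lowercase and split on non-letters, roughly like LowerCaseTokenizer
    for token in re.findall(r'[^\W\d_]+', text.lower(), re.UNICODE):
        token = fold_ascii(token)
        for n in range(minlength, maxlength + 1):
            # Slide a window of width n over the token
            for start in range(len(token) - n + 1):
                grams.append(token[start:start + n])
    return grams

# 'Lucene' with 2 <= n <= 3 gives:
# ['lu', 'uc', 'ce', 'en', 'ne', 'luc', 'uce', 'cen', 'ene']
print(char_ngrams('Lucene', 2, 3))
```

Tokens shorter than minlength simply yield no n-grams, so very short words drop out of the index.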
To turn a list of ngrams into a Document:
from org.apache.lucene.document import Document, Field

doc = Document()
doc.add(Field('n-grams', ' '.join(ngrams),
              Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES))
To store a document in an index:
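from org.apache.lucene.index import IndexWriter
# NGramAnalyzer needs its n-gram range; minlength and maxlength are whatever values you chose
wr = IndexWriter(index_dir, NGramAnalyzer(minlength, maxlength), True,
                 IndexWriter.MaxFieldLength.LIMITED)
wr.addDocument(doc)
wr.close()  # closing the writer commits the added document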
Building queries is a little more difficult, because Lucene's QueryParser expects a query language with special operators, quotes, etc. This can be circumvented, though (as partly explained here).
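One simple way to keep raw document text from being read as query syntax is to backslash-escape the special characters before parsing. This is not necessarily what the linked explanation does, and in Jython you can just call Lucene's own static `QueryParser.escape`. The sketch below only shows the idea in plain Python: `escape_query` is a hypothetical helper, and the character list is my assumption of what the classic query parser treats as syntax (newer versions may add more, e.g. `/`).

```python
# Characters assumed to be query syntax for Lucene's classic QueryParser
SPECIAL = set('\\+-!():^[]"{}~*?|&')

def escape_query(text):
    """Backslash-escape every query-syntax character in text."""
    return ''.join('\\' + ch if ch in SPECIAL else ch for ch in text)

# The escaped string can then be handed to the query parser as ordinary text
print(escape_query('title:(foo+bar)'))
```

The other route is to skip the parser entirely and assemble the query from individual terms yourself, which avoids escaping altogether.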