Simple implementation of N-Gram, tf-idf and Cosine similarity in Python

逝去的感伤 2020-11-28 17:58

I need to compare documents stored in a DB and come up with a similarity score between 0 and 1.

The method I need to use has to be very simple. Implementing a vanil

5 answers
  •  星月不相逢
    2020-11-28 18:37

    In case you're still interested in this problem, I've done something very similar using Lucene Java and Jython. Here are some snippets from my code.

    Lucene preprocesses documents and queries using so-called analyzers. This one uses Lucene's built-in n-gram filter:

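    # Package paths assume Lucene 2.9/3.0 (n-gram filter from the contrib analyzers); adjust for your version.
    from org.apache.lucene.analysis import Analyzer, ASCIIFoldingFilter, LowerCaseTokenizer
    from org.apache.lucene.analysis.ngram import NGramTokenFilter

    class NGramAnalyzer(Analyzer):
        '''Analyzer that yields n-grams for minlength <= n <= maxlength'''
        def __init__(self, minlength, maxlength):
            self.minlength = minlength
            self.maxlength = maxlength
        def tokenStream(self, field, reader):
            # Lowercase and split on non-letters, then fold accented characters to ASCII.
            lower = ASCIIFoldingFilter(LowerCaseTokenizer(reader))
            # Emit every n-gram of each token within the configured length range.
            return NGramTokenFilter(lower, self.minlength, self.maxlength)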
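
For readers without Lucene at hand, the output of such an analyzer can be approximated in plain Python. This is only a rough sketch, not Lucene's implementation: the function name `char_ngrams` is hypothetical, accent stripping via `unicodedata` stands in for ASCII folding, and the order in which n-grams are emitted may differ from Lucene's.

```python
import re
import unicodedata


def char_ngrams(text, minlength, maxlength):
    """Return character n-grams of each word, for minlength <= n <= maxlength."""
    # Rough stand-in for ASCII folding: decompose and drop non-ASCII marks.
    folded = unicodedata.normalize("NFKD", text)
    folded = folded.encode("ascii", "ignore").decode("ascii")
    # Rough stand-in for LowerCaseTokenizer: lowercase runs of letters.
    words = re.findall(r"[a-z]+", folded.lower())
    grams = []
    for word in words:
        for n in range(minlength, maxlength + 1):
            # Slide a window of width n across the word.
            for start in range(len(word) - n + 1):
                grams.append(word[start:start + n])
    return grams


print(char_ngrams("Hello world", 2, 3))
```

Words shorter than `minlength` contribute no n-grams, so very short tokens simply drop out of the comparison.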
    

    To turn a list of ngrams into a Document:

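    from org.apache.lucene.document import Document, Field

    # ngrams is a list of n-gram strings; store them in one analyzed field with term vectors.
    doc = Document()
    doc.add(Field('n-grams', ' '.join(ngrams),
            Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES))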
    

    To store a document in an index:

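    from org.apache.lucene.index import IndexWriter

    # NGramAnalyzer requires both lengths; 2 and 4 are example values only.
    wr = IndexWriter(index_dir, NGramAnalyzer(2, 4), True,
                     IndexWriter.MaxFieldLength.LIMITED)
    wr.addDocument(doc)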
    

    Building queries is a little more difficult, because Lucene's QueryParser expects a query language with special operators, quotes, etc. This can be circumvented, though (as partly explained here).
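
Independently of Lucene, the score the question asks for (a value between 0 and 1) can be sketched in plain Python 3 by weighting n-gram counts with tf-idf and comparing the resulting vectors by cosine similarity. This is not the Lucene code described above and it does not reproduce Lucene's scoring; the function names `tfidf_vectors` and `cosine` are hypothetical, and the smoothed idf formula is just one common variant, chosen here so that all weights stay positive.

```python
import math
from collections import Counter


def tfidf_vectors(docs_ngrams):
    """Map each document's n-gram list to a dict of tf-idf weights."""
    n_docs = len(docs_ngrams)
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter()
    for grams in docs_ngrams:
        df.update(set(grams))
    vectors = []
    for grams in docs_ngrams:
        tf = Counter(grams)
        # Smoothed idf stays positive even for n-grams found in every document.
        vectors.append({g: count * (math.log((1 + n_docs) / (1 + df[g])) + 1)
                        for g, count in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b[g] for g, w in a.items() if g in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Toy input: each document is already a list of n-grams.
docs = [["he", "el", "ll", "lo"], ["he", "el", "lp"], ["wo", "or", "rl", "ld"]]
vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]))  # shares "he" and "el": strictly between 0 and 1
print(cosine(vectors[0], vectors[2]))  # no shared n-grams: 0.0
```

Because every weight is non-negative, the cosine always falls in the range 0 to 1, which matches the requirement in the question.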
