How to measure Syntactic Similarity between a query and a document?

99封情书 submitted on 2019-12-08 05:27:10

Question


Is there a way to measure the syntactic similarity between a query (sentence) and a document (a set of sentences)?


Answer 1:


Have you considered using deep linguistic processing tools that involve deep grammars such as HPSG and LFG? If you are looking into feature-based syntactic similarity, take a look at Kenji Sagae and Andrew S. Gordon's work on calculating the syntactic similarity of verbs using PropBank and then clustering the similar verbs to improve an HPSG grammar.

For a simpler approach, I suggest looking at dependency parses and grouping sentences that have the same parse nodes. Or just POS-tag the sentences and compare sentences that have the same POS tags.
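The grouping idea can be sketched roughly as follows. This is only an illustration: the tagged sentences are hand-written placeholders standing in for real tagger output, and `group_by_pos_signature` is a hypothetical helper name, not part of NLTK.

```python
from collections import defaultdict

# Hand-written (token, tag) pairs standing in for real tagger output (assumption).
tagged_sentences = [
    [("This", "DT"), ("is", "VBZ"), ("a", "DT"), ("short", "JJ"), ("sentence", "NN")],
    [("That", "DT"), ("is", "VBZ"), ("the", "DT"), ("same", "JJ"), ("sentence", "NN")],
    [("Dogs", "NNS"), ("bark", "VBP")],
]

def group_by_pos_signature(tagged):
    """Bucket sentences that share exactly the same POS tag sequence."""
    groups = defaultdict(list)
    for sent in tagged:
        # The tuple of tags acts as the sentence's syntactic "signature".
        signature = tuple(tag for _, tag in sent)
        groups[signature].append(" ".join(tok for tok, _ in sent))
    return dict(groups)

groups = group_by_pos_signature(tagged_sentences)
for signature, sentences in groups.items():
    print(signature, sentences)
```

Sentences that land in the same bucket are "the same" only in the strict sense of having identical tag sequences; a single extra adjective puts a sentence in a different group.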

For the sake of a simple example, first download and install NLTK (http://nltk.org/) and the hunpos tagger (http://code.google.com/p/hunpos/). Unzip en_wsj.model.gz and save it in the same directory as your Python script.

import nltk
from nltk.tag.hunpos import HunposTagger
from nltk.tokenize import word_tokenize

s1 = "This is a short sentence"
s2 = "That is the same sentence"

# Load the hunpos model saved next to this script
ht = HunposTagger('en_wsj.model')

# Tag the sentences with HunPos
t1 = ht.tag(word_tokenize(s1))
t2 = ht.tag(word_tokenize(s2))

# Extract only the POS tags
pos1 = [i[1] for i in t1]
pos2 = [j[1] for j in t2]

if pos1 == pos2:
    print("same sentence according to POS tags")
else:
    print("diff sentences according to POS tags")

In an interactive session, the script above gives:

>>> print(pos1)
['DT', 'VBZ', 'DT', 'JJ', 'NN']
>>> print(pos2)
['DT', 'VBZ', 'DT', 'JJ', 'NN']
>>> if pos1 == pos2:
...     print("same sentence according to POS tags")
... else:
...     print("diff sentences according to POS tags")
... 
same sentence according to POS tags

To modify the above code, try:

  • instead of comparing POS tags, use dependency parses
  • instead of a strict list comparison, come up with a statistical method to measure the degree of difference
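As one possible way to replace the strict comparison with a graded score, the sketch below uses `difflib.SequenceMatcher` from the Python standard library on the tag sequences, then scores a query against every sentence of a document. The choice of measure is an assumption made for illustration, not something the answer prescribes; `pos_similarity` and `query_document_similarity` are hypothetical helper names, and the tag lists are placeholders for real tagger output.

```python
from difflib import SequenceMatcher

def pos_similarity(tags_a, tags_b):
    """Return a score in [0, 1]; 1.0 means the tag sequences are identical."""
    return SequenceMatcher(None, tags_a, tags_b).ratio()

def query_document_similarity(query_tags, document_tags):
    """Score the query against each document sentence; return (best, mean)."""
    scores = [pos_similarity(query_tags, sent) for sent in document_tags]
    if not scores:
        return 0.0, 0.0
    return max(scores), sum(scores) / len(scores)

# Placeholder tag sequences (assumed tagger output, not real results).
query = ["DT", "VBZ", "DT", "JJ", "NN"]
document = [
    ["DT", "VBZ", "DT", "JJ", "NN"],
    ["DT", "NN", "VBZ", "JJ"],
    ["NNS", "VBP"],
]

best, mean = query_document_similarity(query, document)
print("best match:", best)
print("mean over document:", mean)
```

Whether the best or the mean score is the right summary depends on the use case: the best score asks whether any sentence in the document resembles the query, while the mean describes the document as a whole.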



Answer 2:


Are you looking for something like Apache Lucene?



Source: https://stackoverflow.com/questions/15187303/how-to-measure-syntactic-similarity-between-a-query-and-a-document
