Question
Is there a way to measure the syntactic similarity between a query (sentence) and a document (a set of sentences)?
Answer 1:
Have you considered deep linguistic processing tools that involve deep grammars such as HPSG and LFG? If you're looking into feature-based syntactic similarity, take a look at Kenji Sagae and Andrew S. Gordon's work on calculating the syntactic similarity of verbs using PropBank and then clustering the similar verbs to improve an HPSG grammar.
For a simpler approach, I suggest looking at dependency parses and grouping sentences that share the same parse nodes. Or just POS-tag the sentences and compare those with the same POS tag sequence.
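The grouping idea can be sketched with the standard library alone. This is a minimal illustration, not a definitive implementation: the function name `group_by_pos_signature` is hypothetical, and the input is assumed to be sentences that some tagger has already turned into `(word, tag)` pairs (the pairs below are written by hand).

```python
from collections import defaultdict


def group_by_pos_signature(tagged_sentences):
    """Group sentences whose POS tag sequences are exactly identical.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    Returns a dict mapping a tuple of tags to the sentences that share it.
    """
    groups = defaultdict(list)
    for tagged in tagged_sentences:
        # The tuple of tags acts as the sentence's syntactic "signature".
        signature = tuple(tag for _word, tag in tagged)
        words = " ".join(word for word, _tag in tagged)
        groups[signature].append(words)
    return dict(groups)


# Hand-written (word, tag) pairs standing in for real tagger output.
tagged_sentences = [
    [("This", "DT"), ("is", "VBZ"), ("a", "DT"), ("short", "JJ"), ("sentence", "NN")],
    [("That", "DT"), ("is", "VBZ"), ("the", "DT"), ("same", "JJ"), ("sentence", "NN")],
    [("Dogs", "NNS"), ("bark", "VBP")],
]

groups = group_by_pos_signature(tagged_sentences)
for signature, sentences in groups.items():
    print(signature, sentences)
```

Exact-signature grouping is strict: two sentences that differ by a single tag land in different groups, so it works best as a first pass.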
For the sake of a simple example, first download and install NLTK (http://nltk.org/) and the hunpos tagger (http://code.google.com/p/hunpos/). Unzip en_wsj.model.gz and save the model in the same directory as your Python script.
from nltk.tag.hunpos import HunposTagger
from nltk.tokenize import word_tokenize

# Note: word_tokenize requires NLTK's Punkt tokenizer data to be installed.
s1 = "This is a short sentence"
s2 = "That is the same sentence"

# Load the English model unzipped from en_wsj.model.gz
ht = HunposTagger('en_wsj.model')

# Tag the sentences with HunPos; each result is a list of (word, tag) pairs
t1 = ht.tag(word_tokenize(s1))
t2 = ht.tag(word_tokenize(s2))

# Extract only the POS tags
pos1 = [i[1] for i in t1]
pos2 = [j[1] for j in t2]

# Strict comparison: the two tag sequences must be identical
if pos1 == pos2:
    print("same sentence according to POS tags")
else:
    print("diff sentences according to POS tags")
The script above outputs:
>>> print(pos1)
['DT', 'VBZ', 'DT', 'JJ', 'NN']
>>> print(pos2)
['DT', 'VBZ', 'DT', 'JJ', 'NN']
>>> if pos1 == pos2:
...     print("same sentence according to POS tags")
... else:
...     print("diff sentences according to POS tags")
...
same sentence according to POS tags
To modify the above code, try:
- instead of comparing POS tags, use dependency parses
- instead of a strict list comparison, use a statistical measure of how much the two tag sequences differ
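As one possible take on the second suggestion, the strict `==` test could be replaced by a graded score. The sketch below uses `difflib.SequenceMatcher` from the Python standard library, which is a choice made here for illustration and not something the answer prescribes. The function names `pos_similarity` and `query_document_similarity` are hypothetical, and the tag lists are written by hand in place of real tagger output. To handle a query against a document, it scores the query against each sentence and reports the best and the mean score.

```python
from difflib import SequenceMatcher


def pos_similarity(tags_a, tags_b):
    """Return a score in [0, 1]; 1.0 means the tag sequences are identical."""
    # autojunk=False disables difflib's heuristic for very long sequences,
    # so every tag is always considered.
    return SequenceMatcher(None, tags_a, tags_b, autojunk=False).ratio()


def query_document_similarity(query_tags, document_tags):
    """Compare a query with every sentence of a document.

    query_tags: list of POS tags for the query sentence.
    document_tags: list of sentences, each a list of POS tags.
    Returns (best score, mean score); (0.0, 0.0) for an empty document.
    """
    scores = [pos_similarity(query_tags, sent) for sent in document_tags]
    if not scores:
        return 0.0, 0.0
    return max(scores), sum(scores) / len(scores)


# Hand-written tag sequences standing in for real tagger output.
query = ['DT', 'VBZ', 'DT', 'JJ', 'NN']
document = [
    ['DT', 'VBZ', 'DT', 'JJ', 'NN'],
    ['PRP', 'VBD', 'DT', 'NN'],
]

best, mean = query_document_similarity(query, document)
print("best:", best)
print("mean:", mean)
```

Whether the best or the mean score is more useful depends on the task: the best score answers "does any sentence in the document look like the query?", while the mean describes the document as a whole.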
Answer 2:
Are you looking for something like Apache Lucene?
Source: https://stackoverflow.com/questions/15187303/how-to-measure-syntactic-similarity-between-a-query-and-a-document