How to compute similarity between two sentences (syntactical and semantical)

人走茶凉 提交于 2019-12-11 16:55:38

问题


I'm supposed to take two sentences each time and compute if they are similar. By similar I mean, both syntactically and semantically.

INPUT1: Obama signs the law. A new law is signed by Obama.

INPUT2: A Bus is stopped here. A vehicle stops here.

INPUT3: Fire in NY. NY is burnt down.

INPUT4: Fire in NY. 50 died in NY fire.

I don't want to use ontology tree as a soul. I wrote a code to compute Levenshtein distance (LD) between sentences and then decide if the 2nd sentence:

  • can be ignored (INPUT1 and 2),
  • should replace the first sentence (INPUT 3), or
  • store along with the first sentence (INPUT4).

I'm not happy with the code as LD only computes syntactical level (what other methods ?). How can semantic be incorporated (like bus is sort of a vehicle?) .

The code goes here:

%# As the difference is computed, a decision is made on the new event
%# (string 2) to be ignored, to replace existing event (string 1) or to be
%# stored separately. The higher the LD metric, the higher the difference
%# between two strings. Of course, lower difference indices either identical
%# or similar events. However, the higher difference indicates the new event
%# as a fresh event.

%#.........................................................................
%# Calculating the LD between two strings of events.
%#.........................................................................
L1=length(str1)+1;
L2=length(str2)+1;
L=zeros(L1,L2);   %# Initializing the new length.

g=+1;             %# just constant
m=+0;             %# match is cheaper, we seek to minimize
d=+1;             %# not-a-match is more costly.

% do BC's
L(:,1)=([0:L1-1]*g)';
L(1,:)=[0:L2-1]*g;

m4=0;             %# loop invariant
%# Calculating required edits.
for idx=2:L1;
    for idy=2:L2
        if(str1(idx-1)==str2(idy-1))
            score=m;
        else
            score=d;
        end
        m1=L(idx-1,idy-1) + score;
        m2=L(idx-1,idy) + g;
        m3=L(idx,idy-1) + g;
        L(idx,idy)=min(m1,min(m2,m3)); % only minimum edits allowed.
    end
end
%# The LD between two strings.
D=L(L1,L2);

%#....................................................................
%# Making decision on what to do with the new event (string 2).
%#...................................................................
if (D<=4)     %# Distance is so less that string 2 seems identical to string 1.
    store=str1;        %# Hence string 2 is ignored. String 1 remains stored.
elseif (D>=5 && D<=15) %# Distance is larger to be identical but not enough to
    %# make string 2 an individual event.
    store= str2;       %# String 2 is somewhat similar to string 1.
                       %# So, string 1 is replaced with string 2 and stored.
else
    %# For all other distances, string 2 is stored along with string 1.
    store={str1; str2};
end

Any help is appreciated.


回答1:


"Semantically". No simple text-book algorithm for that. Natural language (esp. English) is a very complicated and fickle beast. Let's look at (just a small part of) the provided cases:

INPUT1: Obama signs the law. A new law is signed by Obama.

Signing a law makes it a 'new' law.

INPUT2: A Bus is stopped here. A vehicle stops here.

Need to know a bus is a type if vehicle as well as some sort of time relation. Also, what if the bus did stop but does not normally stop or is no longer stopped? It can be taken several ways.

INPUT3: Fire in NY. NY is burnt down.

Need to know that fires can burn things down.

INPUT4: Fire in NY. 50 died in NY fire.

Need to know that fires can kill things (see next). Need to associated the "news headline" (50 WHAT?) with people. The brain can do this somewhat trivially. Computer programs are not brains.

And I'm no English major :-)



来源:https://stackoverflow.com/questions/3655612/how-to-compute-similarity-between-two-sentences-syntactical-and-semantical

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!