Question
In gensim's documentation, window size is defined as:
window is the maximum distance between the current and predicted word within a sentence.
which should mean that, when looking at context, it doesn't go beyond the sentence boundary, right?
What I did was create a document with several thousand tweets, select a word (q1), and then find the most similar words to q1 (using model.most_similar('q1')). But if I randomly shuffle the tweets in the input document and repeat the same experiment (without changing the word2vec parameters), I get a different set of most_similar words for q1.
I can't really understand why that happens if all it looks at is sentence-level information. Can anyone explain this?
EDIT: added model parameters and a graph
Model parameters used:
model1 = word2vec.Word2Vec(sents1, size=100, window=5, min_count=5, iter=n_iter, sg=0)
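Concretely, the comparison looks roughly like the sketch below. This is a minimal sketch, not my exact script: sents1 is assumed to be a list of tokenized tweets (lists of strings), sents2 and model2 are names I'm introducing here for the shuffled corpus and its model, and it follows the older gensim API used above (size/iter, and most_similar called on the model; newer gensim renamed these to vector_size/epochs and moved most_similar to model.wv).

import random
from gensim.models import word2vec

n_iter = 10  # also tried 1 and 100

# sents1: list of tokenized tweets, e.g. [['word', 'word', ...], ...] (assumed from above)
model1 = word2vec.Word2Vec(sents1, size=100, window=5, min_count=5,
                           iter=n_iter, sg=0)

# Shuffle only the order of the tweets; each tweet's contents stay intact.
sents2 = list(sents1)
random.shuffle(sents2)
model2 = word2vec.Word2Vec(sents2, size=100, window=5, min_count=5,
                           iter=n_iter, sg=0)

# Same query word, same parameters, different sentence order.
print(model1.most_similar('q1', topn=10))
print(model2.most_similar('q1', topn=10))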
Graph:
To draw the graph, I ran word2vec with the above parameters on the original document (D) and the shuffled document (D'), took the top 10 or 20 (the two bars) most_similar('q') words for a specific query word q, and calculated the Jaccard similarity score between the two sets of words for iter=1,10,100.
It seems that as the number of iterations increases, there are fewer and fewer words in common between the two sets obtained from running word2vec on D and D'.
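The Jaccard comparison itself amounts to roughly the following sketch; top_words and jaccard are helper names I'm making up here for illustration, and model1/model2 are the models trained on D and D' as above (retrained for each value of iter):

def top_words(model, query, topn):
    # Keep only the words from most_similar, dropping the similarity scores.
    return set(word for word, score in model.most_similar(query, topn=topn))

def jaccard(a, b):
    # |intersection| / |union| of the two neighbour sets.
    return len(a & b) / len(a | b)

for topn in (10, 20):  # the two bars in the graph
    s1 = top_words(model1, 'q1', topn)
    s2 = top_words(model2, 'q1', topn)
    print('topn =', topn, 'jaccard =', jaccard(s1, s2))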
I can't really understand why this is happening or what's going on.
Source: https://stackoverflow.com/questions/36790867/gensim-word2vec-changing-the-input-sentence-order