word2vec

How to handle <UKN> tokens in text generation

Submitted by 北城余情 on 2019-12-14 03:28:12
Question: In my text generation dataset, I have converted all infrequent words into the <UKN> token (unknown word), as suggested by most of the text-generation literature. However, when training an RNN to take part of a sentence as input and predict the rest of the sentence, I am not sure how I should stop the network from generating <UKN> tokens. When the network encounters an unknown (infrequent) word in the training set, what should its output be? Example: Sentence: I went to the mall and bought a <ukn> and some
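The preprocessing step the question describes, replacing infrequent words with an unknown-word token, can be sketched as below. This is a minimal illustration, not the asker's actual pipeline: the `min_count` threshold, the function name, and the toy corpus are all assumptions, and the token is spelled `<UKN>` only because the question spells it that way (it is more often written `<UNK>`).

```python
from collections import Counter

UNK_TOKEN = "<UKN>"  # spelling taken from the question; commonly <UNK>

def replace_rare_words(sentences, min_count=2):
    """Replace words occurring fewer than `min_count` times with UNK_TOKEN."""
    counts = Counter(word for sent in sentences for word in sent)
    return [[word if counts[word] >= min_count else UNK_TOKEN for word in sent]
            for sent in sentences]

corpus = [["i", "went", "to", "the", "mall"],
          ["i", "went", "to", "the", "shop"]]
# "mall" and "shop" each occur once, so both become <UKN>
print(replace_rare_words(corpus, min_count=2))
```

Whether the network should ever *emit* this token at generation time is exactly the open question above; the sketch only shows how the token enters the training data in the first place.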

How are word vectors co-trained with paragraph vectors in doc2vec DBOW?

Submitted by 。_饼干妹妹 on 2019-12-13 19:29:02
Question: I don't understand how word vectors are involved at all in the training process with gensim's doc2vec in DBOW mode (dm=0). I know that word-vector training is disabled by default (dbow_words=0). But what happens when we set dbow_words to 1? In my understanding of DBOW, the context words are predicted directly from the paragraph vectors, so the only parameters of the model are the N p-dimensional paragraph vectors plus the parameters of the classifier. But multiple sources hint that it is possible in DBOW

How to find that one text is similar to the part of another?

Submitted by 独自空忆成欢 on 2019-12-13 03:45:17
Question: We know how to assess the similarity of two whole texts, for example by Word Mover's Distance. How can we find a piece inside one text that is similar to another text? Answer 1: You could break the text into chunks – ideally by natural groupings, like sentences or paragraphs – then do pairwise comparisons of every chunk against every other, using some text-distance measure. Word Mover's Distance can give impressive results, but it is quite slow/expensive to calculate, especially for large
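The chunk-and-compare approach from the answer can be sketched as follows. To keep the example self-contained, a cheap Jaccard token-overlap distance stands in for Word Mover's Distance (in gensim the real call would be along the lines of a WMD computation on the model), so the numbers differ from WMD but the chunking-and-pairwise-scan structure is the same. All names here are illustrative.

```python
def jaccard_distance(a, b):
    """Cheap stand-in for a real text distance such as Word Mover's Distance."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(sa & sb) / len(sa | sb)

def best_matching_chunk(long_text, query, sep="."):
    """Split `long_text` into sentence-like chunks and return the chunk
    closest to `query` under the distance measure."""
    chunks = [c.strip() for c in long_text.split(sep) if c.strip()]
    return min(chunks, key=lambda c: jaccard_distance(c, query))

doc = "The cat sat on the mat. Stocks fell sharply today. Dogs chase cats."
print(best_matching_chunk(doc, "the cat on a mat"))
# → "The cat sat on the mat"
```

Swapping in a stronger (and slower) distance function only changes `jaccard_distance`; the surrounding scan is unchanged, which is the point the answer makes about cost: the number of distance calls grows with the number of chunk pairs.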

evaluate word2vec with SimLex-999 and wordsim353

Submitted by 岁酱吖の on 2019-12-13 03:32:22
Question: I have evaluated my model with SimLex-999 and wordsim353, but I don't know whether the result is acceptable. wordsim353 result:
Pearson correlation coefficient against C:\ProgramData\Anaconda3\lib\site-packages\gensim\test\test_data\wordsim353.tsv: 0.4895
2019-08-27 08:30:06,655 : INFO : Spearman rank-order correlation coefficient against C:\ProgramData\Anaconda3\lib\site-packages\gensim\test\test_data\wordsim353.tsv: 0.4799
2019-08-27 08:30:06,656 : INFO : Pairs with unknown words ratio: 7.1% ((0

Gensim Word2Vec changing the input sentence order?

Submitted by 我怕爱的太早我们不能终老 on 2019-12-13 01:13:41
Question: In gensim's documentation, window size is defined as: "window is the maximum distance between the current and predicted word within a sentence." This should mean that, when looking at context, it doesn't go beyond the sentence boundary, right? What I did was create a document with several thousand tweets and select a word (q1), then select the words most similar to q1 (using model.most_similar('q1')). But then, if I randomly shuffle the tweets in the input document and then do the same

Error loading Pretrained vectors on gensim 0.12

Submitted by 妖精的绣舞 on 2019-12-12 20:56:30
Question: I am calling load like this:
model = gensim.models.Word2Vec.load("F:\\TrialGrounds\\gensimMODEL4\\model4")
and get a traceback ending in:
File ".../dist-packages/gensim/utils.py", line 912: model = super(Word2Vec, cls).load(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/gensim/utils.py", line 248, in load: obj = unpickle(fname)
File ".../python2...", in unpickle: return _pickle.loads(f.read())
AttributeError: 'module' object has no attribute 'call_on_class_only'
The model is split across two 500 MB numpy arrays. Can

Gensim: “C extension not loaded, training will be slow.”

Submitted by 情到浓时终转凉″ on 2019-12-12 14:45:22
Question: I am running gensim on SUSE Linux. I can start my Python program, but on startup I get: "C extension not loaded, training will be slow. Install a C compiler and reinstall gensim for fast training." GCC is installed. Does anyone know what I have to do? Answer 1: Try the following. Python 3.x:
$ pip3 uninstall gensim
$ apt-get install python3-dev build-essential
$ pip3 install --upgrade gensim
Python 2.x:
$ pip uninstall gensim
$ apt-get install python-dev build-essential
$ pip install --upgrade gensim

Getting different results from deeplearning4j and word2vec

Submitted by 时光毁灭记忆、已成空白 on 2019-12-12 05:56:53
Question: I trained a word embedding model using Google's word2vec. The output is a file that contains a word and its vector. I loaded this trained model in deeplearning4j: WordVectors vec = WordVectorSerializer.loadTxtVectors(new File("vector.txt")); Collection<String> lst = vec.wordsNearest("someWord", 10); But the two lists of similar words obtained from deeplearning4j's package and word2vec's distance function are totally different, although I used the same vector file. Does anyone have a good
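One way to debug a discrepancy like this is to compute nearest neighbours directly from the raw vectors yourself, then see which tool agrees with that ground truth. A minimal cosine-similarity sketch, where the in-memory dict is a toy stand-in for the parsed contents of vector.txt (words, vectors, and function names are all illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def words_nearest(vectors, word, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    target = vectors[word]
    others = [(w, cosine(target, v)) for w, v in vectors.items() if w != word]
    return [w for w, _ in sorted(others, key=lambda p: -p[1])[:topn]]

vectors = {                     # toy stand-in for a parsed vector.txt
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.1, 0.05, 0.9],
    "pear":  [0.12, 0.1, 0.85],
}
print(words_nearest(vectors, "king"))  # → ['queen', 'pear']
```

If both tools disagree with this direct computation, the vector file is probably being parsed differently (header line, normalization, or encoding); if one agrees, the other's similarity metric or loading path is the likely culprit.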

Python: clustering similar words based on word2vec

Submitted by 一曲冷凌霜 on 2019-12-12 04:54:20
Question: This might be a naive question. I have a tokenized corpus on which I have trained a Gensim Word2Vec model. The code is as below:
site = Article("http://www.datasciencecentral.com/profiles/blogs/blockchain-and-artificial-intelligence-1")
site.download()
site.parse()
def clean(doc):
    stop_free = " ".join([i for i in word_tokenize(doc.lower()) if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word)
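For the clustering goal in the title, one simple approach once vectors exist is greedy grouping by cosine similarity: put each word into the first cluster whose seed it resembles closely enough, otherwise start a new cluster. The sketch below uses hand-made toy vectors in place of a trained model's word vectors, and the 0.8 threshold is an arbitrary illustrative choice, not a recommended value.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def greedy_clusters(vectors, threshold=0.8):
    """Assign each word to the first cluster whose seed vector is
    cosine-similar above `threshold`; otherwise start a new cluster."""
    clusters = []  # list of (seed_vector, [member_words])
    for word, vec in vectors.items():
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(word)
                break
        else:
            clusters.append((vec, [word]))
    return [members for _, members in clusters]

toy = {"cat": [1.0, 0.1], "dog": [0.9, 0.2],
       "stock": [0.1, 1.0], "bond": [0.2, 0.95]}
print(greedy_clusters(toy))  # → [['cat', 'dog'], ['stock', 'bond']]
```

In practice people usually reach for a proper algorithm such as k-means over the embedding matrix instead of this greedy pass; the sketch just makes the "similar vectors go together" idea concrete.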

How to find most similar terms/words of a document in doc2vec? [duplicate]

Submitted by 爱⌒轻易说出口 on 2019-12-12 04:08:49
Question: This question already has answers here: How to intrepret Clusters results after using Doc2vec? (3 answers). Closed 2 years ago. I have applied Doc2vec to convert documents into vectors. After that, I used the vectors in clustering and figured out the 5 nearest/most similar documents to the centroid of each cluster. Now I need to find the most dominant or important terms of these documents so that I can figure out the characteristics of each cluster. My question is: is there any way to figure
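A common way to surface a cluster's characteristic terms, separate from the Doc2vec vectors themselves, is TF-IDF-style scoring: terms that are frequent in the cluster's documents but rare across the whole corpus score highest. A minimal sketch on toy tokenized documents (the corpus, function name, and scoring details are illustrative, not the asker's data):

```python
import math
from collections import Counter

def top_terms(cluster_docs, all_docs, topn=3):
    """Score terms by frequency within the cluster times inverse
    document frequency over the whole corpus; return the top `topn`."""
    tf = Counter(word for doc in cluster_docs for word in doc)
    n_docs = len(all_docs)

    def idf(word):
        df = sum(1 for doc in all_docs if word in doc)
        return math.log(n_docs / df)

    scored = sorted(tf, key=lambda w: tf[w] * idf(w), reverse=True)
    return scored[:topn]

all_docs = [["stock", "market", "stock"],
            ["market", "stock", "trade"],
            ["cat", "dog", "pet"],
            ["dog", "pet", "vet"]]
cluster = all_docs[:2]  # e.g. the documents nearest one cluster centroid
print(top_terms(cluster, all_docs, topn=1))  # → ['stock']
```

Run on the 5 documents nearest each centroid, this yields a short label-like term list per cluster; libraries such as scikit-learn provide a production-grade TF-IDF vectorizer for the same idea.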