word2vec

Why is the similarity between two bags of words in gensim.word2vec calculated this way?

…衆ロ難τιáo~ submitted on 2019-12-07 22:55:09
Question:

    def n_similarity(self, ws1, ws2):
        v1 = [self[word] for word in ws1]
        v2 = [self[word] for word in ws2]
        return dot(matutils.unitvec(array(v1).mean(axis=0)),
                   matutils.unitvec(array(v2).mean(axis=0)))

This is code I excerpted from gensim's word2vec. I know that the similarity of two single words can be calculated by cosine distance, but what about two word sets? The code seems to take the mean of each set's word vectors and then compute the cosine distance between the two mean vectors. I know little about word2vec; is there…
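To make the computation concrete, here is a minimal, self-contained sketch of the same calculation in plain numpy; the toy vocabulary and vectors below are invented purely for illustration:

    import numpy as np

    def unitvec(v):
        # Scale a vector to unit length, as gensim's matutils.unitvec does.
        norm = np.linalg.norm(v)
        return v / norm if norm > 0 else v

    def n_similarity(vectors, ws1, ws2):
        # Collapse each word set to the mean of its word vectors, then
        # take the cosine similarity: the dot product of the two
        # unit-length mean vectors.
        v1 = np.array([vectors[w] for w in ws1]).mean(axis=0)
        v2 = np.array([vectors[w] for w in ws2]).mean(axis=0)
        return float(np.dot(unitvec(v1), unitvec(v2)))

    vectors = {'cat': np.array([1.0, 0.0]),
               'dog': np.array([0.9, 0.1]),
               'car': np.array([0.0, 1.0])}
    print(n_similarity(vectors, ['cat', 'dog'], ['car']))

In other words, each word set is reduced to its centroid, and the two centroids are compared by cosine similarity, so the measure behaves exactly like ordinary word-to-word cosine similarity once each set has been averaged into a single vector.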

How to convert a gensim Word2Vec model to a FastText model?

非 Y 不嫁゛ submitted on 2019-12-07 17:52:10
Question: I have a Word2Vec model which was trained on a huge corpus. While using this model for a neural network application I came across quite a few "out of vocabulary" words. Now I need to find word embeddings for these out-of-vocabulary words. So I did some googling and found that Facebook has recently released a FastText library for this. Now my question is: how can I convert my existing word2vec model or KeyedVectors to a FastText model? Answer 1: FastText is able to create vectors for subword fragments…
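The answer is cut off, but the gist of the subword approach can be sketched with gensim's own FastText class (a hedged illustration, not the answer's actual code; the toy corpus and hyperparameters are invented):

    from gensim.models import FastText

    sentences = [['machine', 'learning', 'is', 'fun'],
                 ['word', 'embeddings', 'capture', 'meaning']]

    # gensim 4.x uses vector_size=; in gensim 3.x the parameter was size=.
    model = FastText(sentences, vector_size=50, window=3,
                     min_count=1, epochs=10)

    # 'learnings' never occurred in the corpus, but FastText can still
    # synthesize a vector for it from overlapping character n-grams.
    print(model.wv['learnings'])

Note that a plain Word2Vec model holds no subword information, so "converting" it generally means retraining a FastText model on the original corpus rather than transforming the saved model in place.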

LINK : fatal error LNK1104: cannot open file 'C:\Users\hp\.pyxbld\lib.win32-2.7\gensim\models\word2vec_inner.pyd'

只谈情不闲聊 submitted on 2019-12-07 16:52:13
Question: I am running the source code of the TWE model, and I need to compile its Python C extension. I have installed the Microsoft Visual C++ Compiler for Python 2.7 and Cython. First, I need to run TWE/train.py:

    import gensim
    sentence_word = gensim.models.word2vec.LineSentence("tmp/word.file")
    print "Training the word vector..."
    w = gensim.models.Word2Vec(sentence_word, size=400, workers=20)
    sentence = gensim.models.word2vec.CombinedSentence("tmp/word.file", "tmp/topic.file")
    print "Training the topic vector…

Using freebase vectors with gensim

梦想与她 submitted on 2019-12-07 12:32:04
Question: I am trying to use the freebase word embeddings released by Google, but I have a hard time getting the words back from the freebase names.

    model = gensim.models.Word2Vec.load_word2vec_format('freebase-vectors-skipgram1000.bin', binary=True)
    model.vocab.keys()[:10]
    Out[22]:
    [u'/m/026tg5z', u'/m/018jz8', u'/m/04klsk', u'/m/08gd39', u'/m/0kt94',
     u'/m/05mtf0t', u'/m/05tjjb', u'/m/01m3vn', u'/m/0h7p35', u'/m/03ggvg3']

Does anyone know if there exists some kind of table to map the freebase representations…

Load a Word2Vec model in Spark

无人久伴 submitted on 2019-12-07 09:21:14
Question: Is it possible to load a pretrained (binary) model into Spark (using Scala)? I have tried to load one of the binary models generated by Google like this:

    import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
    val model = Word2VecModel.load(sc, "GoogleNews-vectors-negative300.bin")

but it is not able to locate the metadata directory. I also created the folder and put the binary file there, but it cannot be parsed. I did not find any wrapper for this issue. Answer 1: I wrote a…
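The answer is truncated, but one workaround can be sketched (here in PySpark rather than the question's Scala, and purely as an illustration): Spark's Word2VecModel.load expects a directory that Spark itself saved, with metadata/ and data/ parts, so the Google binary has to be parsed by something that understands that format, for example gensim, and the vectors then handed to Spark as ordinary rows.

    from gensim.models import KeyedVectors
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('w2v-import').getOrCreate()

    # Parse Google's binary word2vec format with gensim.
    kv = KeyedVectors.load_word2vec_format(
        'GoogleNews-vectors-negative300.bin', binary=True)

    # kv.index_to_key is the gensim 4.x vocabulary list (gensim 3.x used
    # kv.vocab). The full GoogleNews model holds ~3M words, so in
    # practice you would stream or limit this rather than materialize
    # one huge Python list.
    rows = [(word, kv[word].tolist()) for word in kv.index_to_key[:100000]]
    df = spark.createDataFrame(rows, ['word', 'vector'])
    df.show(5)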

'utf-8' decode error when loading a word2vec model

不羁岁月 submitted on 2019-12-07 04:07:09
Question: I have to use a word2vec model containing tons of Chinese characters. The model was trained by my coworkers using Java and is saved as a bin file. I installed gensim and tried to load the model, but the following error occurred:

    In [1]: import gensim
    In [2]: model = gensim.models.Word2Vec.load_word2vec_format('/data5/momo-projects/user_interest_classification/code/word2vec/vectors_groups_1105.bin', binary=True)
    UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 96-97: unexpected…
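The traceback is cut off, but the usual workaround can be sketched: gensim's loader accepts encoding= and unicode_errors= arguments, so bytes that do not decode cleanly (common when the binary was written by a non-gensim tool, here Java) can be replaced instead of raising. Whether the recovered words are usable depends on the encoding the Java trainer actually wrote; the path below is the question's own.

    import gensim

    # unicode_errors='replace' (or 'ignore') substitutes malformed byte
    # sequences instead of raising UnicodeDecodeError; if the file was
    # written in another encoding such as GBK, pass that as encoding=.
    # Older gensim exposed the same loader as
    # Word2Vec.load_word2vec_format.
    model = gensim.models.KeyedVectors.load_word2vec_format(
        '/data5/momo-projects/user_interest_classification/code/word2vec/vectors_groups_1105.bin',
        binary=True, encoding='utf-8', unicode_errors='replace')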

I used word2vec in deeplearning4j to train word vectors, but those vectors are unstable

一世执手 submitted on 2019-12-06 22:57:32
1. I used IntelliJ IDEA to build a Maven project; the code is as follows:

    System.out.println("Load data....");
    SentenceIterator iter = new LineSentenceIterator(new File("/home/zs/programs/deeplearning4j-master/dl4j-test-resources/src/main/resources/raw_sentences.txt"));
    iter.setPreProcessor(new SentencePreProcessor() {
        @Override
        public String preProcess(String sentence) {
            return sentence.toLowerCase();
        }
    });
    System.out.println("Build model....");
    int batchSize = 1000;
    int iterations = 30;
    int layerSize = 300;
    com.sari.Word2Vec vec = new com.sari.Word2Vec.Builder()
        .batchSize(batchSize)  // # words per minibatch
        .sampling(1e-5)        // negative sampling
        …

FastText: using pre-trained word vectors for text classification

≡放荡痞女 submitted on 2019-12-06 22:16:33
Question: I am working on a text classification problem, that is, given some text, I need to assign certain given labels to it. I have tried the fastText library by Facebook, which has two utilities of interest to me: (A) word vectors with pre-trained models and (B) text classification utilities. However, these seem to be completely independent tools, as I have been unable to find any tutorials that merge the two. What I want is to be able to classify some text by taking advantage of the…
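The question is cut off, but the two utilities can in fact be combined; here is a hedged sketch with the fasttext Python package (file names and hyperparameters are illustrative, not from the question):

    import fasttext

    # train.txt holds one example per line in fastText's supervised
    # format, e.g.:
    #   __label__sports the match went to extra time
    # pretrainedVectors seeds the classifier's input embeddings from a
    # .vec (text-format) file; dim must match those vectors.
    model = fasttext.train_supervised(
        input='train.txt',
        pretrainedVectors='wiki.en.vec',
        dim=300,
        epoch=25,
        lr=0.5)

    print(model.predict('the match went to extra time'))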

NLP Series 1: NER

淺唱寂寞╮ submitted on 2019-12-06 16:28:01
Preface: NER was the first task I tackled when starting out in NLP. I tried several approaches (CNN+CRF, LSTM+CRF, BERT+LSTM+CRF), and, unsurprisingly, BERT gave the best results.

1. About NER:
NER, named entity recognition, is a subtask of information extraction, but at its core it is a sequence labeling task.
e.g.:
sentence: 壹 叁 去 参 加 一 个 NER 交 流 会
tag: B_PER I_PER O O O O O B_ORG I_ORG I_ORG I_ORG
(let us assume for now that this entity tagging is correct)
Following the CoNLL-2003 task: LOC marks place names, PER person names, ORG organization names, and MISC other entities; all other tokens are tagged O. B marks the beginning position of an entity name and I marks an inside position. (Tag sets vary across tasks; label according to your own task.)
NER is a fundamental problem, one you cannot do without, but it is also a very important one. Below I will go through the problems I ran into during implementation, one by one (I am a beginner; if anything is wrong, please flame me in the comments and I will correct it).
First, be clear that NER is a classification task, also called sequence labeling: it simply tags the different entities in a text with their corresponding labels, as the sketch after this list illustrates. The main approaches are:
- Rule-based: define rules from linguistic priors; given the complexity of language, it is hard to craft good rules this way;
- Statistical: find regularities in large amounts of text; the representative models are HMM and CRF;
- Neural networks: neural networks have shone brilliantly, lighting up the sky in every domain…
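To make the tagging scheme concrete, here is a minimal sketch (my own illustration, reusing the example sentence above) of how B/I/O tags line up with tokens and how entity spans are read back out:

    # Character-level tokens and their B/I/O tags from the example above.
    chars = ['壹', '叁', '去', '参', '加', '一', '个', 'NER', '交', '流', '会']
    tags  = ['B_PER', 'I_PER', 'O', 'O', 'O', 'O', 'O',
             'B_ORG', 'I_ORG', 'I_ORG', 'I_ORG']

    def extract_entities(chars, tags):
        # Walk the tag sequence: a B_* opens an entity, following I_*
        # tags of the same type extend it, anything else closes it.
        entities, current, etype = [], [], None
        for ch, tag in zip(chars, tags):
            if tag.startswith('B_'):
                if current:
                    entities.append((etype, ''.join(current)))
                current, etype = [ch], tag[2:]
            elif tag.startswith('I_') and etype == tag[2:]:
                current.append(ch)
            else:
                if current:
                    entities.append((etype, ''.join(current)))
                current, etype = [], None
        if current:
            entities.append((etype, ''.join(current)))
        return entities

    print(extract_entities(chars, tags))
    # [('PER', '壹叁'), ('ORG', 'NER交流会')]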

Does or will H2O provide any pretrained vectors for use with h2o word2vec?

夙愿已清 submitted on 2019-12-06 08:24:07
Question: H2O recently added word2vec to its API. It is great to be able to easily train your own word vectors on a corpus you provide yourself. However, even greater possibilities exist from using big data and big computers of the kind that software vendors like Google or H2O.ai may have access to, but which not so many end users of H2O do, due to network bandwidth and compute power limitations. Word embeddings can be seen as a type of unsupervised learning. As such, great value can be had in a data science…