word2vec

Why is the similarity between two bags of words in gensim.word2vec calculated this way?

…衆ロ難τιáo~ submitted on 2019-12-07 22:55:09
Question:

    def n_similarity(self, ws1, ws2):
        v1 = [self[word] for word in ws1]
        v2 = [self[word] for word in ws2]
        return dot(matutils.unitvec(array(v1).mean(axis=0)),
                   matutils.unitvec(array(v2).mean(axis=0)))

This is code I excerpted from gensim's word2vec. I know that the similarity of two single words can be calculated by cosine distance, but what about two word sets? The code seems to take the mean of each set's word vectors and then compute the cosine distance between the two mean vectors. I know little about word2vec; is there…
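To make the computation concrete, here is a minimal, self-contained sketch of the same calculation in plain numpy; the toy vocabulary and vectors below are invented purely for illustration:

    import numpy as np

    def unitvec(v):
        # Scale a vector to unit length, as gensim's matutils.unitvec does.
        norm = np.linalg.norm(v)
        return v / norm if norm > 0 else v

    def n_similarity(vectors, ws1, ws2):
        # Collapse each word set to the mean of its word vectors, then
        # take the cosine similarity: the dot product of the two
        # unit-length mean vectors.
        v1 = np.array([vectors[w] for w in ws1]).mean(axis=0)
        v2 = np.array([vectors[w] for w in ws2]).mean(axis=0)
        return float(np.dot(unitvec(v1), unitvec(v2)))

    vectors = {'cat': np.array([1.0, 0.0]),
               'dog': np.array([0.9, 0.1]),
               'car': np.array([0.0, 1.0])}
    print(n_similarity(vectors, ['cat', 'dog'], ['car']))

In other words, each word set is reduced to its centroid, and the two centroids are compared by cosine similarity, so the measure behaves exactly like ordinary word-to-word cosine similarity once each set has been averaged into a single vector.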

How to convert a gensim Word2Vec model to a FastText model?

非 Y 不嫁゛ submitted on 2019-12-07 17:52:10
Question: I have a Word2Vec model which was trained on a huge corpus. While using this model for a neural network application I came across quite a few "out of vocabulary" words. Now I need to find word embeddings for these out-of-vocabulary words. So I did some googling and found that Facebook has recently released a FastText library for this. Now my question is: how can I convert my existing word2vec model or KeyedVectors to a FastText model? Answer 1: FastText is able to create vectors for subword fragments…
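The answer is cut off, but the gist of the subword approach can be sketched with gensim's own FastText class (a hedged illustration, not the answer's actual code; the toy corpus and hyperparameters are invented):

    from gensim.models import FastText

    sentences = [['machine', 'learning', 'is', 'fun'],
                 ['word', 'embeddings', 'capture', 'meaning']]

    # gensim 4.x uses vector_size=; in gensim 3.x the parameter was size=.
    model = FastText(sentences, vector_size=50, window=3,
                     min_count=1, epochs=10)

    # 'learnings' never occurred in the corpus, but FastText can still
    # synthesize a vector for it from overlapping character n-grams.
    print(model.wv['learnings'])

Note that a plain Word2Vec model holds no subword information, so "converting" it generally means retraining a FastText model on the original corpus rather than transforming the saved model in place.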

LINK : fatal error LNK1104: cannot open file 'C:\Users\hp\.pyxbld\lib.win32-2.7\gensim\models\word2vec_inner.pyd'

只谈情不闲聊 submitted on 2019-12-07 16:52:13
Question: I am running the source code of the TWE model, and I need to compile its Python C extension. I have installed the Microsoft Visual C++ Compiler for Python 2.7 and Cython. First, I need to run TWE/train.py:

    import gensim
    sentence_word = gensim.models.word2vec.LineSentence("tmp/word.file")
    print "Training the word vector..."
    w = gensim.models.Word2Vec(sentence_word, size=400, workers=20)
    sentence = gensim.models.word2vec.CombinedSentence("tmp/word.file", "tmp/topic.file")
    print "Training the topic vector…

Using freebase vectors with gensim

梦想与她 submitted on 2019-12-07 12:32:04
Question: I am trying to use the freebase word embeddings released by Google, but I have a hard time getting the words back from the freebase names.

    model = gensim.models.Word2Vec.load_word2vec_format('freebase-vectors-skipgram1000.bin', binary=True)
    model.vocab.keys()[:10]
    Out[22]:
    [u'/m/026tg5z', u'/m/018jz8', u'/m/04klsk', u'/m/08gd39', u'/m/0kt94',
     u'/m/05mtf0t', u'/m/05tjjb', u'/m/01m3vn', u'/m/0h7p35', u'/m/03ggvg3']

Does anyone know if there exists some kind of table to map the freebase representations…

Load a Word2Vec model in Spark

无人久伴 submitted on 2019-12-07 09:21:14
Question: Is it possible to load a pretrained (binary) model into Spark (using Scala)? I have tried to load one of the binary models generated by Google like this:

    import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
    val model = Word2VecModel.load(sc, "GoogleNews-vectors-negative300.bin")

but it is not able to locate the metadata directory. I also created the folder and put the binary file there, but it cannot be parsed. I did not find any wrapper for this issue. Answer 1: I wrote a…
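The answer is truncated, but one workaround can be sketched (here in PySpark rather than the question's Scala, and purely as an illustration): Spark's Word2VecModel.load expects a directory that Spark itself saved, with metadata/ and data/ parts, so the Google binary has to be parsed by something that understands that format, for example gensim, and the vectors then handed to Spark as ordinary rows.

    from gensim.models import KeyedVectors
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('w2v-import').getOrCreate()

    # Parse Google's binary word2vec format with gensim.
    kv = KeyedVectors.load_word2vec_format(
        'GoogleNews-vectors-negative300.bin', binary=True)

    # kv.index_to_key is the gensim 4.x vocabulary list (gensim 3.x used
    # kv.vocab). The full GoogleNews model holds ~3M words, so in
    # practice you would stream or limit this rather than materialize
    # one huge Python list.
    rows = [(word, kv[word].tolist()) for word in kv.index_to_key[:100000]]
    df = spark.createDataFrame(rows, ['word', 'vector'])
    df.show(5)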

'utf-8' decode error when loading a word2vec model

不羁岁月 submitted on 2019-12-07 04:07:09
Question: I have to use a word2vec model containing tons of Chinese characters. The model was trained by my coworkers using Java and is saved as a bin file. I installed gensim and tried to load the model, but the following error occurred:

    In [1]: import gensim
    In [2]: model = gensim.models.Word2Vec.load_word2vec_format('/data5/momo-projects/user_interest_classification/code/word2vec/vectors_groups_1105.bin', binary=True)
    UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 96-97: unexpected…
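The traceback is cut off, but the usual workaround can be sketched: gensim's loader accepts encoding= and unicode_errors= arguments, so bytes that do not decode cleanly (common when the binary was written by a non-gensim tool, here Java) can be replaced instead of raising. Whether the recovered words are usable depends on the encoding the Java trainer actually wrote; the path below is the question's own.

    import gensim

    # unicode_errors='replace' (or 'ignore') substitutes malformed byte
    # sequences instead of raising UnicodeDecodeError; if the file was
    # written in another encoding such as GBK, pass that as encoding=.
    # Older gensim exposed the same loader as
    # Word2Vec.load_word2vec_format.
    model = gensim.models.KeyedVectors.load_word2vec_format(
        '/data5/momo-projects/user_interest_classification/code/word2vec/vectors_groups_1105.bin',
        binary=True, encoding='utf-8', unicode_errors='replace')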

I used word2vec in deeplearning4j to train word vectors, but those vectors are unstable

一世执手 submitted on 2019-12-06 22:57:32
1. I used IntelliJ IDEA to build a Maven project; the code is as follows:

    System.out.println("Load data....");
    SentenceIterator iter = new LineSentenceIterator(new File("/home/zs/programs/deeplearning4j-master/dl4j-test-resources/src/main/resources/raw_sentences.txt"));
    iter.setPreProcessor(new SentencePreProcessor() {
        @Override
        public String preProcess(String sentence) {
            return sentence.toLowerCase();
        }
    });
    System.out.println("Build model....");
    int batchSize = 1000;
    int iterations = 30;
    int layerSize = 300;
    com.sari.Word2Vec vec = new com.sari.Word2Vec.Builder()
        .batchSize(batchSize)  // # words per minibatch
        .sampling(1e-5)        // negative sampling
        …

FastText: using pre-trained word vectors for text classification

≡放荡痞女 submitted on 2019-12-06 22:16:33
Question: I am working on a text classification problem, that is, given some text, I need to assign certain given labels to it. I have tried the fastText library by Facebook, which has two utilities of interest to me: (A) word vectors with pre-trained models and (B) text classification utilities. However, these seem to be completely independent tools, as I have been unable to find any tutorials that merge the two. What I want is to be able to classify some text by taking advantage of the…
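The question is cut off, but the two utilities can in fact be combined; here is a hedged sketch with the fasttext Python package (file names and hyperparameters are illustrative, not from the question):

    import fasttext

    # train.txt holds one example per line in fastText's supervised
    # format, e.g.:
    #   __label__sports the match went to extra time
    # pretrainedVectors seeds the classifier's input embeddings from a
    # .vec (text-format) file; dim must match those vectors.
    model = fasttext.train_supervised(
        input='train.txt',
        pretrainedVectors='wiki.en.vec',
        dim=300,
        epoch=25,
        lr=0.5)

    print(model.predict('the match went to extra time'))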

NLP Series 1: NER

淺唱寂寞╮ submitted on 2019-12-06 16:28:01
Preface: NER was the first task I tackled when starting out in NLP. I tried several approaches (CNN+CRF, LSTM+CRF, BERT+LSTM+CRF), and, unsurprisingly, BERT gave the best results.

1. About NER:
NER, named entity recognition, is a subtask of information extraction, but at its core it is a sequence labeling task.
e.g.:
sentence: 壹 叁 去 参 加 一 个 NER 交 流 会
tag: B_PER I_PER O O O O O B_ORG I_ORG I_ORG I_ORG
(let us assume for now that this entity tagging is correct)
Following the CoNLL-2003 task: LOC marks place names, PER person names, ORG organization names, and MISC other entities; all other tokens are tagged O. B marks the beginning position of an entity name and I marks an inside position. (Tag sets vary across tasks; label according to your own task.)
NER is a fundamental problem, one you cannot do without, but it is also a very important one. Below I will go through the problems I ran into during implementation, one by one (I am a beginner; if anything is wrong, please flame me in the comments and I will correct it).
First, be clear that NER is a classification task, also called sequence labeling: it simply tags the different entities in a text with their corresponding labels, as the sketch after this list illustrates. The main approaches are:
- Rule-based: define rules from linguistic priors; given the complexity of language, it is hard to craft good rules this way;
- Statistical: find regularities in large amounts of text; the representative models are HMM and CRF;
- Neural networks: neural networks have shone brilliantly, lighting up the sky in every domain…
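To make the tagging scheme concrete, here is a minimal sketch (my own illustration, reusing the example sentence above) of how B/I/O tags line up with tokens and how entity spans are read back out:

    # Character-level tokens and their B/I/O tags from the example above.
    chars = ['壹', '叁', '去', '参', '加', '一', '个', 'NER', '交', '流', '会']
    tags  = ['B_PER', 'I_PER', 'O', 'O', 'O', 'O', 'O',
             'B_ORG', 'I_ORG', 'I_ORG', 'I_ORG']

    def extract_entities(chars, tags):
        # Walk the tag sequence: a B_* opens an entity, following I_*
        # tags of the same type extend it, anything else closes it.
        entities, current, etype = [], [], None
        for ch, tag in zip(chars, tags):
            if tag.startswith('B_'):
                if current:
                    entities.append((etype, ''.join(current)))
                current, etype = [ch], tag[2:]
            elif tag.startswith('I_') and etype == tag[2:]:
                current.append(ch)
            else:
                if current:
                    entities.append((etype, ''.join(current)))
                current, etype = [], None
        if current:
            entities.append((etype, ''.join(current)))
        return entities

    print(extract_entities(chars, tags))
    # [('PER', '壹叁'), ('ORG', 'NER交流会')]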

Does or will H2O provide any pretrained vectors for use with h2o word2vec?

夙愿已清 submitted on 2019-12-06 08:24:07
Question: H2O recently added word2vec to its API. It is great to be able to easily train your own word vectors on a corpus you provide yourself. However, even greater possibilities exist from using big data and big computers of the kind that software vendors like Google or H2O.ai may have access to, but which not so many end users of H2O do, due to network bandwidth and compute power limitations. Word embeddings can be seen as a type of unsupervised learning. As such, great value can be had in a data science…