word2vec

Failed to load a .bin.gz pre-trained word2vec model

本秂侑毒 submitted on 2019-12-25 09:31:07
Question: I'm trying to load the pre-trained word2vec vectors found here (https://github.com/mmihaltz/word2vec-GoogleNews-vectors). I used the following command:

```python
model = gensim.models.KeyedVectors.load_word2vec_format('word2vec.bin.gz', binary=False)
```

And it throws this error:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/deeplearning/anaconda3/lib/python3.6/site-packages/gensim/models/keyedvectors.py", line 193, in load_word2vec_format
    header = utils.to_unicode
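The GoogleNews file is in *binary* word2vec format, so it likely needs binary=True; a minimal sketch, reusing the question's filename (gensim decompresses .gz transparently):

```python
import gensim

# The .bin.gz file is a binary word2vec file; binary=True is required even
# though it is gzip-compressed (gensim handles the .gz layer itself).
model = gensim.models.KeyedVectors.load_word2vec_format(
    'word2vec.bin.gz', binary=True)
```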

TensorFlow Word2Vec error

让人想犯罪 __ submitted on 2019-12-25 08:38:09
Question: I downloaded the word2vec source code from GitHub: https://github.com/tensorflow/models/blob/master/tutorials/embedding/word2vec.py. I am using TensorFlow with PyCharm on Windows 10, and I installed tensorflow, python, and numpy, which are needed to run TensorFlow on Windows. In the word2vec.py source code I set the save path, train path, and eval path, and I downloaded the training text file from http://mattmahoney.net/dc/text8.zip, which the source code recommends. But when I ran the code I get the
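For reference, a minimal sketch of fetching and unpacking the text8 corpus the tutorial expects (paths are illustrative):

```python
import urllib.request
import zipfile

# Download the text8 corpus recommended by word2vec.py and unpack it;
# the archive contains a single file named 'text8'.
urllib.request.urlretrieve('http://mattmahoney.net/dc/text8.zip', 'text8.zip')
with zipfile.ZipFile('text8.zip') as zf:
    zf.extractall('.')
```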

Setting up a Hadoop and Spark cluster on AWS EC2

笑着哭i submitted on 2019-12-24 21:31:58
Preface: This post demonstrates how to build a cluster using the AWS EC2 cloud service. With only a single computer, there are a couple of other ways to set up a fully distributed cluster. One is to run several virtual machines locally, which is free and easy to control, but VMs place heavy demands on the host machine; I only have an ordinary laptop, and it cannot handle running two or three VMs. Another option is AWS EMR, Amazon's purpose-built cluster platform, which launches clusters quickly and offers good flexibility and scalability, making it easy to add machines. Its drawback is that you can only use the preset software, as shown in the figure below. Installing anything else requires a Bootstrap script (see https://docs.aws.amazon.com/zh_cn/emr/latest/ManagementGuide/emr-plan-software.html?shortFooter=true), which is no easy task; I remember trying to install Tencent's Angel on EMR and never getting it to work. Moreover, if you shut down an EMR cluster, none of its files or configuration are preserved and everything must be redone the next time, so EMR is best suited to one-off use. In summary, building by hand on plain EC2 is neither constrained by local resources nor inflexible: you can configure and install software freely. The downside is that manual setup takes more time, and some operations in the cloud differ from local ones, so a single misstep can leave you stuck for a long while. Given how little material exists online about building clusters with EC2

Using Word2Vec to solve polysemy problems

旧街凉风 submitted on 2019-12-24 17:50:15
Question: I have some questions about Word2Vec: What determines the dimension of the resulting model vectors? What are the elements of these vectors? Can I use Word2Vec to solve polysemy problems (state = administrative unit vs. state = condition) if I already have texts for every meaning of the words? Answer 1: (1) You pick the desired dimensionality as a meta-parameter of the model. Rigorous projects with enough time may try different sizes to see what works best for their qualitative evaluations. (2) Individual
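A minimal sketch of choosing the dimensionality up front in gensim (the parameter is vector_size in gensim 4.x; the same parameter is called size in 3.x):

```python
from gensim.models import Word2Vec

# Two toy sentences illustrating the two senses of 'state'.
sentences = [['the', 'state', 'passed', 'a', 'new', 'law'],
             ['the', 'state', 'of', 'the', 'patient', 'improved']]

# The embedding dimension is a meta-parameter chosen by the user.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
print(model.wv['state'].shape)  # (100,)
```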

Kaggle word2vec competition, part 2

試著忘記壹切 submitted on 2019-12-24 17:15:48
Question: My code is from https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-2-word-vectors. I read the data successfully; BeautifulSoup and nltk are used here to clean the text and remove non-letters (but keep numbers).

```python
def review_to_wordlist( review, remove_stopwords=False ):
    # Function to convert a document to a sequence of words,
    # optionally removing stop words. Returns a list of words.
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(review).get_text()
    #
    # 2. Remove non-letters
    review_text = re.sub
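The excerpt cuts off mid-statement; a sketch of how the tutorial's function typically continues, assuming the standard Kaggle part-2 code with the regex widened to keep digits as the question describes:

```python
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

def review_to_wordlist(review, remove_stopwords=False):
    # 1. Remove HTML markup.
    review_text = BeautifulSoup(review, 'html.parser').get_text()
    # 2. Remove non-letters, keeping digits as well.
    review_text = re.sub('[^a-zA-Z0-9]', ' ', review_text)
    # 3. Lowercase and split into tokens.
    words = review_text.lower().split()
    # 4. Optionally drop English stop words.
    if remove_stopwords:
        stops = set(stopwords.words('english'))
        words = [w for w in words if w not in stops]
    return words
```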

Exceeding spark.akka.frameSize when saving Word2VecModel

我的未来我决定 submitted on 2019-12-24 15:22:36
Question: I am using Spark's Word2Vec to train some word vectors. The training essentially works, but when it comes to saving the model I get an org.apache.spark.SparkException saying:

```
Job aborted due to stage failure: Serialized task 1278:0 was 1073394582 bytes,
which exceeds max allowed: spark.akka.frameSize (134217728 bytes) - reserved
(204800 bytes). Consider increasing spark.akka.frameSize or using broadcast
variables for large values.
```

The stack trace points at line 190, but there is a
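A minimal sketch of raising the frame size as the error message suggests (spark.akka.frameSize takes a value in MB and exists only in Spark 1.x; it was removed along with the Akka transport in Spark 2.x):

```python
from pyspark import SparkConf, SparkContext

# Raise the Akka frame size from its 128 MB default (value is in MB).
conf = SparkConf().setAppName('word2vec-save') \
                  .set('spark.akka.frameSize', '1024')
sc = SparkContext(conf=conf)
```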

Python Gensim word2vec vocabulary key

陌路散爱 submitted on 2019-12-24 07:57:42
Question: I want to train a word2vec model with gensim. I heard that the vocabulary corpus should be unicode, so I converted it to unicode.

```python
# -*- encoding:utf-8 -*-
# !/usr/bin/env python
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from gensim.models import Word2Vec
import pprint

with open('parsed_data.txt', 'r') as f:
    corpus = map(unicode, f.read().split('\n'))

model = Word2Vec(size=128, window=5, min_count=5, workers=4)
model.build_vocab(corpus, keep_raw_vocab=False)
model.train(corpus)
model.save('w2v')
```
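One likely problem with the code above is the input shape: gensim's Word2Vec expects an iterable of token lists, and a raw string is iterated character by character. A minimal sketch, keeping the question's gensim 3.x-style size parameter:

```python
import io
from gensim.models import Word2Vec

# Each sentence must be a list of tokens, not a raw unicode string.
with io.open('parsed_data.txt', 'r', encoding='utf-8') as f:
    corpus = [line.split() for line in f if line.strip()]

model = Word2Vec(corpus, size=128, window=5, min_count=5, workers=4)
model.save('w2v')
```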

spacy similarity method doesn't work correctly

落爺英雄遲暮 submitted on 2019-12-24 04:52:11
Question: I always get a lot of help from Stack Overflow; thank you all. I am doing simple natural language processing using spacy. I'm working on filtering out words by measuring the similarity between words. I wrote and used the following simple code, shown in the spacy documentation, but the result does not match the documentation's.

```python
import spacy

nlp = spacy.load('en_core_web_lg')
tokens = nlp('dog cat banana')
for token1 in tokens:
    for token2 in tokens:
        sim = token1.similarity(token2)
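A runnable version that prints the pairwise scores (the large model must be downloaded first with `python -m spacy download en_core_web_lg`):

```python
import spacy

nlp = spacy.load('en_core_web_lg')
tokens = nlp('dog cat banana')

for token1 in tokens:
    for token2 in tokens:
        # Token.similarity returns the cosine similarity of the word vectors.
        print(token1.text, token2.text, token1.similarity(token2))
```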

Why do we use the input-hidden weight matrix as the word vectors instead of the hidden-output weight matrix?

六月ゝ 毕业季﹏ submitted on 2019-12-24 00:48:54
Question: In word2vec, after training, we get two weight matrices: (1) the input-to-hidden weight matrix and (2) the hidden-to-output weight matrix. People use the input-to-hidden weight matrix as the word vectors (each row corresponds to a word). This is where my confusion comes from: why do people use the input-to-hidden weight matrix as the word vectors instead of the hidden-to-output matrix? And why don't we just add the softmax activation function to the hidden layer rather than the output layer, thus preventing
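An illustrative sketch (not from the question) of the two matrices in a bare-bones skip-gram model, where the conventional word vectors are rows of the input-to-hidden matrix and the softmax sits at the output layer:

```python
import numpy as np

V, D = 10000, 300                      # vocabulary size, embedding dimension
W_in = 0.01 * np.random.randn(V, D)    # input -> hidden (the embedding lookup)
W_out = 0.01 * np.random.randn(D, V)   # hidden -> output

word_id = 42
h = W_in[word_id]                      # hidden activation: one row of W_in
scores = h @ W_out                     # logits over the whole vocabulary
probs = np.exp(scores - scores.max())
probs /= probs.sum()                   # softmax applied at the output layer

word_vector = W_in[word_id]            # the conventional "word vector"
```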

Text Classification in Practice (1): Pre-training Word Vectors with word2vec

浪子不回头ぞ submitted on 2019-12-23 12:30:54
1 Overview: This text-classification series will run to about ten posts, covering classification based on word2vec pre-training as well as on the latest pre-trained models (ELMo, BERT, etc.). The series comprises: word2vec pre-trained word vectors; the textCNN model; the charCNN model; the Bi-LSTM model; the Bi-LSTM + Attention model; the RCNN model; the Adversarial LSTM model; the Transformer model; the ELMo pre-trained model; and the BERT pre-trained model. All code is in the textClassifier repository.

2 Dataset: The dataset is the IMDB movie reviews, with three data files under /data/rawData: unlabeledTrainData.tsv, labeledTrainData.tsv, and testData.tsv. Text classification requires labeled data (labeledTrainData), but when training the word2vec word-vector model (unsupervised learning) the unlabeled data can be used as well.

3 Data preprocessing: The IMDB reviews are English text, and since this series focuses on the classification models, preprocessing is simple: just remove punctuation and HTML tags and lowercase the text. The code is as follows:

```python
import pandas as pd
from bs4 import BeautifulSoup
```
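The excerpt cuts off at the imports; a hedged sketch of the preprocessing step it describes (the review column name is assumed from the Kaggle IMDB .tsv layout):

```python
import re
import pandas as pd
from bs4 import BeautifulSoup

def clean_review(text):
    text = BeautifulSoup(text, 'html.parser').get_text()  # strip HTML tags
    text = re.sub(r'[^a-zA-Z]', ' ', text)                # drop punctuation
    return text.lower()                                   # lowercase

df = pd.read_csv('/data/rawData/labeledTrainData.tsv', sep='\t')
df['review'] = df['review'].apply(clean_review)
```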