word2vec

Failed to load a .bin.gz pre-trained word2vec model

本秂侑毒 submitted on 2019-12-25 09:31:07
Question: I'm trying to load the pre-trained word2vec vectors found here (https://github.com/mmihaltz/word2vec-GoogleNews-vectors). I used the following command:

```python
model = gensim.models.KeyedVectors.load_word2vec_format('word2vec.bin.gz', binary=False)
```

And it throws this error:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/deeplearning/anaconda3/lib/python3.6/site-packages/gensim/models/keyedvectors.py", line 193, in load_word2vec_format
    header = utils.to_unicode
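The GoogleNews file is in *binary* word2vec format, so it likely needs binary=True; a minimal sketch, reusing the question's filename (gensim decompresses .gz transparently):

```python
import gensim

# The .bin.gz file is a binary word2vec file; binary=True is required even
# though it is gzip-compressed (gensim handles the .gz layer itself).
model = gensim.models.KeyedVectors.load_word2vec_format(
    'word2vec.bin.gz', binary=True)
```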

TensorFlow Word2Vec error

让人想犯罪 __ submitted on 2019-12-25 08:38:09
Question: I downloaded the word2vec source code from GitHub: https://github.com/tensorflow/models/blob/master/tutorials/embedding/word2vec.py. I am using TensorFlow with PyCharm on Windows 10, and I installed tensorflow, python, and numpy, which are needed to run TensorFlow on Windows. In the word2vec.py source code I set the save path, train path, and eval path, and I downloaded the training text file from http://mattmahoney.net/dc/text8.zip, which the source code recommends. But when I ran the code I get the
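For reference, a minimal sketch of fetching and unpacking the text8 corpus the tutorial expects (paths are illustrative):

```python
import urllib.request
import zipfile

# Download the text8 corpus recommended by word2vec.py and unpack it;
# the archive contains a single file named 'text8'.
urllib.request.urlretrieve('http://mattmahoney.net/dc/text8.zip', 'text8.zip')
with zipfile.ZipFile('text8.zip') as zf:
    zf.extractall('.')
```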

Setting up a Hadoop and Spark cluster on AWS EC2

笑着哭i submitted on 2019-12-24 21:31:58
Preface: This post demonstrates how to build a cluster using the AWS EC2 cloud service. With only a single computer, there are a couple of other ways to set up a fully distributed cluster. One is to run several virtual machines locally, which is free and easy to control, but VMs place heavy demands on the host machine; I only have an ordinary laptop, and it cannot handle running two or three VMs. Another option is AWS EMR, Amazon's purpose-built cluster platform, which launches clusters quickly and offers good flexibility and scalability, making it easy to add machines. Its drawback is that you can only use the preset software, as shown in the figure below. Installing anything else requires a Bootstrap script (see https://docs.aws.amazon.com/zh_cn/emr/latest/ManagementGuide/emr-plan-software.html?shortFooter=true), which is no easy task; I remember trying to install Tencent's Angel on EMR and never getting it to work. Moreover, if you shut down an EMR cluster, none of its files or configuration are preserved and everything must be redone the next time, so EMR is best suited to one-off use. In summary, building by hand on plain EC2 is neither constrained by local resources nor inflexible: you can configure and install software freely. The downside is that manual setup takes more time, and some operations in the cloud differ from local ones, so a single misstep can leave you stuck for a long while. Given how little material exists online about building clusters with EC2

Using Word2Vec to solve polysemy problems

旧街凉风 submitted on 2019-12-24 17:50:15
Question: I have some questions about Word2Vec: What determines the dimension of the resulting model vectors? What are the elements of these vectors? Can I use Word2Vec to solve polysemy problems (state = administrative unit vs. state = condition) if I already have texts for every meaning of the words? Answer 1: (1) You pick the desired dimensionality as a meta-parameter of the model. Rigorous projects with enough time may try different sizes to see what works best for their qualitative evaluations. (2) Individual
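A minimal sketch of choosing the dimensionality up front in gensim (the parameter is vector_size in gensim 4.x; the same parameter is called size in 3.x):

```python
from gensim.models import Word2Vec

# Two toy sentences illustrating the two senses of 'state'.
sentences = [['the', 'state', 'passed', 'a', 'new', 'law'],
             ['the', 'state', 'of', 'the', 'patient', 'improved']]

# The embedding dimension is a meta-parameter chosen by the user.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
print(model.wv['state'].shape)  # (100,)
```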

Kaggle word2vec competition, part 2

試著忘記壹切 submitted on 2019-12-24 17:15:48
Question: My code is from https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-2-word-vectors. I read the data successfully; BeautifulSoup and nltk are used here to clean the text and remove non-letters (but keep numbers).

```python
def review_to_wordlist( review, remove_stopwords=False ):
    # Function to convert a document to a sequence of words,
    # optionally removing stop words. Returns a list of words.
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(review).get_text()
    #
    # 2. Remove non-letters
    review_text = re.sub
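The excerpt cuts off mid-statement; a sketch of how the tutorial's function typically continues, assuming the standard Kaggle part-2 code with the regex widened to keep digits as the question describes:

```python
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

def review_to_wordlist(review, remove_stopwords=False):
    # 1. Remove HTML markup.
    review_text = BeautifulSoup(review, 'html.parser').get_text()
    # 2. Remove non-letters, keeping digits as well.
    review_text = re.sub('[^a-zA-Z0-9]', ' ', review_text)
    # 3. Lowercase and split into tokens.
    words = review_text.lower().split()
    # 4. Optionally drop English stop words.
    if remove_stopwords:
        stops = set(stopwords.words('english'))
        words = [w for w in words if w not in stops]
    return words
```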

Exceeding spark.akka.frameSize when saving Word2VecModel

我的未来我决定 submitted on 2019-12-24 15:22:36
Question: I am using Spark's Word2Vec to train some word vectors. The training essentially works, but when it comes to saving the model I get an org.apache.spark.SparkException saying:

```
Job aborted due to stage failure: Serialized task 1278:0 was 1073394582 bytes,
which exceeds max allowed: spark.akka.frameSize (134217728 bytes) - reserved
(204800 bytes). Consider increasing spark.akka.frameSize or using broadcast
variables for large values.
```

The stack trace points at line 190, but there is a
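A minimal sketch of raising the frame size as the error message suggests (spark.akka.frameSize takes a value in MB and exists only in Spark 1.x; it was removed along with the Akka transport in Spark 2.x):

```python
from pyspark import SparkConf, SparkContext

# Raise the Akka frame size from its 128 MB default (value is in MB).
conf = SparkConf().setAppName('word2vec-save') \
                  .set('spark.akka.frameSize', '1024')
sc = SparkContext(conf=conf)
```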

Python Gensim word2vec vocabulary key

陌路散爱 submitted on 2019-12-24 07:57:42
Question: I want to train a word2vec model with gensim. I heard that the vocabulary corpus should be unicode, so I converted it to unicode.

```python
# -*- encoding:utf-8 -*-
# !/usr/bin/env python
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from gensim.models import Word2Vec
import pprint

with open('parsed_data.txt', 'r') as f:
    corpus = map(unicode, f.read().split('\n'))

model = Word2Vec(size=128, window=5, min_count=5, workers=4)
model.build_vocab(corpus, keep_raw_vocab=False)
model.train(corpus)
model.save('w2v')
```
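One likely problem with the code above is the input shape: gensim's Word2Vec expects an iterable of token lists, and a raw string is iterated character by character. A minimal sketch, keeping the question's gensim 3.x-style size parameter:

```python
import io
from gensim.models import Word2Vec

# Each sentence must be a list of tokens, not a raw unicode string.
with io.open('parsed_data.txt', 'r', encoding='utf-8') as f:
    corpus = [line.split() for line in f if line.strip()]

model = Word2Vec(corpus, size=128, window=5, min_count=5, workers=4)
model.save('w2v')
```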

spacy similarity method doesn't work correctly

落爺英雄遲暮 submitted on 2019-12-24 04:52:11
Question: I always get a lot of help from Stack Overflow; thank you all. I am doing simple natural language processing using spacy. I'm working on filtering out words by measuring the similarity between words. I wrote and used the following simple code, shown in the spacy documentation, but the result does not match the documentation's.

```python
import spacy

nlp = spacy.load('en_core_web_lg')
tokens = nlp('dog cat banana')
for token1 in tokens:
    for token2 in tokens:
        sim = token1.similarity(token2)
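A runnable version that prints the pairwise scores (the large model must be downloaded first with `python -m spacy download en_core_web_lg`):

```python
import spacy

nlp = spacy.load('en_core_web_lg')
tokens = nlp('dog cat banana')

for token1 in tokens:
    for token2 in tokens:
        # Token.similarity returns the cosine similarity of the word vectors.
        print(token1.text, token2.text, token1.similarity(token2))
```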

Why do we use the input-hidden weight matrix as the word vectors instead of the hidden-output weight matrix?

六月ゝ 毕业季﹏ submitted on 2019-12-24 00:48:54
Question: In word2vec, after training, we get two weight matrices: (1) the input-to-hidden weight matrix and (2) the hidden-to-output weight matrix. People use the input-to-hidden weight matrix as the word vectors (each row corresponds to a word). This is where my confusion comes from: why do people use the input-to-hidden weight matrix as the word vectors instead of the hidden-to-output matrix? And why don't we just add the softmax activation function to the hidden layer rather than the output layer, thus preventing
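An illustrative sketch (not from the question) of the two matrices in a bare-bones skip-gram model, where the conventional word vectors are rows of the input-to-hidden matrix and the softmax sits at the output layer:

```python
import numpy as np

V, D = 10000, 300                      # vocabulary size, embedding dimension
W_in = 0.01 * np.random.randn(V, D)    # input -> hidden (the embedding lookup)
W_out = 0.01 * np.random.randn(D, V)   # hidden -> output

word_id = 42
h = W_in[word_id]                      # hidden activation: one row of W_in
scores = h @ W_out                     # logits over the whole vocabulary
probs = np.exp(scores - scores.max())
probs /= probs.sum()                   # softmax applied at the output layer

word_vector = W_in[word_id]            # the conventional "word vector"
```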

Text Classification in Practice (1): Pre-training Word Vectors with word2vec

浪子不回头ぞ submitted on 2019-12-23 12:30:54
1 Overview: This text-classification series will run to about ten posts, covering classification based on word2vec pre-training as well as on the latest pre-trained models (ELMo, BERT, etc.). The series comprises: word2vec pre-trained word vectors; the textCNN model; the charCNN model; the Bi-LSTM model; the Bi-LSTM + Attention model; the RCNN model; the Adversarial LSTM model; the Transformer model; the ELMo pre-trained model; and the BERT pre-trained model. All code is in the textClassifier repository.

2 Dataset: The dataset is the IMDB movie reviews, with three data files under /data/rawData: unlabeledTrainData.tsv, labeledTrainData.tsv, and testData.tsv. Text classification requires labeled data (labeledTrainData), but when training the word2vec word-vector model (unsupervised learning) the unlabeled data can be used as well.

3 Data preprocessing: The IMDB reviews are English text, and since this series focuses on the classification models, preprocessing is simple: just remove punctuation and HTML tags and lowercase the text. The code is as follows:

```python
import pandas as pd
from bs4 import BeautifulSoup
```
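The excerpt cuts off at the imports; a hedged sketch of the preprocessing step it describes (the review column name is assumed from the Kaggle IMDB .tsv layout):

```python
import re
import pandas as pd
from bs4 import BeautifulSoup

def clean_review(text):
    text = BeautifulSoup(text, 'html.parser').get_text()  # strip HTML tags
    text = re.sub(r'[^a-zA-Z]', ' ', text)                # drop punctuation
    return text.lower()                                   # lowercase

df = pd.read_csv('/data/rawData/labeledTrainData.tsv', sep='\t')
df['review'] = df['review'].apply(clean_review)
```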