gensim

Exploring Word Similarity in a Wikipedia Corpus

有些话、适合烂在心里 Submitted on 2020-02-22 08:47:30
I previously wrote "Word2Vec Experiments on Chinese and English Wikipedia Corpora", and recently quite a few readers have left questions under that article. Some of my recent work also involves Word2Vec, so I did my homework again: I re-read the relevant Word2Vec material, tried gensim's updated interfaces, and googled English and Chinese resources for "wikipedia word2vec" and "维基百科 word2vec". Most of what I found follows the same path as that article: extract the Wikipedia corpus with gensim's preprocessing module gensim.corpora.WikiCorpus, store each article as one line of text, and then train a word vector model with gensim's Word2Vec module. Here I offer another way to process the Wikipedia corpus, train a word vector model, and compute word similarity (Word Similarity). If your English is good, I recommend starting your Word2Vec reading with this article: Getting started with Word2Vec.

This time we take only the English Wikipedia corpus as an example. As before, the first step is to download Wikipedia's latest packed and compressed XML data. In the latest English dump listing at https://dumps.wikimedia.org/enwiki/latest/, find and download "enwiki-latest-pages-articles.xml.bz2"; this full English Wikipedia dump was packed around April 4, 2017.
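
As a reference point for the "old path" described above, here is a minimal sketch of the WikiCorpus extraction step, assuming gensim is installed; the file names are illustrative, and the helper is split out for clarity:

```python
# Sketch of the preprocessing path mentioned above: gensim's WikiCorpus
# streams articles out of the .xml.bz2 dump, and we write one article per line.

def article_to_line(tokens):
    """Join one article's token list into a single line of text."""
    return " ".join(tokens) + "\n"

def extract_wiki(dump_path, out_path):
    """Stream articles from a Wikipedia dump into a one-article-per-line file."""
    from gensim.corpora import WikiCorpus  # heavy dependency, imported lazily
    wiki = WikiCorpus(dump_path, dictionary={})  # dictionary={} skips vocab building
    with open(out_path, "w", encoding="utf-8") as out:
        for tokens in wiki.get_texts():
            out.write(article_to_line(tokens))

# Usage (after downloading the dump):
# extract_wiki("enwiki-latest-pages-articles.xml.bz2", "wiki.en.text")
```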

OSError: Not a gzipped file (b've') python

冷暖自知 Submitted on 2020-02-06 07:56:23
Question: I have the following code, and I made sure its extension and name are correct. However, I still get the error shown below. I saw another person ask a similar question here on Stack Overflow ("Failed to load a .bin.gz pre trained words2vecx") and read the answer, but it did not help me. Any suggestions on how to fix this? Input:

```python
import gensim

word2vec_path = "GoogleNews-vectors-negative300.bin.gz"
word2vec = gensim.models.KeyedVectors.load_word2vec_format(word2vec_path, binary=True)
```
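
This particular error is easy to diagnose with the standard library alone (a sketch, not part of the original thread): every gzip stream begins with the magic bytes 0x1f 0x8b, and "Not a gzipped file (b've')" means the file starts with the ASCII text "ve" instead, so the download is not actually gzip data (often an HTML error page or a truncated file saved under a .gz name):

```python
GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes

def looks_like_gzip_header(head):
    """Return True if `head` (the first bytes of a file) starts with the gzip magic."""
    return head[:2] == GZIP_MAGIC

def looks_like_gzip(path):
    """Return True if the file at `path` begins with the gzip magic bytes."""
    with open(path, "rb") as f:
        return looks_like_gzip_header(f.read(2))

# Usage:
# looks_like_gzip("GoogleNews-vectors-negative300.bin.gz")  # False -> re-download
```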

Loading word vector files with gensim

♀尐吖头ヾ Submitted on 2020-02-01 12:18:02

```python
# -*- coding: utf-8 -*-
# author: huihui
# date: 2020/1/31 7:58 PM
"""Train word vectors from a corpus and save the vector files."""
import gensim

# The input text must be segmented (tokenized) in advance.
input_file = "corp_seg.txt"
sentences = gensim.models.word2vec.Text8Corpus(input_file)

# Train the word vectors.
model = gensim.models.word2vec.Word2Vec(
    sentences, sg=1, size=100, window=5, min_count=1,
    negative=3, sample=0.001, hs=1, workers=40)

# Save the model and the word vector file.
model.save("corp_word2vec.model")
model.wv.save_word2vec_format("corp_word2vec.txt")

# Load the saved model.
model = gensim.models.word2vec.Word2Vec.load("corp_word2vec.model")
```

Word2Vec实践

拥有回忆 Submitted on 2020-01-31 16:29:36
Word2Vec in Practice

1. Overview of the gensim word2vec API
2. Model training: reading the data, preprocessing, training, testing the results
3. Comparison with Fasttext: a brief introduction to Fasttext, Fasttext model training, and a comparison of the two

I had previously studied the theory behind Word2Vec but had never used it in a project. Having recently obtained a batch of patent data, I decided to get hands-on. Data reference: https://github.com/newzhoujian/LCASPatentClassification

1. Overview of the gensim word2vec API. In gensim, the word2vec-related APIs live in the package gensim.models.word2vec, and the algorithm parameters are on the class gensim.models.word2vec.Word2Vec. The parameters worth noting are:

1) sentences: the corpus to analyze. It can be a list, or it can be streamed from a file; an example of reading from a file follows later.
2) size: the dimensionality of the word vectors, 100 by default. The right value generally depends on corpus size: for a modest corpus, say under 100 MB of text, the default is usually fine; for a very large corpus, a higher dimensionality is recommended.
3) window: the maximum context distance between words, written c in the theory article. The larger the window, the more distant the words that can form a context relation with a given word. The default is 5.

How to use the infer_vector in gensim.doc2vec?

若如初见. Submitted on 2020-01-29 05:29:13
Question:

```python
import numpy as np
from numpy import linalg
import gensim

def cosine(vector1, vector2):
    cosV12 = np.dot(vector1, vector2) / (linalg.norm(vector1) * linalg.norm(vector2))
    return cosV12

model = gensim.models.doc2vec.Doc2Vec.load('Model_D2V_Game')
string = '民生 为了 父亲 我 要 坚强 地 ...'
tokens = string.split(' ')
vector1 = model.infer_vector(doc_words=tokens, alpha=0.1, min_alpha=0.0001, steps=5)
vector2 = model.docvecs.doctag_syn0[0]
print(cosine(vector2, vector1))
# -0.0232586
```

I use training data to train a doc2vec model. Then, I use infer_vector() to generate a vector given a
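
As a side note, the cosine helper in the snippet can be written with the standard library alone, which makes it easy to sanity-check independently of the doc2vec model (this sketch is not from the original thread):

```python
import math

def cosine(v1, v2):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0
```

A near-zero similarity like -0.023 is also consistent with how infer_vector behaves: it starts from a random initialization, so repeated calls return different vectors, and with only steps=5 the result is noisy; raising the step/epoch count (e.g. to 50 or more) typically stabilizes the inferred vector.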

word2vec - KeyError: “word X not in vocabulary”

大城市里の小女人 Submitted on 2020-01-25 08:22:27
Question: I am using the Word2Vec implementation from the gensim module to construct word embeddings for the sentences I have in a plain text file. Although the word happy is defined in the vocabulary, I get the error KeyError: "word 'happy' not in vocabulary". I tried applying the answers given to a similar question, but that did not work, so I am posting my own question. Here is the code:

```python
try:
    data = []
    with open(TXT_PATH, 'r', encoding='utf-8') as txt_file:
        for line in txt_file:
            for part in line
```
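
Two common causes of this KeyError, independent of gensim itself: the token used at lookup time does not exactly match a token seen at training time (case or attached punctuation), or the word appeared fewer than `min_count` times (5 by default) and was dropped from the vocabulary. A defensive-lookup sketch, with a plain dict standing in for `model.wv`:

```python
def safe_lookup(vectors, word):
    """Return the vector for `word`, trying a normalized form before giving up."""
    if word in vectors:
        return vectors[word]
    # Retry with lowercasing and leading/trailing punctuation stripped.
    normalized = word.lower().strip(".,!?")
    if normalized in vectors:
        return vectors[normalized]
    return None  # genuinely out of vocabulary

vectors = {"happy": [0.1, 0.2]}
print(safe_lookup(vectors, "Happy!"))  # [0.1, 0.2]
```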

Gensim: Any chance to get word frequency in Word2Vec format?

旧时模样 Submitted on 2020-01-25 06:54:28
Question: I am doing my research with a fasttext pre-trained model, and I need word frequency for further analysis. Do the .vec or .bin files provided on the fasttext website contain word frequency information? If yes, how do I get it? I am using load_word2vec_format to load the model, and tried model.wv.vocab[word].count, which only gives the word's frequency rank, not its original frequency.
Answer 1: I don't believe those formats include any word frequency information. To the extent any pre-trained word
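
Given that the word2vec text/binary formats store only vectors (per the answer above), one practical fallback, not mentioned in the original thread, is to count frequencies yourself over whatever corpus you are analyzing, with the standard library:

```python
from collections import Counter

def word_frequencies(lines):
    """Count whitespace-separated token frequencies over an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

freq = word_frequencies(["the cat sat", "the mat"])
print(freq["the"])  # 2
```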

Text Classification (Spacy) in place of Gensim

佐手、 Submitted on 2020-01-25 06:47:09
Question: Hello, I am using the gensim library for semantic text similarity classification, but I fail to load the gensim data file, and the program takes too much time to execute when I run the cells in a Jupyter notebook. So my question is: can we use the spacy library to overcome this type of error, and can we find the similarity between two document files? I have seen tf-idf used for semantic similarity. Here is the error: MemoryError: Unable to allocate 3.35 GiB for an array with shape (3000000, 300)
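
The 3.35 GiB in that error is exactly what the vector matrix requires: 3,000,000 vectors × 300 dimensions × 4 bytes (float32). Before switching libraries, one workaround within gensim is the real `limit` parameter of load_word2vec_format, which reads only the first N (most frequent) vectors and shrinks memory proportionally; the sketch below shows the arithmetic and a hedged usage example:

```python
def vectors_gib(n_words, dims, bytes_per_float=4):
    """Estimate the in-memory size of an n_words x dims float32 vector matrix, in GiB."""
    return n_words * dims * bytes_per_float / 2**30

print(round(vectors_gib(3_000_000, 300), 2))  # 3.35

# Usage sketch (assumes gensim; limit=500_000 needs ~0.56 GiB instead of 3.35 GiB):
# from gensim.models import KeyedVectors
# kv = KeyedVectors.load_word2vec_format(
#     "GoogleNews-vectors-negative300.bin.gz", binary=True, limit=500_000)
```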