gensim

Is it possible to re-train a word2vec model (e.g. GoogleNews-vectors-negative300.bin) from a corpus of sentences in python?

Submitted by 与世无争的帅哥 on 2019-11-29 08:31:04
Question: I am using the pre-trained Google News dataset to get word vectors with the Gensim library in Python:

```python
model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
```

After loading the model I convert the words of the training review sentences into vectors:

```python
# reading all sentences from the training file
with open('restaurantSentences', 'r') as infile:
    x_train = infile.readlines()

# cleaning sentences
x_train = [review_to_wordlist(review, remove_stopwords=True) for review in x_train]
```

…
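The helper `review_to_wordlist` is used but not shown in the excerpt. A minimal standard-library sketch of what such a cleaning function might look like (the stop-word list and exact cleaning steps here are assumptions, not the asker's actual code):

```python
import re

# A tiny illustrative stop-word list; a real one (e.g. NLTK's) would be much larger.
STOPWORDS = {'the', 'a', 'an', 'is', 'it', 'and', 'of', 'to'}

def review_to_wordlist(review, remove_stopwords=False):
    """Strip non-letters, lower-case, split into words, optionally drop stop words."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', review)
    words = letters_only.lower().split()
    if remove_stopwords:
        words = [w for w in words if w not in STOPWORDS]
    return words

print(review_to_wordlist("The food was great, and the service too!",
                         remove_stopwords=True))
# ['food', 'was', 'great', 'service', 'too']
```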

Document topical distribution in Gensim LDA

Submitted by 不问归期 on 2019-11-29 03:19:44
I've derived an LDA topic model from a toy corpus as follows:

```python
documents = ['Human machine interface for lab abc computer applications',
             'A survey of user opinion of computer system response time',
             'The EPS user interface management system',
             'System and human system engineering testing of EPS',
             'Relation of user perceived response time to error measurement',
             'The generation of random binary unordered trees',
             'The intersection graph of paths in trees',
             'Graph minors IV Widths of trees and well quasi ordering',
             'Graph minors A survey']
texts = [[word for word in document.lower().split()] for …
```
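The usual next step after tokenizing is gensim's `Dictionary`/`doc2bow`, which turns each tokenized document into sorted `(token_id, count)` pairs. A standard-library-only sketch of that mapping, for illustration (ids here are assigned in first-seen order, which is not necessarily gensim's assignment):

```python
documents = ['Human machine interface for lab abc computer applications',
             'A survey of user opinion of computer system response time']

texts = [[word for word in document.lower().split()] for document in documents]

# Assign an integer id to each unique token, in first-seen order.
token2id = {}
for text in texts:
    for word in text:
        if word not in token2id:
            token2id[word] = len(token2id)

def doc2bow(text):
    """Bag-of-words: sorted (token_id, count) pairs, like gensim's doc2bow."""
    counts = {}
    for word in text:
        tid = token2id[word]
        counts[tid] = counts.get(tid, 0) + 1
    return sorted(counts.items())

print(doc2bow(texts[0]))
# [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
```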

What is the simplest way to get tfidf with pandas dataframe?

Submitted by 烈酒焚心 on 2019-11-28 18:49:52
I want to calculate tf-idf for the documents below, using Python and pandas:

```python
import pandas as pd

df = pd.DataFrame({'docId': [1, 2, 3],
                   'sent': ['This is the first sentence',
                            'This is the second sentence',
                            'This is the third sentence']})
```

First, I thought I would need to get a word count for each row, so I wrote a simple function:

```python
def word_count(sent):
    word2cnt = dict()
    for word in sent.split():
        if word in word2cnt:
            word2cnt[word] += 1
        else:
            word2cnt[word] = 1
    return word2cnt
```

And then I applied it to each row:

```python
df['word_count'] = df['sent'].apply(word_count)
```

But now I'm lost. I know there's an …
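The commonly cited short answer is scikit-learn's `TfidfVectorizer` on `df['sent']`. For illustration, here is a standard-library-only sketch of the computation itself, on the same three sentences (this uses the classic `count * log(N / df)` weighting; `TfidfVectorizer` uses a smoothed, normalized variant, so its numbers would differ):

```python
import math

docs = ['This is the first sentence',
        'This is the second sentence',
        'This is the third sentence']

tokenized = [d.lower().split() for d in docs]
N = len(tokenized)

# Document frequency: number of documents containing each term.
df = {}
for words in tokenized:
    for term in set(words):
        df[term] = df.get(term, 0) + 1

def tfidf(words):
    """Raw term count times log(N / df) for each term in one document."""
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return {w: c * math.log(N / df[w]) for w, c in counts.items()}

scores = tfidf(tokenized[0])
print(scores['first'])  # log(3/1): only the first document contains 'first'
print(scores['this'])   # log(3/3) == 0.0: 'this' appears in every document
```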

Python: training word vectors on Wikipedia with jieba (结巴分词) and Word2Vec

Submitted by 流过昼夜 on 2019-11-28 18:21:43
jieba (结巴分词) is a Chinese word segmenter with implementations in several languages; its overall accuracy is decent and its feature set is sufficient. I'll use the Python version here, but versions for the other mainstream languages are also available.

Word2Vec originated as a Google project, and it caught my attention as soon as I first encountered it. Roughly speaking, it uses a deep neural network to map words into an N-dimensional space; once words are vectors, we can finally use them conveniently for downstream natural language processing. Python's gensim library includes a word2vec module, which is all we need. Next we'll process Wikipedia and use it as the training corpus. (Module docs: http://radimrehurek.com/gensim/models/word2vec.html )

This post draws on: http://www.52nlp.cn/中英文维基百科语料上的word2vec实验

Processing

Wikipedia data is convenient to work with. For one thing, Wiki provides ready-made corpus dumps (reportedly updated continuously); the Chinese dump isn't large, but it's still far easier than crawling the data yourself. English is even better and makes an excellent corpus. Of course, this only holds for general-purpose text: in my tests the results on specialized vocabulary were mediocre (specialized fields have their own wikis, and there aren't that many Chinese entries to begin with).

First, we convert the Wiki dump into plain text for processing; ready-made code for this step is available in the reference above.

process_wiki_data.py:

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# …
```

Python Gensim: how to calculate document similarity using the LDA model?

Submitted by 前提是你 on 2019-11-28 16:30:26
Question: I've got a trained LDA model and I want to calculate the similarity score between two documents from the corpus I trained my model on. After studying all the Gensim tutorials and functions, I still can't get my head around it. Can somebody give me a hint? Thanks!

Answer 1: I don't know if this will help, but I managed to get successful results on document matching and similarity when using the actual document as a query:

```python
dictionary = corpora.Dictionary.load('dictionary.dict')
corpus = corpora…
```
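Once each document is represented as a topic distribution, similarity is just a comparison of two probability vectors. A standard-library sketch using cosine similarity (what gensim's similarity indexes compute) and Hellinger distance (often recommended for comparing LDA distributions); the two distributions below are made-up examples, not output of a real model:

```python
import math

def cosine(p, q):
    """Cosine similarity between two vectors; 1.0 means identical direction."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

def hellinger(p, q):
    """Distance between two probability distributions; 0.0 means identical."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

# Hypothetical topic distributions for two documents over 4 topics.
doc_a = [0.7, 0.1, 0.1, 0.1]
doc_b = [0.6, 0.2, 0.1, 0.1]

print(round(cosine(doc_a, doc_b), 3))
print(hellinger(doc_a, doc_a))  # 0.0: identical distributions
```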

Doc2vec: How to get document vectors

Submitted by  ̄綄美尐妖づ on 2019-11-28 15:53:04
How do I get document vectors for two text documents using Doc2vec? I am new to this, so it would be helpful if someone could point me in the right direction or to a tutorial. I am using gensim:

```python
doc1 = ["This is a sentence", "This is another sentence"]
documents1 = [doc.strip().split(" ") for doc in doc1]
model = doc2vec.Doc2Vec(documents1, size=100, window=300, min_count=10, workers=4)
```

I get `AttributeError: 'list' object has no attribute 'words'` whenever I run this.

If you want to train a Doc2Vec model, your data set needs to contain lists of words (similar to the Word2Vec format) and …
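The AttributeError arises because Doc2Vec expects objects with `.words` and `.tags` attributes, i.e. gensim's `TaggedDocument`. A standard-library sketch of preparing documents in that shape, with a namedtuple standing in for `TaggedDocument` so the example runs without gensim:

```python
from collections import namedtuple

# Stand-in for gensim.models.doc2vec.TaggedDocument: a (words, tags) pair.
TaggedDocument = namedtuple('TaggedDocument', ['words', 'tags'])

doc1 = ["This is a sentence", "This is another sentence"]
documents1 = [TaggedDocument(words=text.strip().split(" "), tags=[i])
              for i, text in enumerate(doc1)]

print(documents1[0].words)  # ['This', 'is', 'a', 'sentence']
print(documents1[0].tags)   # [0]
```

With real gensim you would import `TaggedDocument` from `gensim.models.doc2vec` and pass `documents1` to `Doc2Vec`; the per-document vectors are then looked up by tag (`model.dv[0]` in gensim 4, `model.docvecs[0]` in older versions).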

Why are multiple model files created in gensim word2vec?

Submitted by 我与影子孤独终老i on 2019-11-28 12:06:12
When I try to create a word2vec model (skip-gram with negative sampling) I get 3 files as output:

word2vec (file)
word2vec.syn1neg.npy (NPY file)
word2vec.wv.syn0.npy (NPY file)

I am just worried about why this happens, as in my previous word2vec tests I only got one model file (no .npy files). Please help me.

Models with larger internal vector arrays can't be saved via Python pickle to a single file, so beyond a certain size threshold the gensim save() method stores the subsidiary arrays in separate files, using the more efficient raw numpy array format (.npy). You …

Interpreting the sum of TF-IDF scores of words across documents

Submitted by 半城伤御伤魂 on 2019-11-28 09:04:28
First, let's extract the TF-IDF scores per term per document:

```python
from gensim import corpora, models, similarities

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
stoplist = …
```
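The aggregation whose interpretation is in question can be sketched without gensim: compute a tf-idf score per term per document, then sum each term's scores across all documents. A standard-library-only illustration on a subset of the corpus above (raw counts with `log(N / df)` idf; gensim normalizes its tf-idf vectors, so treat these numbers as illustrative only):

```python
import math

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "System and human system engineering testing of EPS"]

tokenized = [d.lower().split() for d in documents]
N = len(tokenized)

# Document frequency: number of documents containing each term.
df = {}
for words in tokenized:
    for term in set(words):
        df[term] = df.get(term, 0) + 1

# Sum each term's tf-idf score across all documents.
totals = {}
for words in tokenized:
    for term in words:
        totals[term] = totals.get(term, 0.0) + math.log(N / df[term])

# Terms occurring in every document get idf == 0, hence a zero sum;
# rare-but-repeated terms accumulate the largest sums.
print(totals['human'])              # once in each of 2 of the 3 docs
print(max(totals, key=totals.get))  # term with the highest summed score
```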

PyTorch / Gensim - How to load pre-trained word embeddings

Submitted by 隐身守侯 on 2019-11-28 05:58:53
I want to load a pre-trained word2vec embedding from gensim into a PyTorch embedding layer. So my question is: how do I get the embedding weights loaded by gensim into the PyTorch embedding layer? Thanks in advance!

I just wanted to report my findings about loading a gensim embedding with PyTorch.

Solution for PyTorch 0.4.0 and newer: from v0.4.0 on there is a new function from_pretrained() which makes loading an embedding very comfortable. Here is an example from the documentation:

```python
>>> # FloatTensor containing pretrained weights
>>> weight = torch.FloatTensor([[1, 2.3, 3], [4, 5.1, 6.3]])
>>> …
```

How to print the LDA topics models from gensim? Python

Submitted by 半世苍凉 on 2019-11-28 04:01:17
Using gensim I was able to extract topics from a set of documents with LSA, but how do I access the topics generated by the LDA models? When printing lda.print_topics(10) the code gave the following error, because print_topics() returns a NoneType:

```
Traceback (most recent call last):
  File "/home/alvas/workspace/XLINGTOP/xlingtop.py", line 93, in <module>
    for top in lda.print_topics(2):
TypeError: 'NoneType' object is not iterable
```

The code:

```python
from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from itertools import izip

documents = ["Human machine …
```