word2vec | 易学教程

FastText using pre-trained word vector for text classification

I am working on a text classification problem: given some text, I need to assign it certain given labels. I have tried the fastText library by Facebook, which has two utilities of interest to me:
A) Word vectors with pre-trained models
B) Text classification utilities
However, these seem to be completely independent tools, as I have been unable to find any tutorial that combines the two. What I want is to classify text while taking advantage of the pre-trained word-vector models. Is there any way to do this?
FastText's native classification
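One possible route, sketched here under stated assumptions rather than taken from the excerpt, is fastText's own supervised mode, whose Python binding accepts a `pretrainedVectors` argument pointing at a `.vec` text file. The helper names `format_example` and `train_with_pretrained` are hypothetical, the file paths are placeholders, and `dim` is assumed to equal the dimension of the pre-trained vectors.

```python
def format_example(labels, text):
    """Build one training line in fastText's supervised format."""
    prefix = " ".join("__label__" + label for label in labels)
    # Collapse all whitespace so the example stays on a single line.
    return prefix + " " + " ".join(text.split())


def train_with_pretrained(train_path, vec_path, dim=300):
    """Train a classifier whose input vectors start from pre-trained ones.

    Assumption: vec_path is a .vec text file with vectors of size `dim`.
    """
    import fasttext  # imported lazily; requires the fasttext package
    return fasttext.train_supervised(
        input=train_path,
        pretrainedVectors=vec_path,
        dim=dim,
    )


# Only the formatting helper is exercised here; training needs real files.
print(format_example(["sports"], "The match ended\nin a draw"))
```

Each formatted line would be written to the training file, one example per line, before calling the training helper.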

How to obtain antonyms through word2vec?

I am currently working on a word2vec model using gensim in Python, and want to write a function that finds the antonyms and synonyms of a given word. For example:
    antonym("sad") = "happy"
    synonym("upset") = "enraged"
Is there a way to do that in word2vec?
In word2vec you can find analogies in the following way:
    model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
    model.most_similar(positive=['good', 'sad'], negative=['bad'])
    [(u'wonderful', 0.6414928436279297), (u'happy', 0.6154338121414185), (u'great', 0.5803680419921875), (u'nice', 0
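The analogy query in the excerpt (good is to bad as ? is to sad) can be illustrated without a trained model. The sketch below is a simplified stand-in, not gensim's implementation: the two-dimensional vectors in `TOY` are made up for illustration, and `analogy` is a hypothetical helper that adds the positive vectors, subtracts the negative ones, and ranks the remaining words by cosine similarity.

```python
import math

# Made-up 2-d vectors: axis 0 is roughly "sentiment", axis 1 "topic".
TOY = {
    "good": [1.0, 1.0],
    "bad": [-1.0, 1.0],
    "happy": [1.0, -1.0],
    "sad": [-1.0, -1.0],
    "table": [0.0, 3.0],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(vectors, positive, negative, topn=1):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + v for q, v in zip(query, vectors[word])]
    for word in negative:
        query = [q - v for q, v in zip(query, vectors[word])]
    # Query words themselves are excluded from the candidates.
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


print(analogy(TOY, positive=["good", "sad"], negative=["bad"]))
```

With real embeddings the result depends entirely on the training corpus, so an analogy query like this is a heuristic for finding an opposite, not a guaranteed antonym lookup.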

Learning Python: Text Data Analysis 1 (Topic Extraction + Word Vectorization)

Original post: http://blog.sina.com.cn/s/blog_727a704c0102vn44.html
Simple text data analysis with Python, covering:
1. Word segmentation
2. Building the corpus, with tf-idf weighting
3. LDA topic extraction model
4. Word vectorization with word2vec
Reference: http://zhuanlan.zhihu.com/textmining-experience/1963076
    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    import MySQLdb
    import pandas as pd
    import pandas.io.sql as sql
    import jieba
    import nltk
    import jieba.posseg as pseg
    from gensim import corpora, models, similarities
    import re
    # import logging
    # logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s',level=logging.INFO)
    # reload(sys)
    # sys.setdefaultencoding('utf-8')
    if __name__ == '__main__':

Steps for Training Word2Vec with the Chinese Wikipedia Corpus and Gensim

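word2vec comprises CBOW and Skip-gram. The underlying theory is widely covered online, so I will not go into it here. In short, word2vec turns the words of natural language into dense vectors that a computer can work with: a reduced-dimension representation of the one-hot vocabulary that captures each word's features and preserves the relationships between words. This post records the process of converting Chinese words into word vectors.
1. Download a Chinese corpus
A Chinese corpus can be downloaded from Wikipedia; these dumps are updated frequently and are quite comprehensive. Download address: ( https://dumps.wikimedia.org/zhwikisource/20180620/ ). Since I only wanted to get familiar with the process, I downloaded a relatively small package of a little over 200 MB.
2. Parse the corpus package
The package downloaded from Wikipedia cannot be used directly, but fortunately someone has already solved this problem: use WikiExtractor to extract the raw package downloaded in step 1. WikiExtractor download address: ( https://github.com/attardi/wikiextractor ). Open cmd and enter the following command to parse the Wikipedia corpus, after first changing to the directory where you saved the corpus package and WikiExtractor:
    python WikiExtractor.py -b 400M -o extracted zhwiki-latest-pages-articles.xml.bz2
400M is the maximum size of each extracted file. This produces the directory extracted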

Training Word Vectors with gensim word2vec in Python

Preparation
Once anaconda is installed, gensim can be installed from the command window with the command conda install gensim
About gensim
gensim is a powerful natural language processing tool that includes a great many common models. Let's take a look:
interfaces – Core gensim interfaces
utils – Various utility functions
matutils – Math utils
corpora.bleicorpus – Corpus in Blei’s LDA-C format
corpora.dictionary – Construct word<->id mappings
corpora.hashdictionary – Construct word<->id mappings
corpora.lowcorpus – Corpus in List-of-Words format
corpora.mmcorpus – Corpus in Matrix Market format
corpora.svmlightcorpus – Corpus in SVMlight format
corpora.wikicorpus – Corpus from a Wikipedia dump
corpora.textcorpus – Building

Introduction to fasttext

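Basic theory of fasttext
Preface
fasttext is a highly efficient NLP model for text classification, built on word vectorization. Although its principle is fairly simple, it involves quite a few small tricks for improving speed and accuracy. This article introduces those tricks mainly from a theoretical angle (I have long wanted to find time to dig through the source code). Parts that resemble word2vec are mentioned briefly but not expanded on (the same author proposed word2vec first and fasttext later, and the two have much in common). For word2vec background, see peghoty's word2vector中的数学原理详解.pdf [1]. This introduction cannot cover everything, and my understanding may be inaccurate in places, so corrections are welcome.
Main text
fasttext is very similar to CBOW in word2vec. For each text, step one vectorizes all the words as input; step two averages all the input vectors in the hidden layer to obtain a new vector; step three outputs the prediction. Each of the three steps is explained below.
Step 1: Input
In word2vec the input is simply the vectorized bag of words. fasttext additionally brings in the idea of n-grams. Take the sentence "我喜欢她" ("I like her"): if only the collection of its words is used to represent the sentence, we get ("我", "喜欢", "她"). The problem is that the sentence "她喜欢我" ("she likes me") yields the same collection of words ("我", "喜欢", "她"), yet the two sentences mean completely different things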
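The word-order problem described above can be checked directly. The sketch below is only an illustration: `word_ngrams` is a hypothetical helper, and the sentences are assumed to be already segmented into words. It lists the n-grams explicitly, whereas fastText itself hashes them into buckets.

```python
def word_ngrams(tokens, n):
    """Return all contiguous n-word sequences in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Pre-segmented versions of the two example sentences.
sentence_a = ["我", "喜欢", "她"]  # "I like her"
sentence_b = ["她", "喜欢", "我"]  # "she likes me"

# As bags of words the two sentences are indistinguishable.
same_bag = sorted(sentence_a) == sorted(sentence_b)

# Adding bigrams restores enough word order to tell them apart.
bigrams_a = set(word_ngrams(sentence_a, 2))
bigrams_b = set(word_ngrams(sentence_b, 2))

print(same_bag)
print(bigrams_a == bigrams_b)
```

The two sentences share every unigram but no bigram, which is why bigram features can separate them.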

Word Vectors (from one-hot to word2vec)

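A word vector represents a word as a vector of numbers. There are many ways to construct such a vector, including one-hot encoding, co-occurrence-matrix methods, word2vec, and dynamic word vectors such as ELMo.
I. One-hot vectors
Advantages: easy to understand, sparse storage
Drawbacks: curse of dimensionality, lexical gap (the vectors are all isolated from one another)
II. Co-occurrence-matrix methods
The co-occurrence matrix X is an n*n symmetric matrix whose dimension grows with the vocabulary size n; singular value decomposition (SVD) can be used to reduce its dimension. Problems remain, however: (1) the dimension of X changes frequently; (2) the matrix is sparse because most words never co-occur; (3) the high dimension brings high computational cost.
III. Neural-network methods (word embedding): word2vec
Word2Vec uses an Embedding layer to convert the One-Hot Encoder output into low-dimensional continuous values (dense vectors), and words with similar meanings are mapped to nearby positions in the vector space. This addresses the lexical-gap and dimensionality problems of the One-Hot Encoder.
1. Embedding layer
The Embedding layer (input layer to hidden layer) is a fully connected layer that takes the one-hot vector as input and whose number of hidden nodes equals the word-vector dimension. The parameters of this fully connected layer are exactly the word-vector table we want to obtain!
2. Overview of the word2vec model
word2vec is really a simplified neural network. In effect it trains a language model and obtains the word vectors through it. A language model predicts the probability of the next character from the previous n characters; it is just a multi-class classifier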
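The claim that the Embedding layer's weights are the word-vector table can be verified with a small calculation: multiplying a one-hot vector by the weight matrix simply selects one row. This is a minimal sketch with made-up numbers; `one_hot`, `project`, and the matrix `W` are illustrative names, not part of any library.

```python
def one_hot(index, size):
    """One-hot vector of length `size` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def project(vector, weights):
    """Multiply a 1 x V row vector by a V x D weight matrix."""
    dim = len(weights[0])
    return [
        sum(vector[row] * weights[row][col] for row in range(len(weights)))
        for col in range(dim)
    ]


# Hypothetical weights: vocabulary of 3 words, 2-dimensional word vectors.
W = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]

# Passing word 1 through the fully connected layer ...
hidden = project(one_hot(1, 3), W)
# ... yields exactly row 1 of W, so the layer acts as a lookup table.
print(hidden == W[1])
```

In practice the multiplication is skipped and the row is indexed directly, which is why the trained weight matrix doubles as the word-vector table.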

Gensim word2vec on predefined dictionary and word-indices data

Question: I need to train a word2vec representation on tweets using gensim. Unlike most tutorials and code I've seen for gensim, my data is not raw but has already been preprocessed. I have a dictionary in a text document containing 65k words (incl. an "unknown" token and an EOL token), and the tweets are saved as a numpy matrix of indices into this dictionary. A simple example of the data format can be seen below:
dict.txt
    you love this code
tweets (5 is unknown and 6 is EOL)
    [[0, 1, 2, 3, 6], [3, 5, 5
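One way to feed data like this to a trainer that expects tokenised sentences is to map the indices back to strings first. The sketch below assumes that everything from the first EOL index onward is padding and that any index missing from the dictionary counts as unknown; `rows_to_sentences`, the `<unk>` token, and the second sample row are hypothetical.

```python
# The four words shown in dict.txt; the remaining entries are not given.
ID2WORD = {0: "you", 1: "love", 2: "this", 3: "code"}
UNK_ID, EOL_ID = 5, 6


def rows_to_sentences(rows, id2word, unk_id, eol_id, unk_token="<unk>"):
    """Turn rows of word indices into lists of string tokens."""
    sentences = []
    for row in rows:
        tokens = []
        for idx in row:
            if idx == eol_id:
                break  # assumption: nothing meaningful follows EOL
            if idx == unk_id:
                tokens.append(unk_token)
            else:
                # Indices absent from the mapping are treated as unknown.
                tokens.append(id2word.get(idx, unk_token))
        sentences.append(tokens)
    return sentences


# First row taken from the question; second row is a made-up example.
rows = [[0, 1, 2, 3, 6], [1, 5, 6]]
print(rows_to_sentences(rows, ID2WORD, UNK_ID, EOL_ID))
```

The result is a list of token lists, which is the sentence format gensim's Word2Vec accepts. A numpy matrix can be passed as `rows` after calling its `tolist()` method.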

Loss does not decrease during training (Word2Vec, Gensim)

What can cause the loss from model.get_latest_training_loss() to increase on each epoch? Code used for training:
    class EpochSaver(CallbackAny2Vec):
        '''Callback to save model after each epoch and show training parameters '''
        def __init__(self, savedir):
            self.savedir = savedir
            self.epoch = 0
            os.makedirs(self.savedir, exist_ok=True)
        def on_epoch_end(self, model):
            savepath = os.path.join(self.savedir, "model_neg{}_epoch.gz".format(self.epoch))
            model.save(savepath)
            print(
                "Epoch saved: {}".format(self.epoch + 1),
                "Start next epoch ... ",
                sep="\n"
            )
            if os.path.isfile(os.path.join(self.savedir, "model_neg{
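One frequently reported explanation, stated here as an assumption and not as the answer from the original thread, is that the value returned by `get_latest_training_loss()` is a running total over the whole training call, not a per-epoch figure. If so, the raw number rises every epoch even when training is going well, and the per-epoch loss is the difference between consecutive readings. The helper `per_epoch_losses` and the sample numbers are hypothetical.

```python
def per_epoch_losses(cumulative):
    """Convert running-total loss readings into per-epoch losses."""
    losses = []
    previous = 0.0
    for total in cumulative:
        losses.append(total - previous)
        previous = total
    return losses


# Made-up readings taken at the end of three consecutive epochs.
readings = [100.0, 180.0, 240.0]
print(per_epoch_losses(readings))
```

In a callback the same subtraction could be done inside `on_epoch_end` by storing the previous reading on the callback object.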

How does gensim calculate doc2vec paragraph vectors

Question: I am going through this paper http://cs.stanford.edu/~quocle/paragraph_vector.pdf and it states that "The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context. In the experiments, we use concatenation as the method to combine the vectors." How does concatenation or averaging work?
Example (if paragraph 1 contains word1 and word2):
    word1 vector = [0.1,0.2,0.3]
    word2 vector = [0.4,0.5,0.6]
    concat method does paragraph vector = [0.1+0.4,0.2+0.5,0.3+0
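The two combination methods named in the quote differ in output size: concatenation joins the vectors end to end, while averaging takes an element-wise mean. Adding components, as in the asker's example, is an element-wise sum and not a concatenation. The sketch below uses the two word vectors from the question plus a made-up paragraph vector; `concatenate` and `average` are illustrative helpers, not gensim functions.

```python
def concatenate(vectors):
    """Join vectors end to end; the result length is the sum of the lengths."""
    joined = []
    for vector in vectors:
        joined.extend(vector)
    return joined


def average(vectors):
    """Element-wise mean; the result has the same length as each input."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]


paragraph = [0.7, 0.8, 0.9]  # hypothetical paragraph vector
word1 = [0.1, 0.2, 0.3]      # from the question
word2 = [0.4, 0.5, 0.6]      # from the question

print(concatenate([paragraph, word1, word2]))  # length 9
print(average([paragraph, word1, word2]))      # length 3
```

Either result is then used as the input to the layer that predicts the next word. In gensim's Doc2Vec, the `dm_concat` and `dm_mean` parameters select between these behaviours.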
