word2vec

How to use Gensim doc2vec with pre-trained word vectors?

北战南征 submitted on 2019-11-28 03:02:27
I recently came across the doc2vec addition to Gensim. How can I use pre-trained word vectors (e.g. those found on the original word2vec website) with doc2vec? Or does doc2vec get its word vectors from the same sentences it uses for paragraph-vector training? Thanks.

gojomo: Note that the "DBOW" (dm=0) training mode doesn't require or even create word-vectors as part of the training. It merely learns document vectors that are good at predicting each word in turn (much like the word2vec skip-gram training mode). (Before gensim 0.12.0, there was the parameter train_words mentioned in another comment,
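To make the dm=0 point concrete, a minimal sketch of pure-DBOW training; the corpus and tags here are invented for illustration, not taken from the question:

```python
# A minimal sketch of pure-DBOW training (dm=0); corpus and tags are
# invented for illustration.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words=['the', 'cat', 'sat'], tags=['doc0']),
          TaggedDocument(words=['dogs', 'bark', 'loudly'], tags=['doc1'])]

# dm=0 selects DBOW: document vectors are trained to predict each word
# in the document; no input word-vectors are required or created
# (unless dbow_words=1 is also passed).
model = Doc2Vec(corpus, dm=0, vector_size=50, min_count=1, epochs=20)

print(model.dv['doc0'])  # gensim 4.x; older releases use model.docvecs
```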

Using Word2VecModel.transform() inside a map function does not work

落爺英雄遲暮 submitted on 2019-11-28 00:19:11
I have built a Word2Vec model using Spark and saved it as a model. Now I want to use it in another piece of code as an offline model. I have loaded the model and used it to get the vector of a word (e.g. Hello), and it works well. But I need to call it for many words in an RDD using map. When I call model.transform() inside a map function, it throws this error: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver,
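The commonly cited workaround (a sketch, not from the truncated question): extract the plain word-to-vector map from the model on the driver and broadcast that, so executors never touch the SparkContext-backed model. words_rdd is a hypothetical RDD of tokens:

```python
# Sketch of the usual workaround: broadcast the plain word->vector map,
# not the model itself, since the model holds a SparkContext reference
# that cannot be shipped to executors. words_rdd is hypothetical.
from pyspark.mllib.feature import Word2VecModel

model = Word2VecModel.load(sc, "path/to/word2vec_model")  # driver side
bc_vectors = sc.broadcast(dict(model.getVectors()))       # plain dict

def to_vector(word):
    # Runs on executors; touches only the broadcast dict, never the model.
    return bc_vectors.value.get(word)

result = words_rdd.map(to_vector)
```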

[NLP] Thoroughly Understanding BERT

不打扰是莪最后的温柔 submitted on 2019-11-27 23:33:37
It's been a long time since I updated this blog. Sometimes I scribble in a notebook or jot things down in Evernote, but scattered notes make for scattered memories o(╥﹏╥)o. It's more coherent, and easier to look things up later, if I organize them here~

Since Google announced BERT's outstanding performance on 11 NLP tasks at the end of October 2018, BERT (Bidirectional Encoder Representation from Transformers) has become red-hot in the NLP field, and even the broader ML community has heard of it. There are many introductions online, but most contain too little technical content, or are incomplete and half-understood, and the great majority just repeat one another (a gentle complaint here about the diversity of Baidu's search results...). In one sentence: the arrival of BERT completely changed the relationship between pre-trained word vectors and downstream concrete NLP tasks, proposing word vector training as the load-bearing backbone.

Contents:
  Word vector models: comparing word2vec, ELMo, and BERT
  BERT in detail: Masked LM, Transformer, sentence-level
  Transfer strategy: calling interfaces for downstream NLP tasks
  Results: breaking the state-of-the-art record on 11 NLP tasks

1. Word vector models

Here we mainly compare word2vec, ELMo, and BERT side by side, focusing on each model's highlights and on where they differ. Traditionally speaking, a word vector model is a tool: it converts text, which exists abstractly in the real world, into vectors that mathematical formulas can operate on, and operating on those vectors is the real work of NLP. In that sense, an NLP task splits into two parts,

The famous "word analogy" phenomenon may just be sophisticated cheating

岁酱吖の submitted on 2019-11-27 20:48:25
The "word analogy" is one of the most celebrated classic examples in natural language processing. Recently, however, a series of discussions questioning the theoretical basis of the word-analogy phenomenon seems poised to knock this star example off its pedestal. Whatever the outcome, this debate around headline claims and ground truth has drawn a great deal of attention to natural language processing and stoked everyone's research enthusiasm!

Natural language processing (NLP) is one of the important application areas of modern machine learning tools. It involves using digital tools to analyze, interpret, and even generate human (natural) language. At present, the most famous algorithm in NLP is surely Word2Vec; virtually every practitioner in the field knows it (as do many people interested in machine learning who don't work on NLP). Word2Vec has several different implementations and is very easy to use. Many introductory machine learning/AI or NLP courses use it as a teaching example.

One major reason people like it: it seems very intuitive. Word2Vec's fame largely rests on a handful of eye-catching, intuitively constructed examples that are often used to show off its capabilities. Below, we briefly introduce how Word2Vec works: Word2Vec looks at a large amount of text and counts which words frequently appear together with other words. Based on these word co-occurrence statistics, Word2Vec generates an abstract representation for each word, the so-called word embedding. Word embeddings are low-dimensional vectors (think of a list of 200 or 300 numbers). With these word vectors, you can do some "magical" math with words! When we have "king
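The passage is building up to the canonical analogy this article is about (king - man + woman ≈ queen). A sketch of how it is typically demonstrated with gensim; the pre-trained GoogleNews vectors file is an assumption:

```python
# Sketch of the canonical analogy demo discussed in this article; the
# pre-trained GoogleNews vectors file is an assumption.
import gensim

wv = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

# vector('king') - vector('man') + vector('woman') lands closest to 'queen'
print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
```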

How to speed up Gensim Word2vec model load time?

﹥>﹥吖頭↗ submitted on 2019-11-27 20:12:50
I'm building a chatbot, so I need to vectorize the user's input using Word2Vec. I'm using a pre-trained Google model with 3 million words (GoogleNews-vectors-negative300), and I load it using Gensim:

import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

The problem is that it takes about 2 minutes to load the model, and I can't make the user wait that long. So what can I do to speed up the load time? I thought about putting each of the 3 million words and their corresponding vectors into a MongoDB database. That would
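A commonly suggested alternative (a sketch, not from the truncated excerpt): re-save the vectors once in gensim's native format, then memory-map them on every later load, optionally limiting the vocabulary:

```python
# A commonly suggested approach: convert the word2vec-format file to
# gensim's native format once, then memory-map it on each later load.
import gensim

# One-time conversion (slow, but done only once):
model = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)
model.save('GoogleNews-vectors.kv')

# Fast load on every restart: vectors are memory-mapped, read-only.
model = gensim.models.KeyedVectors.load('GoogleNews-vectors.kv', mmap='r')

# Alternatively, load only the most frequent words:
small = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True, limit=500000)
```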

Doc2vec: How to get document vectors

本小妞迷上赌 submitted on 2019-11-27 19:49:37
Question: How do I get the document vectors of two text documents using Doc2vec? I am new to this, so it would be helpful if someone could point me in the right direction or help me with a tutorial. I am using gensim:

doc1 = ["This is a sentence", "This is another sentence"]
documents1 = [doc.strip().split(" ") for doc in doc1]
model = doc2vec.Doc2Vec(documents1, size=100, window=300, min_count=10, workers=4)

I get AttributeError: 'list' object has no attribute 'words' whenever I run this.

Answer 1: If you
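For context, the error arises because Doc2Vec expects TaggedDocument objects (word lists plus tags), not bare token lists. A minimal sketch of the usual fix; parameter and attribute names vary across gensim versions (size vs. vector_size, docvecs vs. dv), and 4.x names are used here:

```python
# A minimal sketch of the usual fix: wrap each token list in a
# TaggedDocument. gensim 4.x names are used below.
from gensim.models import doc2vec

doc1 = ["This is a sentence", "This is another sentence"]
documents1 = [doc2vec.TaggedDocument(words=d.strip().split(" "), tags=[i])
              for i, d in enumerate(doc1)]

# min_count=1 so this toy two-sentence corpus keeps a non-empty vocabulary.
model = doc2vec.Doc2Vec(documents1, vector_size=100, window=5,
                        min_count=1, workers=4)

print(model.dv[0])  # document vector of the first sentence
print(model.dv[1])  # document vector of the second sentence
```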

Building Chinese word vectors with word2vec

若如初见. submitted on 2019-11-27 19:47:23
Word vectors, as a model of the basic building blocks of text (words), have won over researchers in natural language processing with their excellent performance. Good word vectors place semantically similar words close together in the vector space, which is a great convenience for downstream tasks such as text classification and text clustering. This article describes in detail how to build Chinese word vectors with word2vec.

1. Chinese corpus

This article uses the Sogou news corpus from Sogou Labs; data link: http://www.sogou.com/labs/resource/cs.php . The downloaded file is named news_sohusite_xml.full.tar.gz.

2. Data preprocessing

2.1 Unpack and inspect the raw data

cd into the directory containing the original file and run the unpack command:

tar -zvxf news_sohusite_xml.full.tar.gz

This produces the file news_sohusite_xml.dat, which can be opened with vim:

vim news_sohusite_xml.dat

[screenshot of the raw XML omitted]

2.2 Extract the content

Extract the text between the <content> and </content> tags with the following command:

cat news_sohusite_xml.dat | iconv -f gbk -t utf-8 -c | grep "<content>" > corpus.txt

This produces a file named corpus.txt, which can again be opened with vim:

vim corpus.txt

[screenshot of the extracted corpus omitted]

2.3 Word segmentation

Note
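The excerpt cuts off at the segmentation step. A minimal sketch of how this step usually looks; jieba is an assumption here, as the original article may use a different segmenter:

```python
# A minimal sketch of the segmentation step; jieba is an assumption,
# as the original article may use a different segmenter.
import jieba

with open('corpus.txt', encoding='utf-8') as fin, \
     open('corpus_seg.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        # Strip the <content> tags left over from the grep step, then
        # write space-separated tokens, one document per line.
        text = line.replace('<content>', '').replace('</content>', '').strip()
        fout.write(' '.join(jieba.cut(text)) + '\n')
```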

Update gensim word2vec model

只愿长相守 submitted on 2019-11-27 13:07:29
I have a word2vec model in gensim, trained over 98,892 documents. For any given sentence that is not present in the sentences array (i.e. the set over which I trained the model), I need to update the model with that sentence so that querying it next time gives some results. I am doing it like this:

new_sentence = ['moscow', 'weather', 'cold']
model.train(new_sentence)

and it prints this in the logs:

2014-03-01 16:46:58,061 : INFO : training model with 1 workers on 98892 vocabulary and 100 features
2014-03-01 16:46:58,211 : INFO : reached the end of input; waiting to finish 1 outstanding jobs
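For reference, a sketch of how incremental updates work in more recent gensim versions (not what the original post ran): train() expects an iterable of sentences, each a list of tokens, and previously unseen words must first be added to the vocabulary:

```python
# Sketch of incremental updates in recent gensim versions. `model`
# stands in for the already-trained model; a tiny one is built here
# so the snippet runs on its own.
from gensim.models import Word2Vec

model = Word2Vec([['old', 'training', 'sentence']], min_count=1)

new_sentences = [['moscow', 'weather', 'cold']]   # a list of token lists

model.build_vocab(new_sentences, update=True)     # add the unseen words
model.train(new_sentences,
            total_examples=len(new_sentences),
            epochs=model.epochs)                  # model.iter in old gensim
```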

Text similarity with worked examples - notes on semantic analysis algorithms

|▌冷眼眸甩不掉的悲伤 submitted on 2019-11-27 13:05:30
When doing natural language processing, we often run into scenarios where we need to find similar sentences, or near-paraphrases of a sentence; similar sentences then have to be grouped together, and that raises the problem of computing sentence similarity.

Basic methods

Sentence similarity computation can be grouped into the following methods:

Edit distance
Jaccard coefficient
TF
TF-IDF
Word2Vec

Below we walk through the principle and the Python implementation of each of these algorithms.

Edit distance

Edit distance, also known as Levenshtein distance, is the minimum number of edit operations required to transform one string into another; the larger the distance, the more different the two strings are. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character. For example, take the two strings string and setting. If we want to turn string into setting, it takes two steps: first, insert the character e between s and t; second, replace r with t. So their edit distance is 2, which corresponds to the minimum number of changes (insertions, replacements, deletions) needed to transform one into the other.

How can we implement this in Python? We can use the distance library directly:

# edit distance
import distance

def edit_distance(s1, s2):
    return distance.levenshtein(s1, s2)

strings = [
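For completeness, a self-contained version of the truncated snippet; the string/setting pair comes from the walkthrough above, while the final print is an assumption about what followed the cut-off:

```python
# Self-contained version of the truncated snippet; requires
# `pip install distance`.
import distance

def edit_distance(s1, s2):
    return distance.levenshtein(s1, s2)

print(edit_distance('string', 'setting'))  # -> 2, matching the walkthrough
```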

Using pre-trained word2vec with LSTM for word generation

时光毁灭记忆、已成空白 submitted on 2019-11-27 11:03:27
LSTM/RNN can be used for text generation. This shows a way to use pre-trained GloVe word embeddings with a Keras model. How can pre-trained Word2Vec word embeddings be used with a Keras LSTM model? This post did help. How do you predict / generate the next word when the model is given a sequence of words as its input? Sample approach tried:

# Sample code to prepare word2vec word embeddings
import gensim
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system
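A sketch of the usual wiring (not from the truncated post): copy the gensim word2vec weight matrix into a frozen Keras Embedding layer that feeds an LSTM predicting the next word. The stand-in corpus, model_w2v, and the layer sizes are all assumptions; the weights= constructor argument is the classic Keras 2 style:

```python
# Sketch: gensim word2vec weights -> frozen Keras Embedding -> LSTM.
# model_w2v, the corpus, and the layer sizes are assumptions.
from gensim.models import Word2Vec
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.models import Sequential

docs = [["human", "machine", "interface"],
        ["computer", "system", "response"]]
model_w2v = Word2Vec(docs, vector_size=100, min_count=1)  # stand-in model

weights = model_w2v.wv.vectors        # shape: (vocab_size, 100)
vocab_size, dim = weights.shape

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=dim,
              weights=[weights], trainable=False),  # frozen pre-trained vectors
    LSTM(128),
    Dense(vocab_size, activation='softmax'),        # next-word distribution
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
```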