word2vec

Word embeddings for the same word from two different texts

Submitted by 吃可爱长大的小学妹 on 2021-02-19 09:03:08
Question: If I calculate word2vec for the same word (say, "monkey"), one time on the basis of one large text from the year 1800 and another time on the basis of one large text from the year 2000, then the results would not be comparable from my point of view. Am I right? And why is it so? My idea is that the text from the past may have a completely different vocabulary, which is the problem. But how can one then fix this (make the embeddings comparable)? Thanks in advance. Answer 1: There's no "right"
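The answer continues in the original thread; as an illustrative sketch (not taken from that answer), one common way to make two independently trained embedding spaces comparable is to align them on their shared vocabulary with an orthogonal Procrustes rotation. The file names below and the use of gensim's Word2Vec.load are assumptions.

```python
import numpy as np
from gensim.models import Word2Vec

# Hypothetical models trained separately on the 1800 and 2000 corpora.
model_1800 = Word2Vec.load("w2v_1800.model")
model_2000 = Word2Vec.load("w2v_2000.model")

# Use the vocabulary shared by both models as anchor words.
shared = [w for w in model_1800.wv.key_to_index if w in model_2000.wv.key_to_index]
A = np.stack([model_1800.wv[w] for w in shared])
B = np.stack([model_2000.wv[w] for w in shared])

# Orthogonal Procrustes: find the rotation R that best maps space A onto space B.
u, _, vt = np.linalg.svd(A.T @ B)
R = u @ vt

# Now "monkey" from the 1800 space can be compared against the 2000 space.
monkey_1800_aligned = model_1800.wv["monkey"] @ R
monkey_2000 = model_2000.wv["monkey"]
cos = monkey_1800_aligned @ monkey_2000 / (
    np.linalg.norm(monkey_1800_aligned) * np.linalg.norm(monkey_2000))
print(f"cosine similarity across epochs: {cos:.3f}")
```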

Text similarity using Word2Vec

Submitted by 你说的曾经没有我的故事 on 2021-02-19 05:36:05
Question: I would like to use Word2Vec to check the similarity of texts. I am currently using a different approach: from fuzzywuzzy import fuzz def sim(name, dataset): matches = dataset.apply(lambda row: (fuzz.ratio(row['Text'], name) >= 0.5), axis=1) return (name is my column). To apply this function I do the following: df['Sim']=df.apply(lambda row: sim(row['Text'], df), axis=1) Could you please tell me how to replace fuzz.ratio with Word2Vec in order to compare texts in a dataset? Example of dataset:
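One common replacement for fuzz.ratio, sketched below under assumptions not in the question: represent each text as the average of its word vectors and compare texts by cosine similarity. A pre-trained GloVe model loaded through gensim's downloader stands in for whatever Word2Vec model is actually available; the 'Text' column name follows the question.

```python
import numpy as np
import gensim.downloader as api

# Any KeyedVectors object works the same way; this pre-trained model is
# used only as a stand-in for a real Word2Vec model.
wv = api.load("glove-wiki-gigaword-50")

def text_vector(text):
    """Average the vectors of the in-vocabulary tokens of a text."""
    tokens = [t for t in text.lower().split() if t in wv]
    if not tokens:
        return np.zeros(wv.vector_size)
    return np.mean([wv[t] for t in tokens], axis=0)

def sim(name, dataset):
    """Cosine similarity of `name` against every row of dataset['Text']."""
    v = text_vector(name)
    def cosine(row):
        w = text_vector(row["Text"])
        denom = np.linalg.norm(v) * np.linalg.norm(w)
        return float(v @ w / denom) if denom else 0.0
    return dataset.apply(cosine, axis=1)

# As in the question (note this is O(n^2) over the dataset):
# df["Sim"] = df.apply(lambda row: sim(row["Text"], df), axis=1)
```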

Extracting, transforming and selecting features

Submitted by 让人想犯罪 __ on 2021-02-17 17:59:38
This section covers algorithms for working with features, roughly divided into these groups. Extraction: extracting features from data. Transformation: scaling, converting, or modifying features. Selection: selecting the more important features from among many. Locality Sensitive Hashing (LSH): this class of algorithms combines aspects of feature transformation with other algorithms. Table of Contents — Feature Extractors: TF-IDF, Word2Vec, CountVectorizer. Feature Transformers: Tokenizer, StopWordsRemover, n-gram, Binarizer, PCA (principal component analysis), PolynomialExpansion, Discrete Cosine Transform (DCT), StringIndexer (string-to-index), IndexToString (index-to-string), OneHotEncoder, VectorIndexer, Interaction, Normalizer (p-norm normalization)
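Since the table of contents lists Word2Vec among the Feature Extractors, a minimal PySpark sketch of that estimator may help; it follows the pattern of the official Spark ML example, and the SparkSession setup and app name are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.appName("word2vec-example").getOrCreate()

# Each row is a document, represented as a sequence of words.
documentDF = spark.createDataFrame([
    ("Hi I heard about Spark".split(" "),),
    ("I wish Java could use case classes".split(" "),),
    ("Logistic regression models are neat".split(" "),),
], ["text"])

# Map each word to a vector, then average the word vectors of a document.
word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="text", outputCol="result")
model = word2Vec.fit(documentDF)

result = model.transform(documentDF)
for row in result.collect():
    text, vector = row
    print("Text: [%s] => Vector: %s" % (", ".join(text), str(vector)))
```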

How to calculate the distance between 2 node2vec models

Submitted by 北城余情 on 2021-02-11 14:06:43
Question: I have 2 node2vec models from different timestamps and I want to calculate the distance between them. The two models have the same vocab and we update the models. My models look like this: model1: "1":0.1,0.5,... "2":0.3,-0.4,... "3":0.2,0.5,... ... model2: "1":0.15,0.54,... "2":0.24,-0.35,... "3":0.24,0.47,... ... Answer 1: Assuming you've used a standard word2vec library to train your models, each run bootstraps a wholly-separate model whose coordinates are not necessarily comparable to any other
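The answer's caveat matters: independently trained runs would first need an alignment step like the Procrustes rotation sketched earlier on this page. If, as the question says, the second model really is an update of the first (same coordinate space and vocabulary), a per-node cosine comparison like the following sketch can quantify drift. The file names and the use of gensim KeyedVectors are assumptions.

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical saved vectors for the two timestamps.
kv1 = KeyedVectors.load("node2vec_t1.kv")
kv2 = KeyedVectors.load("node2vec_t2.kv")

def per_node_cosine(kv_a, kv_b):
    """Cosine similarity of each shared node's vector across the two models."""
    shared = [n for n in kv_a.key_to_index if n in kv_b.key_to_index]
    sims = {}
    for n in shared:
        a, b = kv_a[n], kv_b[n]
        sims[n] = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sims

sims = per_node_cosine(kv1, kv2)
# One crude "distance between models": the average drift of node vectors.
print("mean per-node cosine:", np.mean(list(sims.values())))
```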

Creating sequence vector from text in Python

Submitted by 风格不统一 on 2021-02-11 11:45:21
Question: I am now trying to prepare the input data for an LSTM-based NN. I have a fairly large number of text documents, and what I want is to make a sequence vector for each document so I can feed them as training data to an LSTM RNN. My poor approach: import re import numpy as np #raw data train_docs = ['this is text number one', 'another text that i have'] #put all docs together train_data = '' for val in train_docs: train_data += ' ' + val tokens = np.unique(re.findall('[a-zа-я0-9]+', train_data.lower()))
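A dependency-free sketch of one way to finish this: build a word-to-index vocabulary, convert each document into a sequence of integer ids, and pad to a fixed length so the result can feed an Embedding + LSTM layer. The regular expression and toy documents come from the question; everything else is illustrative.

```python
import re
import numpy as np

train_docs = ['this is text number one', 'another text that i have']

# Build a vocabulary over all documents; index 0 is reserved for padding.
tokens = sorted({t for doc in train_docs
                 for t in re.findall('[a-zа-я0-9]+', doc.lower())})
word_index = {w: i + 1 for i, w in enumerate(tokens)}

# Turn each document into a sequence of integer ids.
sequences = [[word_index[t] for t in re.findall('[a-zа-я0-9]+', doc.lower())]
             for doc in train_docs]

# Pad to a fixed length so the batch is a rectangular array.
max_len = max(len(s) for s in sequences)
X = np.zeros((len(sequences), max_len), dtype=np.int64)
for i, seq in enumerate(sequences):
    X[i, :len(seq)] = seq

print(X)  # shape (num_docs, max_len), ready for an Embedding + LSTM layer
```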

NLP notes: word vectors and language models

Submitted by 偶尔善良 on 2021-02-11 09:15:34
To turn an NLP problem into a machine-learning problem, the first step is to find a way to mathematize the symbols. There are two common representations. One-hot Representation represents each word as a very long vector whose dimensionality equals the vocabulary size; almost all elements are 0 and only one dimension is 1, and that dimension identifies the word, e.g. [0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0]. This representation creates a "lexical gap": it cannot reflect semantic relations between words, because any two words are orthogonal; moreover, its dimensionality is very high. Distributed Representation is a low-dimensional real-valued vector, commonly 50 or 100 dimensions, and the representation is not unique, e.g. [0.792, −0.177, −0.107, 0.109, −0.542, ...]. Its biggest contribution is that related or similar words end up closer to each other; the distance between vectors can be measured with the classic Euclidean distance or with the cosine of the angle. If words are represented with the traditional sparse representation, some tasks (such as building a language model) run into the curse of dimensionality, whereas low-dimensional word vectors avoid this problem. In practice, high-dimensional features are too complex to use with Deep Learning, so low-dimensional word vectors are used far more often. Moreover, similar words have nearby word vectors, which gives models built on word vectors a built-in smoothing effect. word2vec is
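A tiny numpy illustration of the contrast described above: one-hot vectors of distinct words are always orthogonal, while low-dimensional distributed vectors (the values here are made up) let cosine similarity reflect relatedness.

```python
import numpy as np

vocab = ["beautiful", "pretty", "france", "paris"]

# One-hot: every pair of distinct words is orthogonal, so no semantic
# relation can be read off the vectors.
one_hot = np.eye(len(vocab))
print(one_hot[0] @ one_hot[1])  # 0.0 for "beautiful" vs "pretty"

# Distributed representation (illustrative values): related words are close.
dense = {
    "beautiful": np.array([0.79, -0.18, -0.11]),
    "pretty":    np.array([0.75, -0.20, -0.05]),
    "france":    np.array([-0.40, 0.60, 0.30]),
}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(dense["beautiful"], dense["pretty"]))  # high
print(cos(dense["beautiful"], dense["france"]))  # much lower
```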

Natural language processing – Word2vec (Part 2)

Submitted by 放肆的年华 on 2021-02-11 09:15:01
The previous post, word2vec (Part 1), mainly covered surface-level concepts of word2vec and introduced the CBOW method for learning the word-vector model. This post focuses on the skip-gram model from the paper Distributed Representations of Words and Phrases and their Compositionality, which can be viewed as a probabilistic method. An earlier post covered the term-frequency approach, TF-IDF, which can usually only find the relatively important words in an article, i.e., topic words, but cannot capture word semantics: for example, the synonyms 漂亮 and 美丽 ("pretty" and "beautiful") mean roughly the same thing and should be close to each other, and Paris is to France as Beijing is to China. Given a sentence, how do we infer the middle word from its context, or predict the typical context of a given word? These two ways of thinking correspond exactly to the two Word2vec models, the CBOW model and the Skip-gram model. Word vectors map words from text space to vector space; every word has a corresponding vector representing its semantics. We could obtain vectors with the traditional N-gram (statistical) method: for each word, a vector can be derived from word frequencies and its contexts, and it carries some semantics, but the drawback is that the more words the corpus contains, the larger the model becomes; with N words the model has N^2 parameters, which wastes a lot of memory when N is large, and many words are inherently unrelated, so many positions are 0
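A minimal gensim sketch of the two training modes mentioned above, CBOW and skip-gram; the toy sentences are made up, and the hyperparameter names follow gensim 4.x.

```python
from gensim.models import Word2Vec

# A toy corpus; in practice this would be a large tokenized text collection.
sentences = [
    ["paris", "is", "the", "capital", "of", "france"],
    ["beijing", "is", "the", "capital", "of", "china"],
    ["pretty", "and", "beautiful", "are", "synonyms"],
]

# sg=1 selects the skip-gram model (predict the context from the centre word);
# sg=0 (the default) would select CBOW (predict the centre word from its context).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv.most_similar("paris", topn=3))
```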

NLP word-vector models: word2vec

Submitted by 杀马特。学长 韩版系。学妹 on 2021-02-11 08:30:39
Natural language processing and deep learning. Language models: the N-gram model: in natural language there is a model called n-gram, which denotes a sequence of n consecutive words in text or speech. When analyzing natural language, using n-grams or looking for common phrases makes it easy to break a sentence into several text fragments. Word vectors: neural network model: note that the initial vectors can simply be randomly initialized. A traditional neural network only needs to optimize the parameters between the input layer and the hidden layer, and between the hidden layer and the output layer. The advantages of the neural network model: on the one hand it captures approximate meanings of words, and on the other hand the learned space follows real logical regularities. The CBOW training objective: prerequisites: the weighted path length of a tree is defined as the sum of the weighted path lengths of all its leaf nodes, denoted WPL. The design idea behind hierarchical softmax: words with high frequency should be placed as close to the front as possible, which can be achieved with a Huffman tree. A detailed explanation of Huffman trees, including construction and coding: https://blog.csdn.net/shuangde800/article/details/7341289 Hierarchical Softmax uses a Huffman tree to build many binary classifications. Negative sampling model: Source: oschina Link: https://my.oschina.net/u/4396372/blog/3912941
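A hedged gensim sketch contrasting the two output-layer strategies described above, hierarchical softmax (a Huffman tree of binary classifiers) and negative sampling; the toy corpus and parameter values are illustrative only.

```python
from gensim.models import Word2Vec

sentences = [["the", "quick", "brown", "fox"],
             ["jumps", "over", "the", "lazy", "dog"]]

# Hierarchical softmax: hs=1, negative=0 -> the output layer is a Huffman
# tree of binary classifiers, so frequent words get short codes.
model_hs = Word2Vec(sentences, vector_size=50, min_count=1, hs=1, negative=0)

# Negative sampling: hs=0, negative=k -> each positive (word, context) pair
# is contrasted with k randomly drawn "noise" words instead of a full softmax.
model_neg = Word2Vec(sentences, vector_size=50, min_count=1, hs=0, negative=5)
```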

Word2vec - get rank of similarity

Submitted by 跟風遠走 on 2021-02-10 12:58:41
Question: Given that I have a word2vec model (via gensim), I want to get the rank similarity between two words. For example, let's say I have the word "desk" and the most similar words to "desk" are: table 0.64, chair 0.61, book 0.59, pencil 0.52. I want to create a function such that: f(desk, book) = 3, since book is the 3rd most similar word to desk. Does such a function exist? What is the most efficient way to do this? Answer 1: You can use rank(entity1, entity2) to get the rank - same as the index. model.wv.rank(sample
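A short sketch of the gensim call the answer points at: KeyedVectors.rank(key1, key2) returns the 1-based position of key2 among the words most similar to key1. The model path below is hypothetical.

```python
from gensim.models import Word2Vec

model = Word2Vec.load("my_model.model")  # hypothetical path

# rank(w1, w2): position of w2 among the words most similar to w1 (1-based).
def f(w1, w2):
    return model.wv.rank(w1, w2)

print(f("desk", "book"))  # 3 for the similarity list in the example above
```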
