nlp

NLP Notes (word embedding)

Submitted by 喜夏-厌秋 on 2020-01-28 16:18:02
Contents: word embedding; language representation; language models; distributed word representations; word2vec; why earlier word-embedding methods still matter today; limitations of Word2Vec and similar methods. These are notes on some basic NLP concepts and knowledge. Word embedding / language representation: language representation studies how to turn natural-language text into data that algorithms and models can process. The most widely used family of methods today is the word-level "distributed representation". The distributional hypothesis proposed by Harris in 1954 provides the theoretical basis for this idea: words that appear in similar contexts have similar meanings. Firth elaborated and sharpened the hypothesis in 1957: a word's meaning is determined by its context ("a word is characterized by the company it keeps"). Language models: common language models include the N-gram model and its special cases, the unigram, bigram, and trigram models. Distributed word representations: these methods generally fall into three classes: matrix-based, clustering-based, and neural-network-based distributed representations. The commonly seen Global Vector model (GloVe) obtains word representations by factorizing a word-word co-occurrence matrix, and belongs to
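As a minimal illustration of the neural-network-based distributed representations mentioned above, the sketch below trains a toy Word2Vec model with gensim (4.x API); the corpus and hyperparameters are made up for the example and are not part of the original note.

# Toy sketch: learning distributed word representations with gensim's Word2Vec.
# The corpus and hyperparameters here are illustrative, not from the note above.
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "animals"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # dimensionality of the word vectors
    window=2,         # context window size (the "company" a word keeps)
    min_count=1,      # keep every word in this tiny corpus
    sg=1,             # 1 = skip-gram, 0 = CBOW
)

print(model.wv["cat"])                # 50-dimensional vector for "cat"
print(model.wv.most_similar("cat"))   # nearest neighbours by cosine similarity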

Google's adversarial training of neural networks has been patented; what does BERT actually attend to?... #20200115

Submitted by 大兔子大兔子 on 2020-01-28 13:37:23
Preface: this series shares interesting frontier news and open-source work from the community that the author comes across each day; feel free to share. Follow the WeChat public account AISphere. Contents: Google's adversarial training of neural networks has been patented; What does BERT actually attend to? Stanford's analysis of BERT attention; New open-source release - Cortex v0.12: machine-learning infrastructure for developers. Google's adversarial training of neural networks has been patented: Christian Szegedy and Ian Goodfellow recently obtained a patent on adversarial training of neural networks (US patent 10521718). The news sparked a heated discussion on reddit: the original poster asked whether this would make companies and academics that use adversarial training more dependent on Google, but there is no clear answer yet. Some of the more level-headed comments are worth quoting, for example this one from user ReginaldIII: as with every time this happens and every time it gets posted to this subreddit, it is pointless. Every time the comments argue back and forth over good/bad, ethical/unethical, well-intentioned and so on, but in the end these are just people's opinions. And every time we leave the thread in the same state, because it has no effect on the work we do or will keep doing. I don't like it (the patent system), but I understand why others think it is necessary. The patent system and its associated laws and intellectual property have been abused, carved out over years of lobbying interests and money. Most people patent these concepts defensively, to cope with the system's major flaws. And this one from rhiyo:

Machine Learning Datasets

Submitted by 假装没事ソ on 2020-01-26 02:52:34
Image classification: 1) MNIST — the classic small (28x28 pixel) grayscale handwritten-digit dataset, developed in the 1990s mainly to test the most sophisticated models of the time; today MNIST serves more as an introductory textbook for deep learning. The fast.ai version of the dataset drops the original special binary format in favor of standard PNG files, so it fits into the normal workflow of most current codebases; if you only want the single input channel of the original, simply take a single slice along the channel axis. Citation: http://yann.lecun.com/exdb/publis/index.html#lecun-98 Download: https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz 2) CIFAR10 — 10 classes, up to 60,000 32x32 pixel color images (50,000 training images and 10,000 test images), about 6,000 images per class. Widely used to benchmark new algorithms. The fast.ai version likewise drops the original special binary format in favor of standard PNG files for use in the normal workflow of most current codebases. Citation: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf Download: https://s3.amazonaws.com/fast-ai
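As an illustration of the single-channel remark above, the sketch below loads one PNG from an extracted fast.ai MNIST archive and keeps a single channel; the file path is a hypothetical example, not a real entry from the archive.

# Illustrative sketch: read a PNG from the extracted mnist_png archive and keep
# one channel, as suggested above. The path is a made-up example.
import numpy as np
from PIL import Image

img = Image.open("mnist_png/training/7/1234.png")   # hypothetical file path
arr = np.asarray(img)

# If the image decodes with a channel axis (H, W, C), slice out one channel;
# otherwise it is already single-channel grayscale.
if arr.ndim == 3:
    arr = arr[:, :, 0]

print(arr.shape)   # e.g. (28, 28)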

Common Acronyms in NLP

Submitted by 岁酱吖の on 2020-01-26 00:57:43
Common acronyms in natural language processing and computational linguistics. Reference link: https://www.gooseeker.com/cn/node/knowledgebase/nlp_acronym Common acronyms in natural language processing (NLP) and computational linguistics (CL): ACL = Association for Computational Linguistics; AFNLP = Asian Federation of Natural Language Processing; AI = Artificial Intelligence; ALPAC = Automated Language Processing Advisory Committee; ASR = Automatic Speech Recognition; CAT = Computer Assisted/Aided Translation; CBC = Clustering by Committee; CCG = Combinatory Categorial Grammar; CICLing = International Conference

How to train the self-attention model?

Submitted by 前提是你 on 2020-01-25 09:25:08
Question: I understand the whole structure of the transformer as in the figure below, but one thing that confuses me is the bottom of the decoder, which takes the right-shifted outputs as its input. For example, when training the model on a pair of sentences in two languages, let's say the input is the sentence "I love you" and the corresponding French is "je t'aime". How does the model train? So the input to the encoder is "I love you"; for the decoder, there are two things, one is "je t'aime", which should be
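The right-shifting the question refers to is usually implemented as teacher forcing: the decoder is fed the target sequence shifted one position to the right, starting with a start-of-sequence token, and is trained to predict the unshifted target. A minimal sketch (the token strings and special symbols are made up for illustration):

# Minimal sketch of preparing encoder/decoder inputs for teacher forcing.
# Token strings and special symbols are illustrative, not from the question.
src_tokens = ["I", "love", "you"]
tgt_tokens = ["je", "t'", "aime"]

BOS, EOS = "<bos>", "<eos>"

encoder_input  = src_tokens                # ["I", "love", "you"]
decoder_input  = [BOS] + tgt_tokens        # ["<bos>", "je", "t'", "aime"]  (shifted right)
decoder_target = tgt_tokens + [EOS]        # ["je", "t'", "aime", "<eos>"]

# At training time the decoder sees decoder_input (with a causal mask so position i
# cannot attend to later positions) and is trained to predict decoder_target.
for inp, tgt in zip(decoder_input, decoder_target):
    print(f"given {inp!r:8} predict {tgt!r}")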

Similarity between two lists of documents

Submitted by 老子叫甜甜 on 2020-01-25 08:57:06
Question: I need to find the similarity between two lists of short texts in Python. The texts can be 1-4 words long, and each list can contain about 10K entries. I didn't find a way to do this efficiently in spaCy. Maybe other packages can do it? I assume the words are represented by a vector (300d), but any other option is also OK. This task can be done in a loop, but there should be a more efficient way for sure. The task fits TensorFlow, PyTorch, and similar packages, but I'm not familiar with
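One common way to avoid the explicit loop is to stack the text vectors and compute all pairwise cosine similarities with a single matrix product. A rough sketch, assuming a spaCy model with word vectors (e.g. en_core_web_md) is installed and using made-up example texts:

# Sketch: pairwise cosine similarity between two lists of short texts.
# Assumes a spaCy model with vectors (e.g. "en_core_web_md") is installed.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")

list_a = ["machine learning", "natural language processing"]
list_b = ["deep learning", "computer vision", "text mining"]

# doc.vector is the average of the token vectors (300-d for en_core_web_md).
A = np.array([nlp(t).vector for t in list_a])   # shape (len(list_a), 300)
B = np.array([nlp(t).vector for t in list_b])   # shape (len(list_b), 300)

# Normalize rows, then one matrix product yields every pairwise cosine similarity.
A /= np.linalg.norm(A, axis=1, keepdims=True)
B /= np.linalg.norm(B, axis=1, keepdims=True)
sim = A @ B.T                                   # shape (len(list_a), len(list_b))

print(sim)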

model.fit() Keras Classification Multiple Inputs-Single Output gives error: AttributeError: 'NoneType' object has no attribute 'fit'

Submitted by 痞子三分冷 on 2020-01-25 01:09:10
Question: I am constructing a Keras classification model with multiple inputs (3, actually) to predict one single output. Specifically, my 3 inputs are: actors, plot summary, relevant movie features. Output: genre tags. All the above inputs and the single output relate to 10,000 IMDB movies. Even though the model is created successfully, when I try to fit it on my three different X_train's I get an AttributeError. I have one X_train and X_test for actors, a different X_train and X
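For context, a three-input functional model is normally fit on a list (or dict) of arrays, one per named input; the shapes, layer sizes, and random data below are made up for illustration. A common cause of the 'NoneType' error in the title is that the variable being fit is actually None, for example when the return value of compile() (which is None) is assigned back to the model variable.

# Sketch of a three-input Keras classifier and how to call fit() on it.
# Shapes, sizes and data are made up for illustration.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_samples, n_genres = 1000, 20

actors_in = keras.Input(shape=(100,), name="actors")
plot_in   = keras.Input(shape=(300,), name="plot")
feats_in  = keras.Input(shape=(10,),  name="features")

x = layers.concatenate([actors_in, plot_in, feats_in])
x = layers.Dense(128, activation="relu")(x)
out = layers.Dense(n_genres, activation="sigmoid")(x)   # multi-label genre tags

model = keras.Model(inputs=[actors_in, plot_in, feats_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy")
# Note: compile() returns None, so writing `model = model.compile(...)` would
# leave `model` as None and raise "'NoneType' object has no attribute 'fit'".

X_actors = np.random.rand(n_samples, 100)
X_plot   = np.random.rand(n_samples, 300)
X_feats  = np.random.rand(n_samples, 10)
y        = np.random.randint(0, 2, size=(n_samples, n_genres))

model.fit([X_actors, X_plot, X_feats], y, epochs=1, batch_size=32)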

how to choose parameters in TfidfVectorizer in sklearn during unsupervised clustering

Submitted by 回眸只為那壹抹淺笑 on 2020-01-24 20:52:14
Question: TfidfVectorizer provides an easy way to encode and transform texts into vectors. My question is how to choose proper values for parameters such as min_df, max_features, smooth_idf, and sublinear_tf. Update: maybe I should have put more detail in the question: what if I am doing unsupervised clustering with a bunch of texts, I don't have any labels for the texts, and I don't know how many clusters there might be (which is actually what I am trying to figure out)? Answer 1: If you are, for instance,
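For reference, a typical starting point for the unsupervised setting described in the update looks like the sketch below; the parameter values and toy texts are illustrative defaults, not a tuned recommendation, and with no labels an internal criterion such as the silhouette score is one way to compare cluster counts.

# Sketch: TF-IDF + KMeans for unsupervised clustering of unlabeled texts.
# Parameter values are illustrative starting points, not tuned recommendations.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

texts = ["the cat sat on the mat", "dogs are friendly", "cats and dogs", "stock prices fell"]

vectorizer = TfidfVectorizer(
    min_df=1,            # drop terms seen in fewer than this many documents
    max_features=10000,  # cap vocabulary size to control dimensionality
    sublinear_tf=True,   # use 1 + log(tf) to dampen very frequent terms
    smooth_idf=True,     # add-one smoothing so rare terms don't blow up idf
)
X = vectorizer.fit_transform(texts)

# Compare a few cluster counts with an internal criterion (no labels needed).
for k in range(2, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))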

How to detokenize spacy text without doc context?

Submitted by 夙愿已清 on 2020-01-24 18:03:08
Question: I have a sequence-to-sequence model trained on tokens produced by spaCy's tokenizer, for both the encoder and the decoder. The output is a stream of tokens from the seq2seq model, and I want to detokenize it back into natural text. Example: input to the seq2seq model: "Some text"; output from the seq2seq model: "This does n't work ." Is there any API in spaCy to reverse the tokenization done by the rules in its tokenizer? Answer 1: TL;DR I've written code that attempts to do it; the snippet is below. Another approach, with a
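The answer's own snippet is cut off in this excerpt; as a rough illustration of the general idea (not the answerer's actual code), a simple rule-based detokenizer can re-attach punctuation and common English contractions:

# Rough rule-based detokenizer sketch (not the snippet from the answer above):
# join tokens with spaces, then pull punctuation and contractions back in.
import re

def detokenize(tokens):
    text = " ".join(tokens)
    text = re.sub(r" (n't|'s|'re|'ve|'ll|'d|'m)\b", r"\1", text)  # contractions
    text = re.sub(r" ([.,!?;:%)\]])", r"\1", text)                # closing punctuation
    text = re.sub(r"([(\[$]) ", r"\1", text)                      # opening punctuation
    return text

print(detokenize(["This", "does", "n't", "work", "."]))  # -> "This doesn't work."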

What is used to train a self-attention mechanism?

Submitted by 徘徊边缘 on 2020-01-24 16:14:27
Question: I've been trying to understand self-attention, but everything I've found doesn't explain the concept very well at a high level. Let's say we use self-attention in an NLP task, so our input is a sentence. Self-attention can then be used to measure how "important" each word in the sentence is for every other word. The problem is that I don't understand how that "importance" is measured. Important for what? What exactly is the goal vector the weights in the self-attention algorithm are trained
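For context, the "importance" scores in self-attention fall out of a softmax over query-key dot products. A minimal NumPy sketch of scaled dot-product attention, with random toy numbers standing in for the learned projections of the input:

# Minimal scaled dot-product self-attention sketch with toy numbers.
# In a real model Q, K, V come from learned linear projections of the input;
# here they are random just to show where the attention weights come from.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8                    # 4 "words", 8-dimensional queries/keys/values

Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

scores = Q @ K.T / np.sqrt(d_k)        # raw compatibility of each word with every other word

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

weights = softmax(scores)              # rows sum to 1: the "importance" of every word for each word
output = weights @ V                   # each position becomes a weighted mix of the value vectors

print(weights.round(2))
print(output.shape)                    # (4, 8)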