nlp

NLTK was unable to find the gs file

Submitted by 谁说胖子不能爱 on 2020-01-22 10:38:25
Question: I'm trying to use NLTK, the Natural Language Toolkit. After installing the required files, I start to execute the demo code from http://www.nltk.org/index.html:

>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
>>> tagged = nltk.pos_tag(tokens)
>>> tagged[0:6]
[('At', 'IN'),
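As an aside, the tokenize-and-tag part of that demo runs fine as a plain script once the required NLTK data models are installed; the gs (Ghostscript) binary is typically only looked up later, when NLTK renders parse trees as images. A minimal sketch (the download calls fetch the models if they are missing):

import nltk

# Fetch the data models this demo needs, if they are not already installed.
nltk.download('punkt')                       # sentence/word tokenizer
nltk.download('averaged_perceptron_tagger')  # POS tagger

sentence = ("At eight o'clock on Thursday morning "
            "Arthur didn't feel very good.")
tokens = nltk.word_tokenize(sentence)  # ['At', 'eight', "o'clock", ...]
tagged = nltk.pos_tag(tokens)          # [('At', 'IN'), ('eight', 'CD'), ...]
print(tagged[:6])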

Naive Bayesian for Topic detection using “Bag of Words” approach

Submitted by 那年仲夏 on 2020-01-22 04:25:29
Question: I am trying to implement a naive Bayesian approach to find the topic of a given document or stream of words. Is there a Naive Bayesian approach that I might be able to look up for this? Also, I am trying to improve my dictionary as I go along. Initially, I have a bunch of words that map to topics (hard-coded). Depending on the occurrence of words other than the ones that are already mapped, and depending on the occurrences of these words, I want to add them to the mappings, hence
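A minimal sketch of the standard bag-of-words Naive Bayes setup, using scikit-learn; the topics and training sentences below are made-up placeholders:

# Bag-of-words topic detection with multinomial Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["the team won the match", "stocks fell on the exchange",
              "the striker scored a goal", "the market rallied today"]
train_topics = ["sports", "finance", "sports", "finance"]

vectorizer = CountVectorizer()            # raw word counts = bag of words
X = vectorizer.fit_transform(train_docs)
clf = MultinomialNB().fit(X, train_topics)

new_doc = vectorizer.transform(["the goal decided the match"])
print(clf.predict(new_doc))              # likely ['sports']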

NLP Tool Libraries

Submitted by 为君一笑 on 2020-01-21 14:59:44
NLPIR   http://www.nlpir.org/
HanLP   https://github.com/hankcs
Apache OpenNLP   https://opennlp.apache.org/
Apache UIMA   http://uima.apache.org/
LingPipe   LingPipe is an open-source Java toolkit for natural language processing. It already offers a rich set of features, including topic classification, named entity recognition, part-of-speech tagging, sentence detection, query spell checking, interesting phrase detection, clustering, character language modeling, MEDLINE download/parsing/indexing, database text mining, Chinese word segmentation, sentiment analysis, language identification

How to determine the (natural) language of a document?

Submitted by 北慕城南 on 2020-01-20 14:21:05
Question: I have a set of documents in two languages: English and German. There is no usable meta information about these documents; a program can look at the content only. Based on that, the program has to decide which of the two languages the document is written in. Is there any "standard" algorithm for this problem that can be implemented in a few hours' time? Or alternatively, a free .NET library or toolkit that can do this? I know about LingPipe, but it is Java and not free for "semi-commercial" usage
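The classic quick answer is stopword counting: score each document by how many common English versus German function words it contains. A minimal sketch in Python (the stopword sets here are tiny illustrative samples, not complete lists):

# Classify a document as English or German by counting common
# function words of each language.
EN_WORDS = {"the", "and", "of", "to", "is", "in", "that", "it", "with"}
DE_WORDS = {"der", "die", "das", "und", "ist", "nicht", "mit", "ein", "zu"}

def detect_language(text):
    words = text.lower().split()
    en_hits = sum(w in EN_WORDS for w in words)
    de_hits = sum(w in DE_WORDS for w in words)
    return "English" if en_hits >= de_hits else "German"

print(detect_language("Das ist ein Test und das ist nicht Englisch"))  # German
print(detect_language("This is a test and it is written in English"))  # English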

Updating the feature names into scikit TFIdfVectorizer

Submitted by 蓝咒 on 2020-01-20 06:03:56
Question: I am trying out this code:

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

train_data = ["football is the sport", "gravity is the movie", "education is imporatant"]
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')

print "Applying first train data"
X_train = vectorizer.fit_transform(train_data)
print vectorizer.get_feature_names()

print "\n\nApplying second train data"
train_data = ["cricket", "Transformers is a film", "AIMS is a
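For what it's worth, fit_transform rebuilds the vocabulary from scratch on each call, so the first batch's feature names are simply replaced by the second's. If the goal is a vocabulary that grows, one option is to refit on the accumulated corpus. A minimal sketch (get_feature_names_out is the accessor in recent scikit-learn releases; older versions named it get_feature_names):

# Accumulate documents and refit so the vocabulary covers everything seen so far.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')

corpus = ["football is the sport", "gravity is the movie"]
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())   # features from the first batch only

corpus += ["cricket", "Transformers is a film"]  # a new batch arrives
X = vectorizer.fit_transform(corpus)             # refit on the full corpus
print(vectorizer.get_feature_names_out())   # now includes 'cricket', 'film', ...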

Determine if text is in English?

Submitted by 假如想象 on 2020-01-20 04:37:39
Question: I am using both NLTK and scikit-learn to do some text processing. However, within my list of documents I have some documents that are not in English. For example, the following could be true: [ "this is some text written in English", "this is some more text written in English", "Ce n'est pas en anglais" ] For the purposes of my analysis, I want all sentences that are not in English to be removed as part of pre-processing. However, is there a good way to do this? I have been Googling, but
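One common way to do this is the langdetect package, a Python port of Google's language-detection library. A minimal sketch, assuming langdetect is installed (pip install langdetect):

# Keep only the documents detected as English.
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # langdetect is randomized; fix the seed for stable results

docs = [
    "this is some text written in English",
    "this is some more text written in English",
    "Ce n'est pas en anglais",
]

def is_english(text):
    try:
        return detect(text) == "en"
    except Exception:  # detect() raises on empty or undetectable input
        return False

english_docs = [d for d in docs if is_english(d)]
print(english_docs)  # the French sentence is filtered out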

NLP Semantic Similarity Computation: Notes and Summary

Submitted by 不问归期 on 2020-01-18 14:51:16
Work in progress. Last updated: 2019-12-03 18:29:52

Preface: I am a student who likes this field. I write to record what I have learned, organize my own thinking, and share it with others working in the same direction. Please point out anything that is not written professionally enough, and I welcome anyone interested to discuss and improve together. (References are in part four; I will remove anything upon request if it infringes.)

I. Background
II. Basic concepts
III. Methods for computing semantic similarity
IV. References

I. Background

Many NLP tasks involve computing semantic similarity, for example: in search scenarios (dialogue systems, question answering, inference, and so on), the semantic similarity between a query and a document; in feed scenarios, the semantic similarity between one document and another; and semantic similarity also comes up in all kinds of classification tasks and in translation. So while studying, I hope to organize the methods in this area more systematically.

II. Basic concepts

1. TF. Term frequency is how often a keyword occurs in an article: if a keyword occurs N times in an article of M words, then TF = N / M is that keyword's term frequency in the article.

2. IDF. Inverse document frequency is an index used to weight keywords, computed by the formula IDF = log(D / Dw), where D is the total number of articles and Dw is the number of articles in which the keyword appears.

3. Vector space model. The vector space model, abbreviated VSM (Vector Space Model), treats a text as composed of a collection of mutually independent terms; if document D contains terms t1, t2,
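A small from-scratch sketch of these two definitions on a hypothetical toy corpus (natural log is used for IDF, one of several common conventions):

# Compute TF and IDF exactly as defined above:
#   TF  = N / M        (keyword occurrences / words in the article)
#   IDF = log(D / Dw)  (total articles / articles containing the keyword)
import math

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    dw = sum(term in doc for doc in docs)  # articles containing the term
    return math.log(len(docs) / dw)        # assumes dw > 0

print(tf("cat", corpus[0]) * idf("cat", corpus))  # TF-IDF weight of 'cat' in article 0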

Original: Semantic Similarity (Theory)

Submitted by 落爺英雄遲暮 on 2020-01-18 14:50:51
If anything in this article is wrong, corrections are welcome! Author: 佟学强. Opening remarks: understanding of anything generally passes through three stages: (1) you see a mountain as a mountain and water as water; (2) you see the mountain as not a mountain and the water as not water; (3) you again see the mountain as a mountain and the water as water. Understanding of AI and NLP goes through the same three stages. For example, fresh master's graduates, or people one or two years out of school, tend to be enthusiastic about GANs and seq2seq, as are some companies just getting started in NLP. This group's understanding of NLP is at the first stage: there are many pitfalls ahead of them, and a price to pay in trial and error, the price of growing. Once they have accumulated some experience and a solid grounding in NLP, they gradually revise their research direction. At this point there are more doubts than in the first stage, because deeper study reveals that NLP works very differently from image processing and its methods cannot simply be copied over; cognitive intelligence turns out not to be so easy. The leap from perceptual intelligence to cognitive intelligence is the big advance of this second stage, where the different schools argue with each other: the mountain no longer looks like a mountain, the water no longer like water. The highest stage is a return to simplicity: researchers with twenty or more years in the field see NLP relatively clearly. Current AI is largely stuck in the whirlpool of statistical modeling and probability, and is not yet real intelligence. Merely mining linear relationships from data is far from enough; machines should be given cognitive ability and should mine causal relationships. To push NLP toward cognitive intelligence, we should invest more in knowledge graphs, including embedding them as vectors, fusing them with deep learning, having neural networks learn rules, and so on. It is fair to say that the leap from perceptual to cognitive intelligence has only just begun, and the revival of knowledge engineering is unstoppable. I have met many beginners, and most of them are quite fanatical about seq2seq and GANs

What are Robotics, CV, ASR, TTS, NLP, KG, and CG?

Submitted by 痴心易碎 on 2020-01-18 11:50:17
A body of its own (Robotics), its own vision (CV) and hearing (ASR), the ability to speak and sing (TTS), the ability to answer natural-language questions (NLP), a consciousness of its own (KG), and an appearance and movements (CG). So many abbreviations; written this grandly, who is supposed to know what they mean? Source: CSDN. Author: 再见沉沦小小施. Link: https://blog.csdn.net/Lee_Shi/article/details/104027459