nlp

Is there an algorithm that tells the semantic similarity of two phrases

二次信任 submitted on 2020-01-18 05:46:52
Question: Input: phrase 1, phrase 2. Output: a semantic similarity value (between 0 and 1), or the probability that these two phrases are talking about the same thing.

Answer 1: You might want to check out this paper: Sentence similarity based on semantic nets and corpus statistics (PDF). I've implemented the algorithm described. Our context was very general (effectively any two English sentences), and we found the approach taken was too slow and the results, while promising, not good enough (or likely to be so
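The semantic-net approach in the paper is fairly heavy. As a much simpler baseline (not the paper's method), a bag-of-words cosine similarity already yields a score in [0, 1]; the function names below are illustrative, not from any library:

```python
import re
from collections import Counter
from math import sqrt

def tokenize(phrase):
    """Lowercase and split a phrase into word tokens."""
    return re.findall(r"[a-z']+", phrase.lower())

def cosine_similarity(p1, p2):
    """Bag-of-words cosine similarity between two phrases, in [0, 1]."""
    c1, c2 = Counter(tokenize(p1)), Counter(tokenize(p2))
    dot = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
    norm = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0
```

This ignores word meaning entirely (so "car" and "automobile" score 0); replacing exact-match overlap with WordNet or embedding similarity is where approaches like the paper's come in.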

Need to split #tags to text

偶尔善良 submitted on 2020-01-17 13:46:11
Question: I need to split #tags into meaningful words in an automated way.

Sample input: iloveusa mycrushlike mydadhero
Sample output: i love usa my crush like my dad hero

Any utility or open API that I can use to achieve this?

Answer 1: Check the Word Segmentation Task from Norvig's work.

from __future__ import division
from collections import Counter
import re, nltk

WORDS = nltk.corpus.brown.words()
COUNTS = Counter(WORDS)

def pdist(counter):
    "Make a probability distribution, given evidence from a Counter."
    N
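The snippet above is cut off. Here is a self-contained sketch of the same Norvig-style segmentation idea, with a tiny hard-coded unigram table standing in for the Brown corpus (the counts and the smoothing constant are made up for illustration):

```python
from functools import lru_cache

# Toy unigram counts standing in for nltk's Brown corpus.
COUNTS = {"i": 50, "love": 20, "usa": 5, "my": 40, "crush": 3,
          "like": 25, "dad": 8, "hero": 6, "us": 10, "a": 60}
TOTAL = sum(COUNTS.values())

def pword(word):
    """Unigram probability; unseen words are heavily penalized by length."""
    return COUNTS[word] / TOTAL if word in COUNTS else 1e-10 / 10 ** len(word)

def score(words):
    """Probability of a candidate segmentation (product of word probabilities)."""
    p = 1.0
    for w in words:
        p *= pword(w)
    return p

@lru_cache(maxsize=None)
def segment(text):
    """Return the most probable split of `text` into words."""
    if not text:
        return ()
    return max(((text[:i],) + segment(text[i:]) for i in range(1, len(text) + 1)),
               key=score)
```

With a real corpus behind COUNTS, `segment("iloveusa")` recovers `("i", "love", "usa")`; the memoized recursion keeps it O(n²) in the candidate splits considered.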

A small crawler example: related-word search

北城余情 submitted on 2020-01-17 13:01:48
Data source: http://ictclas.nlpir.org/nlpir/ (a very impressive site with a great many language-processing features, such as word segmentation with part-of-speech tagging, sentiment analysis, and related words). The "segmentation and tagging" feature splits the text you enter into individual words and labels each word's part of speech; the "sentiment analysis" feature measures what proportion of your text expresses emotions such as joy, disgust, anger, and sorrow...

But what is the point of these features? Artificial intelligence has an important subfield called natural language processing (NLP). NLP works toward computers that can hear and understand human language; only on that basis can people and computers hold a conversation. The main features of this site (segmentation and tagging, sentiment analysis, keyword extraction, related words, and so on) are exactly the core low-level techniques of NLP. Dialogue systems that can converse with people, such as Siri, Xiao AI, and Microsoft XiaoIce, are likewise built on top of NLP. However grand the finished building, it cannot stand without a solid foundation; basic word-level processing is one such "foundation" of artificial intelligence, so do not underestimate this site's basic language-processing features.

import requests

url = 'http://ictclas.nlpir.org/nlpir/index6/getWord2Vec.do'
headers = {
    'referer': 'http://ictclas.nlpir.org/nlpir/',
    'user
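Since the original snippet is cut off, here is a sketch of such a request using only the standard library. The URL and Referer header come from the post; the "word" form field and the User-Agent value are guesses, because the original code is truncated before the payload:

```python
import urllib.parse
import urllib.request

# Endpoint and Referer are from the post; the form field name is a guess.
url = "http://ictclas.nlpir.org/nlpir/index6/getWord2Vec.do"
data = urllib.parse.urlencode({"word": "人工智能"}).encode("utf-8")
req = urllib.request.Request(url, data=data, headers={
    "Referer": "http://ictclas.nlpir.org/nlpir/",
    "User-Agent": "Mozilla/5.0",
})
# resp = urllib.request.urlopen(req)  # uncomment to actually send the request
```

Supplying `data` makes `urllib` issue a POST; the Referer header matters because the site checks where the request came from.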

Convert GMM-UBM scores to equivalent accuracy percent

此生再无相见时 submitted on 2020-01-17 05:27:19
Question: I have constructed a GMM-UBM model for speaker recognition. The models adapted for each speaker output scores computed as log-likelihood ratios. Now I want to convert these likelihood scores to an equivalent number between 0 and 100. Can anybody guide me, please?

Answer 1: There is no straightforward formula. You can do simple things like prob = exp(logratio_score), but those might not reflect the true distribution of your data. The computed probability percentage of your samples
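One common way to squash an unbounded log-likelihood ratio into a 0-100 range is a logistic (sigmoid) mapping. The scale and shift below are placeholders that would need calibrating on held-out scores (e.g. Platt scaling), since, as the answer notes, raw exponentiation may not match the true score distribution:

```python
from math import exp

def llr_to_percent(llr, scale=1.0, shift=0.0):
    """Map a log-likelihood-ratio score onto [0, 100] with a logistic squash.

    scale and shift are calibration parameters: fit them on a held-out set of
    scores with known same/different-speaker labels (Platt scaling).
    """
    p = 1.0 / (1.0 + exp(-(scale * llr - shift)))
    return 100.0 * p
```

With the default parameters a score of 0 (equal likelihood under both hypotheses) maps to 50, and large positive or negative scores saturate toward 100 or 0.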

NoClassDefFoundError: opennlp/tools/chunker/ChunkerModel

偶尔善良 submitted on 2020-01-17 05:19:49
Question: I got this error while trying OpenNLP chunking: NoClassDefFoundError: opennlp/tools/chunker/ChunkerModel

Here is the basic code:

import java.io.*;
import opennlp.tools.chunker.*;

public class test {
    public static void main(String[] args) throws IOException {
        ChunkerModel model = null;
        InputStream modelIn = new FileInputStream("en-parser-chunking.bin");
        model = new ChunkerModel(modelIn);
    }
}

Answer 1: I don't see any NLP-specific reasons here, so just check tutorials about NoClassDefFoundError, for

opennlp sample training data for disease

笑着哭i submitted on 2020-01-17 05:03:32
Question: I'm using OpenNLP for data classification. I could not find a TokenNameFinderModel for diseases here. I know I can create my own model, but I was wondering whether any large sample of training data is available for diseases?

Answer 1: You can easily create your own training data set using the modelbuilder addon, and follow the rules mentioned here to create a good NER model. You can find some help on using the modelbuilder addon here. Basically, you put all the information in a text file and the NER
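For reference, OpenNLP's name-finder training data is plain text, one sentence per line, with each entity wrapped in <START:type> ... <END> tags. A couple of made-up lines for a "disease" type might look like:

```
He was diagnosed with <START:disease> type 2 diabetes <END> last year .
Early symptoms of <START:disease> influenza <END> include fever and cough .
```

OpenNLP's documentation recommends on the order of 15,000 such sentences for a robust model, which is why a pre-existing annotated corpus is worth searching for first.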

Data mining term "fledged"?

一曲冷凌霜 submitted on 2020-01-16 18:18:08
Question: Please tell me, what is the term "full fledged KI"? As I understand it, it is part of data mining for text analysis. Am I right? Some interesting and useful links would be great. Thank you!

Answer 1: By "full fledged", he likely means "fully fledged", defined as "developed or matured to the fullest degree; of full rank or status" (source: thefreedictionary.com). Not sure about KI, but possibly it means: http://en.wikipedia.org/wiki/Knowledge_integration

Answer 2: My guess is that it is a typo of AI or a near-synonym,

Extracting Numbers Based On the Following Term in a String

久未见 submitted on 2020-01-16 16:49:11
Question: I have a batch of data that includes a text variable full of free-form text. I am trying to extract certain information, based on context within the string, into new variables which I can then analyze. I have been digging into qdap and tm. I have normalized the format with tolower and replace_abbreviation, but cannot seem to figure out how to actually extract the information I need. So for example:

library(data.table)
data <- data.table(text = c("Person 1: $1000 fine, 31 months jail", "Person 2:
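The question itself uses R (qdap/tm), but the core of the task is a regular expression keyed on the term that follows each number; the same patterns drop straight into R's regmatches or stringr. As a language-neutral illustration (the function and field names here are invented), the idea in Python:

```python
import re

def extract_amounts(text):
    """Pull out the number preceding a keyword, e.g. '$1000 fine', '31 months jail'."""
    out = {}
    m = re.search(r"\$?(\d+)\s*fine", text)       # number followed by "fine"
    if m:
        out["fine"] = int(m.group(1))
    m = re.search(r"(\d+)\s*months\s*jail", text)  # number followed by "months jail"
    if m:
        out["jail_months"] = int(m.group(1))
    return out
```

Anchoring each capture group to the word that follows it is what distinguishes "1000 is a fine" from "31 is a jail term", which generic number extraction cannot do.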

Is 100 training examples sufficient for training custom NER using spacy? [closed]

主宰稳场 submitted on 2020-01-16 15:39:26
Question: Closed. This question needs to be more focused and is not currently accepting answers. Closed 8 months ago.

I have trained an NER model on name data. I generated some 70 random sentences containing person names and annotated the data in spaCy's format. I trained a custom NER using both the blank 'en' model and 'en_core_web_sm', but when I tested on any

Convert NL string to vector or some numeric equivalent

若如初见. submitted on 2020-01-16 12:09:13
Question: I'm trying to convert a string to a numeric equivalent so I can train a neural network to classify the strings. I tried the sum of the ASCII values, but that just distinguishes larger numbers from smaller ones. For example, a short string in German gets put into the English class, because the English words the network was trained on are short and numerically small. I was looking into Google's word2vec, which seems like it should work, but I want to do this on the client
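Instead of summing ASCII codes (which collapses very different strings onto similar magnitudes), a fixed-length vector of hashed character n-gram counts preserves far more signal and needs no external model on the client. This is a minimal sketch of the "hashing trick", not word2vec itself; dimension and n-gram size are arbitrary choices here:

```python
from zlib import crc32

def char_ngram_vector(text, n=2, dim=64):
    """Hash character n-grams into a fixed-size count vector (hashing trick)."""
    vec = [0] * dim
    text = text.lower()
    for i in range(len(text) - n + 1):
        gram = text[i:i + n].encode("utf-8")
        vec[crc32(gram) % dim] += 1   # crc32 keeps the mapping deterministic
    return vec
```

Character bigrams are a decent signal for language identification in particular (German and English have very different bigram distributions), and the fixed dimension makes the output directly usable as a neural-network input layer.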