nlp

what is the minimum dataset size needed for good performance with doc2vec?

Posted by 流过昼夜 on 2020-01-02 03:42:09
Question: How does doc2vec perform when trained on different sized datasets? There is no mention of dataset size in the original corpus, so I am wondering what minimum size is required to get good performance out of doc2vec. Answer 1: A bunch of things have been called 'doc2vec', but it most often refers to the 'Paragraph Vector' technique from Le and Mikolov. The original 'Paragraph Vector' paper describes evaluating it on three datasets: 'Stanford Sentiment Treebank': 11,825 sentences of movie
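For reference, a minimal training sketch using gensim's `Doc2Vec` (the most common 'Paragraph Vector' implementation) might look like the following; the tiny corpus and the parameter values are purely illustrative and say nothing about the dataset size needed in practice:

```python
# A minimal gensim Doc2Vec sketch (assumes gensim is installed; `texts` is a
# stand-in corpus -- real use would need many thousands of documents).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = [
    "this movie was wonderful",
    "the plot made no sense at all",
    "a fine performance by the whole cast",
]

# Each document gets a unique integer tag so its vector can be looked up later.
corpus = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(texts)]

model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

print(model.dv[0])  # vector for the first document (model.docvecs[0] on older gensim)
```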

How to extract chunks from BIO chunked sentences? - python

Posted by 久未见 on 2020-01-02 02:06:09
Question: Given an input sentence that has BIO chunk tags: [('What', 'B-NP'), ('is', 'B-VP'), ('the', 'B-NP'), ('airspeed', 'I-NP'), ('of', 'B-PP'), ('an', 'B-NP'), ('unladen', 'I-NP'), ('swallow', 'I-NP'), ('?', 'O')] I would need to extract the relevant phrases, e.g. if I want to extract 'NP', I would need to extract the fragments of tuples that contain B-NP and I-NP. [out]: [('What', '0'), ('the airspeed', '2-3'), ('an unladen swallow', '5-6-7')] (Note: the numbers in the extract tuples
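One possible pure-Python sketch for pulling a given chunk type (e.g. 'NP') out of BIO-tagged tokens; the output format mirrors the example in the question:

```python
# Collect contiguous B-X/I-X runs for one chunk type, keeping token indices.
def extract_chunks(tagged, chunk_type='NP'):
    chunks, current, indices = [], [], []
    for i, (token, tag) in enumerate(tagged):
        if tag == 'B-' + chunk_type:                # a new chunk starts
            if current:
                chunks.append((' '.join(current), '-'.join(indices)))
            current, indices = [token], [str(i)]
        elif tag == 'I-' + chunk_type and current:  # continuation of the chunk
            current.append(token)
            indices.append(str(i))
        else:                                       # outside the chunk type of interest
            if current:
                chunks.append((' '.join(current), '-'.join(indices)))
            current, indices = [], []
    if current:
        chunks.append((' '.join(current), '-'.join(indices)))
    return chunks

sent = [('What', 'B-NP'), ('is', 'B-VP'), ('the', 'B-NP'), ('airspeed', 'I-NP'),
        ('of', 'B-PP'), ('an', 'B-NP'), ('unladen', 'I-NP'), ('swallow', 'I-NP'), ('?', 'O')]
print(extract_chunks(sent, 'NP'))
# [('What', '0'), ('the airspeed', '2-3'), ('an unladen swallow', '5-6-7')]
```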

How should I vectorize the following list of lists with scikit learn?

Posted by 橙三吉。 on 2020-01-02 01:25:07
Question: I would like to vectorize a list of lists with scikit-learn. I go to the path where I have the training texts, read them, and then obtain something like this: corpus = [["this is spam, 'SPAM'"],["this is ham, 'HAM'"],["this is nothing, 'NOTHING'"]] from sklearn.feature_extraction.text import CountVectorizer vect = CountVectorizer(analyzer='word') vect_representation= vect.fit_transform(corpus) print vect_representation.toarray() And I get the following: return lambda x: strip_accents(x
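A hedged sketch of one way around the error: `CountVectorizer` expects an iterable of strings, not a list of one-element lists, so flatten the corpus (and split out the labels) before fitting. The label-splitting here is a crude illustration based on the sample data, not a general solution:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [["this is spam, 'SPAM'"], ["this is ham, 'HAM'"], ["this is nothing, 'NOTHING'"]]

texts, labels = [], []
for (doc,) in corpus:                      # each inner list holds exactly one string
    text, _, label = doc.rpartition(',')   # crude split of the text from its label
    texts.append(text)
    labels.append(label.strip().strip("'"))

vect = CountVectorizer(analyzer='word')
X = vect.fit_transform(texts)              # documents as rows, vocabulary as columns
print(vect.get_feature_names_out())        # use get_feature_names() on older scikit-learn
print(X.toarray())
print(labels)                              # ['SPAM', 'HAM', 'NOTHING']
```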

Detect/Parse Mailing Addresses in Text

Posted by 耗尽温柔 on 2020-01-01 22:13:12
Question: Are there any open-source or commercial libraries out there that can detect mailing addresses in text, just like how Apple's Mail app underlines addresses on the Mac/iPhone? I've been doing a little online research and the ideas seem to be either to use Google, regexes, or a full-on NLP package such as Stanford's NLP, which are usually pretty massive. I doubt the iPhone has a 500MB NLP package in there, or connects to Google every time you read an email. Which makes me believe there should be an
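To make the regex option concrete, here is a deliberately simplistic sketch for US-style street addresses; it is illustrative only, and real detectors (e.g. the libpostal or usaddress libraries, or Apple's built-in data detectors) handle far more variation than a single pattern can:

```python
import re

# house number + one to three capitalized street-name words + a street suffix
ADDRESS_RE = re.compile(
    r'\b\d{1,6}\s+'
    r'(?:[A-Z][a-z]+\s){1,3}'
    r'(?:St|Street|Ave|Avenue|Rd|Road|Blvd|Boulevard|Dr|Drive|Ln|Lane)\.?\b'
)

text = "Please ship it to 1600 Pennsylvania Avenue before Friday."
print(ADDRESS_RE.findall(text))  # ['1600 Pennsylvania Avenue']
```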

Triple extraction from a sentence

Posted by 别来无恙 on 2020-01-01 19:24:19
Question: I have this parsed text in the following format; I got it using Stanford NLP. (ROOT (S (NP (DT A) (NN passenger) (NN plane)) (VP (VBZ has) (VP (VBD crashed) (ADVP (RB shortly)) (PP (IN after) (NP (NP (NN take-off)) (PP (IN from) (NP (NNP Kyrgyzstan) (`` `) (NNP scapital) (, ,) (NNP Bishkek))))) (, ,) (VP (VBG killing) (NP (NP (DT a) (JJ large) (NN number)) (PP (IN of) (NP (NP (DT those)) (PP (IN on) (NP (NN board))))))))) (. .))) det(plane-3, A-1) nn(plane-3, passenger-2) nsubj(crashed-5, plane-3)
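One common starting point for triples is to work from the typed dependencies rather than the constituency tree: pair each verb's nsubj with its dobj (if any). A hedged sketch, where the dependency lines are taken from the excerpt above plus one clearly hypothetical extra line for demonstration:

```python
import re
from collections import defaultdict

# Dependency lines from the question; the dobj line is a hypothetical example,
# not the actual Stanford output for this sentence.
deps_text = """
det(plane-3, A-1)
nn(plane-3, passenger-2)
nsubj(crashed-5, plane-3)
dobj(killing-17, number-20)
"""

DEP_RE = re.compile(r'(\w+)\(([^-]+)-\d+,\s*([^-]+)-\d+\)')

subjects, objects = defaultdict(list), defaultdict(list)
for rel, head, dep in DEP_RE.findall(deps_text):
    if rel == 'nsubj':
        subjects[head].append(dep)
    elif rel == 'dobj':
        objects[head].append(dep)

# Build rough (subject, verb, object) triples; '-' marks a missing slot.
verbs = set(subjects) | set(objects)
triples = [(s, v, o)
           for v in verbs
           for s in subjects.get(v, ['-'])
           for o in objects.get(v, ['-'])]
print(triples)  # e.g. [('plane', 'crashed', '-'), ('-', 'killing', 'number')]
```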

Dictionary-Based Named Entity Recognition with zero edit distance: LingPipe, Lucene or what?

Posted by 北城以北 on 2020-01-01 19:04:27
Question: I'm trying to perform dictionary-based NER on some documents. My dictionary, regardless of the datatype, consists of key-value pairs of strings. I want to search for all the keys in the document and return the corresponding value whenever a match occurs. The problem is that my dictionary is fairly large: ~7 million key-value pairs, with an average key length of 8 characters and an average value length of 20 characters. I've tried LingPipe with MapDictionary, but on my desired environment setup it runs
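One common alternative for exact (zero-edit-distance) matching over millions of keys is an Aho-Corasick automaton, which scans the document in a single pass. A hedged sketch using the third-party pyahocorasick package (`pip install pyahocorasick`); the two-entry dictionary is a stand-in for the ~7 million pairs:

```python
import ahocorasick

dictionary = {                       # stand-in for the real ~7M key-value pairs
    "New York": "LOCATION",
    "Barack Obama": "PERSON",
}

A = ahocorasick.Automaton()
for key, value in dictionary.items():
    A.add_word(key, (key, value))    # store the key too, so matches can be located
A.make_automaton()

text = "Barack Obama visited New York last week."
for end, (key, value) in A.iter(text):
    start = end - len(key) + 1
    print(start, end, key, value)
```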

A massive cry for help on Reddit: using machine learning to achieve artificial general intelligence is a pipe dream!

Posted by 不问归期 on 2020-01-01 13:21:21
Source: Reddit. [新智元 Editor's Note] Artificial general intelligence (AGI) has always been the core goal of the AI field, but we are still very far from true AGI. Today's hottest Reddit thread is a discussion about AGI: the poster's dream is to build AGI, but he has been assigned to work on ML problems in the NLP domain. He finds himself thoroughly bored with machine learning, and helpful commenters have offered their suggestions. The core goal of the field of artificial intelligence is that one day we will be able to build machines as intelligent as humans; such systems are usually called artificial general intelligence (AGI) systems. So far we have built countless AI systems that can outperform humans on specific tasks, but when it comes to general mental activity, no current AI system can match even a mouse, let alone surpass a human. Today's hottest Reddit thread is exactly such a discussion about AGI. The thread's author, "u/bguerra91", does research at a university, and over the past few weeks

Setting NLTK with Stanford NLP (both StanfordNERTagger and StanfordPOSTagger) for Spanish

Posted by China☆狼群 on 2020-01-01 12:11:32
Question: The NLTK documentation for this integration is rather poor. The steps I followed were: Download http://nlp.stanford.edu/software/stanford-postagger-full-2015-04-20.zip to /home/me/stanford Download http://nlp.stanford.edu/software/stanford-spanish-corenlp-2015-01-08-models.jar to /home/me/stanford Then in an IPython console: In [11]: import nltk In [12]: nltk.__version__ Out[12]: '3.1' In [13]: from nltk.tag import StanfordNERTagger Then st = StanfordNERTagger('/home/me/stanford/stanford
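For the POS side, a hedged sketch of the wiring might look like this, assuming the zip above was unpacked to /home/me/stanford/stanford-postagger-full-2015-04-20 and that a Spanish model file ships in its models/ directory; the exact file names are assumptions, so check the distribution, and Java must be on the PATH for NLTK to launch the tagger:

```python
from nltk.tag import StanfordPOSTagger

# Paths below are illustrative -- adjust to wherever the distribution was unpacked.
base = '/home/me/stanford/stanford-postagger-full-2015-04-20'
spanish_pos = StanfordPOSTagger(
    model_filename=base + '/models/spanish.tagger',
    path_to_jar=base + '/stanford-postagger.jar',
)
print(spanish_pos.tag('El perro come una manzana'.split()))
```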

Sentence segmentation tools to use when input sentence has no punctuation (is normalized)

Posted by 微笑、不失礼 on 2020-01-01 11:57:20
Question: Suppose there is a sentence like "find me some jazz music and play it", where all the text is normalized and there are no punctuation marks (the output of a speech recognition library). What online/offline tools can be used to do sentence segmentation, other than the naive approach of splitting on conjunctions? Input: find me some jazz music and play it Output: find me some jazz music play it Answer 1: A dependency parser should help. Answer 2: You can use a semantic role tagger like Mate Tools, etc.
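A hedged sketch of the dependency-parser idea from Answer 1, using spaCy (requires `pip install spacy` and the en_core_web_sm model): split wherever a verb is conjoined ('conj') to another verb, cutting just before the coordinating conjunction. The exact output depends on the parser model, so this is illustrative rather than a robust segmenter:

```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("find me some jazz music and play it")

# Start a new segment at every verb that is a conjunct of another verb.
cut_points = [tok.i for tok in doc
              if tok.dep_ == 'conj' and tok.pos_ == 'VERB' and tok.head.pos_ == 'VERB']

segments, start = [], 0
for cut in cut_points:
    end = cut - 1 if doc[cut - 1].dep_ == 'cc' else cut   # drop the 'and' itself
    segments.append(doc[start:end].text)
    start = cut
segments.append(doc[start:].text)

print(segments)   # expected: ['find me some jazz music', 'play it']
```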