nlp

Earley cannot handle epsilon-states already contained in chart

天涯浪子 submitted on 2019-12-12 05:33:38
Question: I have implemented the Earley parser using a queue to process states. The queue is seeded with the top-level rule. For each state in the queue, one of the operations (prediction, scanning, completion) is performed, adding new states to the queue; duplicate states are not added. The problem I am having is best described with the following grammar: […] When parsing A, the following happens: […] As you can tell, A will not be fully resolved. This is because the completion with the epsilon state will …
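The truncated description above matches a well-known Earley pitfall: when a nonterminal derives epsilon, the empty rule can complete before every state expecting that nonterminal has been added to the same chart column, so those later states are never advanced. The standard remedy (Aycock and Horspool's fix) is to precompute which nonterminals are nullable and have the predictor advance the dot over them immediately. A minimal recognizer sketch, with a hypothetical grammar (S → A A, A → a | ε) standing in for the one elided above:

```python
GRAMMAR = {                 # hypothetical grammar illustrating the epsilon problem
    "S": [["A", "A"]],
    "A": [["a"], []],       # A -> a | epsilon
}

def nullable_symbols(grammar):
    """Fixed-point computation of the set of nullable nonterminals."""
    nullable, changed = set(), True
    while changed:
        changed = False
        for lhs, rules in grammar.items():
            if lhs not in nullable and any(all(s in nullable for s in rhs)
                                           for rhs in rules):
                nullable.add(lhs)
                changed = True
    return nullable

def recognize(tokens, grammar, start="S"):
    nullable = nullable_symbols(grammar)
    # A state is (lhs, rhs, dot, origin); chart[i] holds states ending at i.
    chart = [set() for _ in range(len(tokens) + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, tuple(rhs), 0, 0))
    for i in range(len(tokens) + 1):
        worklist = list(chart[i])
        while worklist:
            lhs, rhs, dot, origin = worklist.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:                        # predict
                    for prod in grammar[sym]:
                        new = (sym, tuple(prod), 0, i)
                        if new not in chart[i]:
                            chart[i].add(new)
                            worklist.append(new)
                    if sym in nullable:                   # Aycock-Horspool fix:
                        new = (lhs, rhs, dot + 1, origin) # skip the nullable
                        if new not in chart[i]:           # symbol right away
                            chart[i].add(new)
                            worklist.append(new)
                elif i < len(tokens) and tokens[i] == sym:  # scan
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:                                         # complete
                for p_lhs, p_rhs, p_dot, p_org in list(chart[origin]):
                    if p_dot < len(p_rhs) and p_rhs[p_dot] == lhs:
                        new = (p_lhs, p_rhs, p_dot + 1, p_org)
                        if new not in chart[i]:
                            chart[i].add(new)
                            worklist.append(new)
    return any(st == (start, tuple(rhs), len(rhs), 0)
               for rhs in grammar[start] for st in chart[len(tokens)])
```

Without the marked fix, the completion of the empty A-rule only advances the parent states already sitting in the chart column, which is exactly the "A will not be fully resolved" symptom described.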

How much text can Weka handle?

风流意气都作罢 submitted on 2019-12-12 04:58:20
Question: I have a sentiment analysis task and I need to specify how much data (in my case, text) Weka can handle. I have a corpus of 2,500 opinions, already tagged. I know that it's a small corpus, but my thesis advisor is asking me to argue specifically about how much data Weka can handle.

Answer 1: Your limit with Weka will be whatever learning algorithm you use and how much memory you have available for training. Most classifiers require the whole training set to be loaded into memory, but there are …
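Since the practical ceiling is usually JVM heap rather than Weka itself, the usual knob is the `-Xmx` flag when launching Weka (or your own code that calls its classes). The jar path and heap size below are illustrative assumptions, not values from the question:

```shell
# Sketch: give the JVM running Weka a 4 GB heap so the whole training
# set fits in memory. Adjust the path and size for your installation.
java -Xmx4g -jar weka.jar
```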

Python: clustering similar words based on word2vec

一曲冷凌霜 submitted on 2019-12-12 04:54:20
Question: This might be a naive question. I have a tokenized corpus on which I have trained Gensim's Word2vec model. The code is as below:

site = Article("http://www.datasciencecentral.com/profiles/blogs/blockchain-and-artificial-intelligence-1")
site.download()
site.parse()

def clean(doc):
    stop_free = " ".join([i for i in word_tokenize(doc.lower()) if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) …
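The excerpt cuts off before the clustering step the title asks about, so here is one hedged sketch of the idea: pull each word's vector out of the trained model (in Gensim, `model.wv[word]`) and group words whose cosine similarity exceeds a threshold. The tiny hand-made vectors below stand in for real Word2vec output:

```python
import math

# Hypothetical word vectors, standing in for model.wv lookups.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "apple": [0.1, 0.2, 0.9],
    "pear":  [0.15, 0.1, 0.95],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster(vectors, threshold=0.95):
    """Greedy single-link clustering: a word joins the first cluster
    containing a member whose cosine similarity exceeds the threshold."""
    clusters = []
    for word, vec in vectors.items():
        for c in clusters:
            if any(cosine(vec, vectors[w]) >= threshold for w in c):
                c.append(word)
                break
        else:
            clusters.append([word])
    return clusters
```

For real vocabularies, k-means from scikit-learn over `model.wv.vectors` is the more common route; the greedy loop above just keeps the sketch dependency-free.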

Stanford POS tagger with GATE twitter model is slow

瘦欲@ submitted on 2019-12-12 04:54:01
Question: I am using the Stanford POS tagger with the GATE Twitter model, and the tagger takes around 3 seconds to initialize. Is this normal, or am I loading it incorrectly? Small sample code:

package tweet.nlp.test;

import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class TweetNLPTest {
    public static void main(String[] args) {
        String text = "My sister won't tell me where she hid my food. She's fueling my anorexia. #bestsisteraward #not 😭💀";
        MaxentTagger tagger = new MaxentTagger("models/gate-EN …
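A few seconds to load a large Maxent model is typical; the practical fix is to construct the tagger once and reuse it for every tweet rather than building it per call. A compilable sketch of that load-once pattern, with a hypothetical `ExpensiveModel` standing in for `MaxentTagger` so the example runs without the Stanford jars:

```java
// Sketch: pay the model-loading cost once, then reuse the instance.
// ExpensiveModel is a hypothetical placeholder for MaxentTagger.
public class LoadOnce {
    static final class ExpensiveModel {
        ExpensiveModel() {
            // Imagine several seconds of model loading here.
            try { Thread.sleep(50); } catch (InterruptedException e) { }
        }
        String tag(String text) { return text + "_TAGGED"; }
    }

    private static ExpensiveModel model;

    // Lazily load the model on first use, then hand back the same instance.
    static synchronized ExpensiveModel get() {
        if (model == null) {
            model = new ExpensiveModel();
        }
        return model;
    }

    public static void main(String[] args) {
        String first = get().tag("the");    // pays the load cost
        String second = get().tag("the");   // reuses the loaded model
        System.out.println(first.equals(second) && get() == get());
    }
}
```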

German CoreNLP model defaulting to English models

让人想犯罪 __ submitted on 2019-12-12 04:34:14
Question: I use the following command to serve a CoreNLP server for the German language models, which are downloaded as a jar on the classpath, but it does not output German tags or parses; it loads only the English models:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -props ./german.prop

german.prop contents:

annotators = tokenize, ssplit, pos, depparse, parse
tokenize.language = de
pos.model = edu/stanford/nlp/models/pos-tagger/german/german-hgc.tagger
ner.model = edu/stanford/nlp/models …
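One thing worth checking (an assumption, depending on the CoreNLP release in use): `StanfordCoreNLPServer` reads its default annotator settings from `-serverProperties`, not `-props`, so the German properties file may simply be ignored here. A sketch of the adjusted launch:

```shell
# Sketch: pass the German properties via -serverProperties and make sure
# the German models jar sits in the classpath directory matched by "*".
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
     -serverProperties german.prop -port 9000
```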

Extract wikipedia articles belonging to a category from offline dumps

為{幸葍}努か submitted on 2019-12-12 04:25:11
Question: I have Wikipedia article dumps in different languages. I want to filter them to the articles that belong to a category (specifically Category:WikiProject_Biography). I found many similar questions, for example:

Wikipedia API to get articles belonging to a category
How do I get all articles about people from Wikipedia?

However, I would like to do it all offline, that is, using dumps, and also for different languages. Other things I explored are the category table and the categorylinks table. …
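For a fully offline approach, one option is to stream the pages-articles XML dump and keep pages whose wikitext contains the category link. A minimal sketch using only the standard library; real dumps also need namespace filtering and the localized `Category:` prefix for each language (and note that WikiProject categories often sit on talk pages, so the talk-page dump may be the one to scan):

```python
import re
import xml.etree.ElementTree as ET

def pages_in_category(xml_stream, category):
    """Stream a MediaWiki export and yield titles of pages whose wikitext
    links the given category. Sketch only: assumes the English prefix."""
    pattern = re.compile(r"\[\[\s*Category\s*:\s*" + re.escape(category),
                         re.IGNORECASE)
    title = None
    for event, elem in ET.iterparse(xml_stream):
        tag = elem.tag.rsplit("}", 1)[-1]   # drop any XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "text":
            if elem.text and pattern.search(elem.text):
                yield title
        elif tag == "page":
            elem.clear()                    # keep memory flat on big dumps
```

`iterparse` with `elem.clear()` keeps memory roughly constant, which matters for multi-gigabyte dumps; the alternative route is loading the `categorylinks` SQL dump into a local database and joining against `page`.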

NLP Recurrent Neural Network always gives constant values

不想你离开。 submitted on 2019-12-12 04:22:28
Question: I've written a simple recurrent network in TensorFlow based on this video that I watched: https://youtu.be/vq2nnJ4g6N0?t=8546 In the video, the RNN is demonstrated to produce Shakespeare plays by having the network emit text one character at a time. The output of the network is fed back into the input on the next iteration. Here's a diagram of my network:

[ASCII diagram, garbled in this excerpt: the input characters "H E L L O W O R L" feed into a recursive layer whose output loops back into the input] …
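Whether or not it is the bug here, one common reason a character-level RNN "always gives constant values" at generation time is greedy decoding: feeding back the argmax of the output distribution quickly collapses into a repeating loop. The usual remedy, independent of the exact TensorFlow graph, is to sample from the softmax with a temperature. A small self-contained sketch (plain Python, no TensorFlow):

```python
import math
import random

def sample_char(logits, temperature=1.0, rng=random):
    """Sample an index from softmax(logits / temperature).
    Low temperature approaches argmax; temperature 1.0 samples the
    model's distribution, keeping generated text varied."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```

If the network emits constants even during training, the more likely culprits are a state that is reset every step or a learning-rate/initialization problem, so this only addresses the decoding side.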

How to apply grepl for data frame

99封情书 submitted on 2019-12-12 03:38:10
Question: I want to use grepl with multiple patterns defined in a data frame. df_sen contains sentences:

"She would like to go there"
"I had it few days ago"
"We have spent few millions"

df_triggers looks like this:

trigger
few days
few millions

I want to create a sentences × triggers matrix whose entries are 1 if the trigger was found in the sentence and 0 if it was not. I have tried:

matrix <- grepl(df_triggers$trigger, df_sen$sentence)

But I see the …
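The core issue is that `grepl` takes a single pattern, so passing the whole `df_triggers$trigger` column only uses its first element (with a warning); the fix in R is to loop over the triggers, e.g. with `sapply`. The same sentence-by-trigger loop, sketched here in Python with the example data above:

```python
sentences = ["She would like to go there",
             "I had it few days ago",
             "We have spent few millions"]
triggers = ["few days", "few millions"]

# One row per sentence, one column per trigger: 1 if the trigger
# occurs as a substring of the sentence, else 0.
matrix = [[1 if trig in sent else 0 for trig in triggers]
          for sent in sentences]
```

The equivalent R shape would be `sapply(df_triggers$trigger, grepl, df_sen$sentence)`, which applies `grepl` once per trigger and binds the results into a logical matrix.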

What's a simple way to efficiently find specific terms or phrases within a short unknown string?

大城市里の小女人 submitted on 2019-12-12 03:37:56
Question: I'm working on a Twitter-feed visualization. I have a big dataset, and I only want to use tweet messages that contain specific strings of words. I now have this line:

data = data.filter(function(d, i) { return d.text.indexOf('new year') != -1 ? true : false; });

It returns all the tweets in the feed that contain the string 'new year'. Works fine! :) But how do I select multiple strings? Actually, I want this piece to also return the tweets that contain variations like 'newyear' and/or 'happy new …
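One way to match several variations at once is to replace the `indexOf` test with a single regular expression whose pattern covers the variants; `/new\s*year/i` matches 'new year', 'newyear', and 'Happy New Year' case-insensitively. A self-contained sketch (the sample tweets below are made up):

```javascript
// Sketch: one regex covering the variants instead of chained indexOf calls.
// \s* allows zero or more spaces between the words; the i flag ignores case.
const pattern = /new\s*year/i;

const data = [
  { text: "Happy New Year everyone!" },
  { text: "#newyear resolutions" },
  { text: "just had lunch" },
];

const filtered = data.filter(d => pattern.test(d.text));
console.log(filtered.length); // 2
```

For unrelated phrases, alternation works the same way, e.g. `/new\s*year|christmas/i`.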

use perl to extract specific output lines

浪尽此生 submitted on 2019-12-12 03:16:21
Question: I'm trying to build a system that generalizes rules from input text. I'm using ReVerb to create my initial set of rules, with the following command[*], for instance:

$ echo "Bananas are an excellent source of potassium." | ./reverb -q | tr '\t' '\n' | cat -n

which generates output of the form:

 1  stdin
 2  1
 3  Bananas
 4  are an excellent source of
 5  potassium
 6  0
 7  1
 8  1
 9  6
10  6
11  7
12  0.9999999997341693
13  Bananas are an excellent source of potassium .
14  NNS VBP DT JJ NN IN NN .
15  B-NP B-VP …
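If the goal is specific fields rather than the full numbered listing, one option is to split on tabs and print just the wanted columns; judging from the listing above, the extraction triple appears to sit in columns 3-5 (arg1, relation, arg2) of ReVerb's tab-separated output. A hedged sketch:

```shell
# Sketch: print only the arg1 / relation / arg2 fields of each extraction.
# Column positions are inferred from the numbered listing above.
echo "Bananas are an excellent source of potassium." \
  | ./reverb -q \
  | awk -F'\t' '{ print $3 " | " $4 " | " $5 }'
```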