nlp

Earley cannot handle epsilon-states already contained in chart

天涯浪子 submitted on 2019-12-12 05:33:38
Question: I have implemented the Earley parser using a queue to process states. The queue is seeded with the top-level rule. For each state in the queue, one of the operations (prediction, scanning, completion) is performed, adding new states to the queue; duplicate states are not added. The problem I am having is best described with the following grammar: […] When parsing A, the following happens: […] As you can tell, A will not be fully resolved. This is because the completion with the epsilon state will …
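The truncated description above matches a well-known Earley pitfall: when a nonterminal derives epsilon, the empty rule can complete before every state expecting that nonterminal has been added to the same chart column, so those later states are never advanced. The standard remedy (Aycock and Horspool's fix) is to precompute which nonterminals are nullable and have the predictor advance the dot over them immediately. A minimal recognizer sketch, with a hypothetical grammar (S → A A, A → a | ε) standing in for the one elided above:

```python
GRAMMAR = {                 # hypothetical grammar illustrating the epsilon problem
    "S": [["A", "A"]],
    "A": [["a"], []],       # A -> a | epsilon
}

def nullable_symbols(grammar):
    """Fixed-point computation of the set of nullable nonterminals."""
    nullable, changed = set(), True
    while changed:
        changed = False
        for lhs, rules in grammar.items():
            if lhs not in nullable and any(all(s in nullable for s in rhs)
                                           for rhs in rules):
                nullable.add(lhs)
                changed = True
    return nullable

def recognize(tokens, grammar, start="S"):
    nullable = nullable_symbols(grammar)
    # A state is (lhs, rhs, dot, origin); chart[i] holds states ending at i.
    chart = [set() for _ in range(len(tokens) + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, tuple(rhs), 0, 0))
    for i in range(len(tokens) + 1):
        worklist = list(chart[i])
        while worklist:
            lhs, rhs, dot, origin = worklist.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:                        # predict
                    for prod in grammar[sym]:
                        new = (sym, tuple(prod), 0, i)
                        if new not in chart[i]:
                            chart[i].add(new)
                            worklist.append(new)
                    if sym in nullable:                   # Aycock-Horspool fix:
                        new = (lhs, rhs, dot + 1, origin) # skip the nullable
                        if new not in chart[i]:           # symbol right away
                            chart[i].add(new)
                            worklist.append(new)
                elif i < len(tokens) and tokens[i] == sym:  # scan
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:                                         # complete
                for p_lhs, p_rhs, p_dot, p_org in list(chart[origin]):
                    if p_dot < len(p_rhs) and p_rhs[p_dot] == lhs:
                        new = (p_lhs, p_rhs, p_dot + 1, p_org)
                        if new not in chart[i]:
                            chart[i].add(new)
                            worklist.append(new)
    return any(st == (start, tuple(rhs), len(rhs), 0)
               for rhs in grammar[start] for st in chart[len(tokens)])
```

Without the marked fix, the completion of the empty A-rule only advances the parent states already sitting in the chart column, which is exactly the "A will not be fully resolved" symptom described.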

How much text can Weka handle?

风流意气都作罢 submitted on 2019-12-12 04:58:20
Question: I have a sentiment analysis task and I need to specify how much data (in my case, text) Weka can handle. I have a corpus of 2,500 opinions, already tagged. I know that it's a small corpus, but my thesis advisor is asking me to argue specifically about how much data Weka can handle.

Answer 1: Your limit with Weka will be whatever learning algorithm you use and how much memory you have available for training. Most classifiers require the whole training set to be loaded into memory, but there are …
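Since the practical ceiling is usually JVM heap rather than Weka itself, the usual knob is the `-Xmx` flag when launching Weka (or your own code that calls its classes). The jar path and heap size below are illustrative assumptions, not values from the question:

```shell
# Sketch: give the JVM running Weka a 4 GB heap so the whole training
# set fits in memory. Adjust the path and size for your installation.
java -Xmx4g -jar weka.jar
```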

Python: clustering similar words based on word2vec

一曲冷凌霜 submitted on 2019-12-12 04:54:20
Question: This might be a naive question. I have a tokenized corpus on which I have trained Gensim's Word2vec model. The code is as below:

site = Article("http://www.datasciencecentral.com/profiles/blogs/blockchain-and-artificial-intelligence-1")
site.download()
site.parse()

def clean(doc):
    stop_free = " ".join([i for i in word_tokenize(doc.lower()) if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) …
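The excerpt cuts off before the clustering step the title asks about, so here is one hedged sketch of the idea: pull each word's vector out of the trained model (in Gensim, `model.wv[word]`) and group words whose cosine similarity exceeds a threshold. The tiny hand-made vectors below stand in for real Word2vec output:

```python
import math

# Hypothetical word vectors, standing in for model.wv lookups.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "apple": [0.1, 0.2, 0.9],
    "pear":  [0.15, 0.1, 0.95],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster(vectors, threshold=0.95):
    """Greedy single-link clustering: a word joins the first cluster
    containing a member whose cosine similarity exceeds the threshold."""
    clusters = []
    for word, vec in vectors.items():
        for c in clusters:
            if any(cosine(vec, vectors[w]) >= threshold for w in c):
                c.append(word)
                break
        else:
            clusters.append([word])
    return clusters
```

For real vocabularies, k-means from scikit-learn over `model.wv.vectors` is the more common route; the greedy loop above just keeps the sketch dependency-free.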

Stanford POS tagger with GATE twitter model is slow

瘦欲@ submitted on 2019-12-12 04:54:01
Question: I am using the Stanford POS tagger with the GATE Twitter model, and the tagger takes around 3 seconds to initialize. Is this normal, or am I loading it incorrectly? Small sample code:

package tweet.nlp.test;

import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class TweetNLPTest {
    public static void main(String[] args) {
        String text = "My sister won't tell me where she hid my food. She's fueling my anorexia. #bestsisteraward #not 😭💀";
        MaxentTagger tagger = new MaxentTagger("models/gate-EN …
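A few seconds to load a large Maxent model is typical; the practical fix is to construct the tagger once and reuse it for every tweet rather than building it per call. A compilable sketch of that load-once pattern, with a hypothetical `ExpensiveModel` standing in for `MaxentTagger` so the example runs without the Stanford jars:

```java
// Sketch: pay the model-loading cost once, then reuse the instance.
// ExpensiveModel is a hypothetical placeholder for MaxentTagger.
public class LoadOnce {
    static final class ExpensiveModel {
        ExpensiveModel() {
            // Imagine several seconds of model loading here.
            try { Thread.sleep(50); } catch (InterruptedException e) { }
        }
        String tag(String text) { return text + "_TAGGED"; }
    }

    private static ExpensiveModel model;

    // Lazily load the model on first use, then hand back the same instance.
    static synchronized ExpensiveModel get() {
        if (model == null) {
            model = new ExpensiveModel();
        }
        return model;
    }

    public static void main(String[] args) {
        String first = get().tag("the");    // pays the load cost
        String second = get().tag("the");   // reuses the loaded model
        System.out.println(first.equals(second) && get() == get());
    }
}
```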

German CoreNLP model defaulting to English models

让人想犯罪 __ submitted on 2019-12-12 04:34:14
Question: I use the following command to serve a CoreNLP server for the German language models, which are downloaded as a jar on the classpath, but it does not output German tags or parses; it loads only the English models:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -props ./german.prop

german.prop contents:

annotators = tokenize, ssplit, pos, depparse, parse
tokenize.language = de
pos.model = edu/stanford/nlp/models/pos-tagger/german/german-hgc.tagger
ner.model = edu/stanford/nlp/models …
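One thing worth checking (an assumption, depending on the CoreNLP release in use): `StanfordCoreNLPServer` reads its default annotator settings from `-serverProperties`, not `-props`, so the German properties file may simply be ignored here. A sketch of the adjusted launch:

```shell
# Sketch: pass the German properties via -serverProperties and make sure
# the German models jar sits in the classpath directory matched by "*".
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
     -serverProperties german.prop -port 9000
```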

Extract wikipedia articles belonging to a category from offline dumps

為{幸葍}努か submitted on 2019-12-12 04:25:11
Question: I have Wikipedia article dumps in different languages. I want to filter them to the articles that belong to a category (specifically Category:WikiProject_Biography). I found many similar questions, for example:

Wikipedia API to get articles belonging to a category
How do I get all articles about people from Wikipedia?

However, I would like to do it all offline, that is, using dumps, and also for different languages. Other things I explored are the category table and the categorylinks table. …
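For a fully offline approach, one option is to stream the pages-articles XML dump and keep pages whose wikitext contains the category link. A minimal sketch using only the standard library; real dumps also need namespace filtering and the localized `Category:` prefix for each language (and note that WikiProject categories often sit on talk pages, so the talk-page dump may be the one to scan):

```python
import re
import xml.etree.ElementTree as ET

def pages_in_category(xml_stream, category):
    """Stream a MediaWiki export and yield titles of pages whose wikitext
    links the given category. Sketch only: assumes the English prefix."""
    pattern = re.compile(r"\[\[\s*Category\s*:\s*" + re.escape(category),
                         re.IGNORECASE)
    title = None
    for event, elem in ET.iterparse(xml_stream):
        tag = elem.tag.rsplit("}", 1)[-1]   # drop any XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "text":
            if elem.text and pattern.search(elem.text):
                yield title
        elif tag == "page":
            elem.clear()                    # keep memory flat on big dumps
```

`iterparse` with `elem.clear()` keeps memory roughly constant, which matters for multi-gigabyte dumps; the alternative route is loading the `categorylinks` SQL dump into a local database and joining against `page`.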

NLP Recurrent Neural Network always gives constant values

不想你离开。 submitted on 2019-12-12 04:22:28
Question: I've written a simple recurrent network in TensorFlow based on this video that I watched: https://youtu.be/vq2nnJ4g6N0?t=8546 In the video, the RNN is demonstrated to produce Shakespeare plays by having the network emit text one character at a time. The output of the network is fed back into the input on the next iteration. Here's a diagram of my network:

[ASCII diagram, garbled in this excerpt: the input characters "H E L L O W O R L" feed into a recursive layer whose output loops back into the input] …
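Whether or not it is the bug here, one common reason a character-level RNN "always gives constant values" at generation time is greedy decoding: feeding back the argmax of the output distribution quickly collapses into a repeating loop. The usual remedy, independent of the exact TensorFlow graph, is to sample from the softmax with a temperature. A small self-contained sketch (plain Python, no TensorFlow):

```python
import math
import random

def sample_char(logits, temperature=1.0, rng=random):
    """Sample an index from softmax(logits / temperature).
    Low temperature approaches argmax; temperature 1.0 samples the
    model's distribution, keeping generated text varied."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```

If the network emits constants even during training, the more likely culprits are a state that is reset every step or a learning-rate/initialization problem, so this only addresses the decoding side.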

How to apply grepl for data frame

99封情书 submitted on 2019-12-12 03:38:10
Question: I want to use grepl with multiple patterns defined in a data frame. df_sen contains sentences:

"She would like to go there"
"I had it few days ago"
"We have spent few millions"

df_triggers looks like this:

trigger
few days
few millions

I want to create a sentences × triggers matrix whose entries are 1 if the trigger was found in the sentence and 0 if it was not. I have tried:

matrix <- grepl(df_triggers$trigger, df_sen$sentence)

But I see the …
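The core issue is that `grepl` takes a single pattern, so passing the whole `df_triggers$trigger` column only uses its first element (with a warning); the fix in R is to loop over the triggers, e.g. with `sapply`. The same sentence-by-trigger loop, sketched here in Python with the example data above:

```python
sentences = ["She would like to go there",
             "I had it few days ago",
             "We have spent few millions"]
triggers = ["few days", "few millions"]

# One row per sentence, one column per trigger: 1 if the trigger
# occurs as a substring of the sentence, else 0.
matrix = [[1 if trig in sent else 0 for trig in triggers]
          for sent in sentences]
```

The equivalent R shape would be `sapply(df_triggers$trigger, grepl, df_sen$sentence)`, which applies `grepl` once per trigger and binds the results into a logical matrix.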

What's a simple way to efficiently find specific terms or phrases within a short unknown string?

大城市里の小女人 submitted on 2019-12-12 03:37:56
Question: I'm working on a Twitter-feed visualization. I have a big dataset, and I only want to use tweet messages that contain specific strings of words. I now have this line:

data = data.filter(function(d, i) { return d.text.indexOf('new year') != -1 ? true : false; });

It returns all the tweets in the feed that contain the string 'new year'. Works fine! :) But how do I select multiple strings? Actually, I want this piece to also return the tweets that contain variations like 'newyear' and/or 'happy new …
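One way to match several variations at once is to replace the `indexOf` test with a single regular expression whose pattern covers the variants; `/new\s*year/i` matches 'new year', 'newyear', and 'Happy New Year' case-insensitively. A self-contained sketch (the sample tweets below are made up):

```javascript
// Sketch: one regex covering the variants instead of chained indexOf calls.
// \s* allows zero or more spaces between the words; the i flag ignores case.
const pattern = /new\s*year/i;

const data = [
  { text: "Happy New Year everyone!" },
  { text: "#newyear resolutions" },
  { text: "just had lunch" },
];

const filtered = data.filter(d => pattern.test(d.text));
console.log(filtered.length); // 2
```

For unrelated phrases, alternation works the same way, e.g. `/new\s*year|christmas/i`.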

use perl to extract specific output lines

浪尽此生 submitted on 2019-12-12 03:16:21
Question: I'm trying to build a system that generalizes rules from input text. I'm using ReVerb to create my initial set of rules, with the following command[*], for instance:

$ echo "Bananas are an excellent source of potassium." | ./reverb -q | tr '\t' '\n' | cat -n

which generates output of the form:

 1  stdin
 2  1
 3  Bananas
 4  are an excellent source of
 5  potassium
 6  0
 7  1
 8  1
 9  6
10  6
11  7
12  0.9999999997341693
13  Bananas are an excellent source of potassium .
14  NNS VBP DT JJ NN IN NN .
15  B-NP B-VP …
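If the goal is specific fields rather than the full numbered listing, one option is to split on tabs and print just the wanted columns; judging from the listing above, the extraction triple appears to sit in columns 3-5 (arg1, relation, arg2) of ReVerb's tab-separated output. A hedged sketch:

```shell
# Sketch: print only the arg1 / relation / arg2 fields of each extraction.
# Column positions are inferred from the numbered listing above.
echo "Bananas are an excellent source of potassium." \
  | ./reverb -q \
  | awk -F'\t' '{ print $3 " | " $4 " | " $5 }'
```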