nlp

how do I use a very large (>2M) word embedding in tensorflow?

痴心易碎 submitted on 2019-12-20 19:57:10

Question: I am running a model with a very large word embedding (>2M words). When I use tf.embedding_lookup, it expects the embedding matrix, which is big, and I subsequently get an out-of-GPU-memory error. If I reduce the size of the embedding, everything works fine. Is there a way to deal with a larger embedding?

Answer 1: The recommended way is to use a partitioner to shard this large tensor across several parts:

    embedding = tf.get_variable("embedding", [1000000000, 20],
                                partitioner=tf.fixed_size_partitioner(3))
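A minimal TensorFlow 1.x-style sketch of that approach, combining the partitioner with keeping the table on the CPU (the vocabulary size, embedding dimension, and shard count below are illustrative assumptions, not values from the question):

    import tensorflow as tf  # TensorFlow 1.x API, matching the answer

    # Shard the embedding matrix so no single variable has to fit on the GPU,
    # and pin it to host memory; only the looked-up rows travel to the GPU.
    with tf.device("/cpu:0"):
        embedding = tf.get_variable(
            "embedding", shape=[2000000, 300],
            partitioner=tf.fixed_size_partitioner(num_shards=4))

    ids = tf.placeholder(tf.int32, shape=[None])
    embedded = tf.nn.embedding_lookup(embedding, ids)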

How to get n-gram collocations and association in python nltk?

你离开我真会死。 submitted on 2019-12-20 15:35:56

Question: In this documentation there are examples using nltk.collocations.BigramAssocMeasures(), BigramCollocationFinder, nltk.collocations.TrigramAssocMeasures(), and TrigramCollocationFinder. There is an example of finding the n best collocations based on PMI for bigrams and trigrams, for example:

    >>> finder = BigramCollocationFinder.from_words(
    ...     nltk.corpus.genesis.words('english-web.txt'))
    >>> finder.nbest(bigram_measures.pmi, 10)

I know that BigramCollocationFinder and TrigramCollocationFinder inherit from …
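A runnable sketch of the bigram and trigram cases from that documentation (the frequency filter of 3 is an illustrative choice, and the genesis corpus must be downloaded first):

    import nltk
    from nltk.collocations import (BigramAssocMeasures, BigramCollocationFinder,
                                   TrigramAssocMeasures, TrigramCollocationFinder)

    nltk.download('genesis')  # corpus used in the NLTK docs example
    words = nltk.corpus.genesis.words('english-web.txt')

    bigram_measures = BigramAssocMeasures()
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigram_finder.apply_freq_filter(3)  # drop pairs seen fewer than 3 times
    print(bigram_finder.nbest(bigram_measures.pmi, 10))

    trigram_measures = TrigramAssocMeasures()
    trigram_finder = TrigramCollocationFinder.from_words(words)
    trigram_finder.apply_freq_filter(3)
    print(trigram_finder.nbest(trigram_measures.pmi, 10))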

Natural Language Processing in Java (NLP) [duplicate]

爷,独闯天下 submitted on 2019-12-20 15:29:13

Question: This question already has answers here (closed 7 years ago). Possible duplicate of: "Java: Is there a good natural language processing library". Can anybody tell me about a library for NLP in Java? It would really be nice if it is properly documented too. I have tried to work with LingPipe but I am not able to understand it completely.

Answer 1: You should try Stanford NLP. It has many utilities and libraries for NLP, such as the part-of-speech tagger, all of which are great to use and easy to …

Using my own corpus for category classification in Python NLTK

主宰稳场 submitted on 2019-12-20 14:09:36

Question: I'm an NLTK/Python beginner and managed to load my own corpus using CategorizedPlaintextCorpusReader, but how do I actually train and use the data for classification of text?

    >>> from nltk.corpus.reader import CategorizedPlaintextCorpusReader
    >>> reader = CategorizedPlaintextCorpusReader('/ebs/category', r'.*\.txt', cat_pattern=r'(.*)\.txt')
    >>> len(reader.categories())
    234

Answer 1: Assuming you want a naive Bayes classifier with bag-of-words features:

    from nltk import FreqDist
    from nltk.classify …
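The answer above is cut off; a minimal self-contained sketch of a bag-of-words naive Bayes workflow over such a corpus (the 80/20 split, lower-casing, and example sentence are illustrative choices, not from the original answer):

    import random
    from nltk.corpus.reader import CategorizedPlaintextCorpusReader
    from nltk.classify import NaiveBayesClassifier
    from nltk.classify.util import accuracy

    reader = CategorizedPlaintextCorpusReader('/ebs/category', r'.*\.txt',
                                              cat_pattern=r'(.*)\.txt')

    def bag_of_words(words):
        # Simplest possible feature set: each (lower-cased) word is a boolean feature
        return {word.lower(): True for word in words}

    labeled = [(bag_of_words(reader.words(fileid)), category)
               for category in reader.categories()
               for fileid in reader.fileids(category)]
    random.shuffle(labeled)

    cutoff = int(len(labeled) * 0.8)
    train_set, test_set = labeled[:cutoff], labeled[cutoff:]

    classifier = NaiveBayesClassifier.train(train_set)
    print(accuracy(classifier, test_set))
    print(classifier.classify(bag_of_words("some new text to classify".split())))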

How to load sentences into Python gensim?

萝らか妹 submitted on 2019-12-20 12:37:24

Question: I am trying to use the word2vec module from the gensim natural language processing library in Python. The docs say to initialize the model like this:

    from gensim.models import Word2Vec
    model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)

What format does gensim expect for the input sentences? I have raw text such as

    "the quick brown fox jumps over the lazy dogs"
    "Then a cop quizzed Mick Jagger's ex-wives briefly."

and so on. What additional processing do I need before passing it into word2vec? UPDATE: Here is …
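Gensim expects an iterable in which every sentence is already a list of string tokens. A minimal sketch (the lower-casing, whitespace split, and min_count=1 are illustrative choices; size=100 matches the gensim 3.x API used in the question, while gensim 4.x renames the argument to vector_size):

    from gensim.models import Word2Vec

    raw_texts = [
        "the quick brown fox jumps over the lazy dogs",
        "Then a cop quizzed Mick Jagger's ex-wives briefly.",
    ]

    # Each sentence becomes a list of tokens; a naive whitespace split is enough to illustrate
    sentences = [text.lower().split() for text in raw_texts]

    model = Word2Vec(sentences, size=100, window=5, min_count=1, workers=4)
    print(model.wv['fox'])  # the learned 100-dimensional vector for "fox"

For real corpora, a generator that yields one tokenized sentence at a time (for example, reading a file line by line) avoids holding all the text in memory.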

Python re.split() vs nltk word_tokenize and sent_tokenize

我只是一个虾纸丫 submitted on 2019-12-20 12:36:02

Question: I was going through this question and am just wondering whether NLTK would be faster than regex at word/sentence tokenization.

Answer 1: The default nltk.word_tokenize() uses the TreebankWordTokenizer, which emulates the tokenizer of the Penn Treebank. Note that str.split() does not produce tokens in the linguistic sense, e.g.:

    >>> sent = "This is a foo, bar sentence."
    >>> sent.split()
    ['This', 'is', 'a', 'foo,', 'bar', 'sentence.']
    >>> from nltk import word_tokenize
    >>> word_tokenize …
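A small sketch comparing the two on the answer's example sentence, with a rough speed comparison (the timeit repetition count is an arbitrary choice, and word_tokenize may require nltk.download('punkt') on first use):

    import timeit
    from nltk import word_tokenize

    sent = "This is a foo, bar sentence."

    print(sent.split())
    # ['This', 'is', 'a', 'foo,', 'bar', 'sentence.']  -- punctuation stays glued to words

    print(word_tokenize(sent))
    # ['This', 'is', 'a', 'foo', ',', 'bar', 'sentence', '.']  -- punctuation split off

    # str.split() is far faster, but it is also doing far less linguistic work
    print(timeit.timeit(lambda: sent.split(), number=100000))
    print(timeit.timeit(lambda: word_tokenize(sent), number=100000))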

Parser for Wikipedia

 ̄綄美尐妖づ submitted on 2019-12-20 12:14:07

Question: I downloaded a Wikipedia dump and I want to convert the wiki format into my own object format. Is there a wiki parser available that converts the wiki markup into XML?

Answer 1: See java-wikipedia-parser. I have never used it, but according to the docs: "The parser comes with an HTML generator. You can however control the output that is being generated by passing your own implementation of the be.devijver.wikipedia.Visitor interface."

Answer 2: I do not know exactly what the XML format of the Wikipedia dump looks like. But, if …
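The answers point at Java tooling; as a hedged Python-side alternative that is not mentioned in the answers, the third-party mwparserfromhell library can parse raw wikitext into a tree that is easy to map onto your own objects (the sample markup below is made up for illustration):

    import mwparserfromhell  # pip install mwparserfromhell; an assumption, not from the answers

    wikitext = "'''Python''' is a [[programming language]] created by [[Guido van Rossum]]."
    code = mwparserfromhell.parse(wikitext)

    print(code.strip_code())        # plain text with markup removed
    print(code.filter_wikilinks())  # the [[...]] links as objects
    print(code.filter_templates())  # any {{...}} templates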

Tools for getting intent from Twitter statuses?

牧云@^-^@ submitted on 2019-12-20 11:35:27

Question: I am considering a project in which a publication's content is augmented by relevant, publicly available tweets from people in the area. But how could I programmatically find the relevant tweets? I know that generating a structure representing the meaning of natural language is pretty much the holy grail of NLP, but perhaps there is some tool I can use to at least narrow things down a bit? Alternatively, I could just use hashtags, but that requires more work on behalf of the users. I'm not super …

Determining tense of a sentence Python

这一生的挚爱 submitted on 2019-12-20 10:55:48

Question: Following several other posts [e.g. "Detect English verb tenses using NLTK", "Identifying verb tenses in python", "Python NLTK figure out tense"], I wrote the following code to determine the tense of a sentence in Python using POS tagging:

    from nltk import word_tokenize, pos_tag

    def determine_tense_input(sentence):
        text = word_tokenize(sentence)
        tagged = pos_tag(text)
        tense = {}
        tense["future"] = len([word for word in tagged if word[1] == "MD"])
        tense["present"] = len([word for word in tagged if word …
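The excerpt above is cut off; a minimal runnable sketch of the same counting approach (the exact Penn Treebank tag sets used for "present" and "past" here are a common convention and an assumption, not necessarily the original poster's):

    from nltk import word_tokenize, pos_tag

    def determine_tense_input(sentence):
        tagged = pos_tag(word_tokenize(sentence))
        return {
            "future": len([w for w, t in tagged if t == "MD"]),
            "present": len([w for w, t in tagged if t in ("VBP", "VBZ", "VBG")]),
            "past": len([w for w, t in tagged if t in ("VBD", "VBN")]),
        }

    print(determine_tense_input("I will walk to the store tomorrow."))
    # {'future': 1, 'present': 0, 'past': 0}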