nlp

how do I use a very large (>2M) word embedding in tensorflow?

痴心易碎 submitted on 2019-12-20 19:57:10

Question: I am running a model with a very large word embedding (>2M words). When I use tf.embedding_lookup, it expects the embedding matrix, which is big, and I subsequently get an out-of-GPU-memory error. If I reduce the size of the embedding, everything works fine. Is there a way to deal with a larger embedding?

Answer 1: The recommended way is to use a partitioner to shard this large tensor across several parts:

    embedding = tf.get_variable("embedding", [1000000000, 20],
                                partitioner=tf.fixed_size_partitioner(3))
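A minimal TensorFlow 1.x-style sketch of that approach, combining the partitioner with keeping the table on the CPU (the vocabulary size, embedding dimension, and shard count below are illustrative assumptions, not values from the question):

    import tensorflow as tf  # TensorFlow 1.x API, matching the answer

    # Shard the embedding matrix so no single variable has to fit on the GPU,
    # and pin it to host memory; only the looked-up rows travel to the GPU.
    with tf.device("/cpu:0"):
        embedding = tf.get_variable(
            "embedding", shape=[2000000, 300],
            partitioner=tf.fixed_size_partitioner(num_shards=4))

    ids = tf.placeholder(tf.int32, shape=[None])
    embedded = tf.nn.embedding_lookup(embedding, ids)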

How to get n-gram collocations and association in python nltk?

你离开我真会死。 submitted on 2019-12-20 15:35:56

Question: In this documentation there are examples using nltk.collocations.BigramAssocMeasures(), BigramCollocationFinder, nltk.collocations.TrigramAssocMeasures(), and TrigramCollocationFinder. There is an example of finding the n best collocations based on PMI for bigrams and trigrams, for example:

    >>> finder = BigramCollocationFinder.from_words(
    ...     nltk.corpus.genesis.words('english-web.txt'))
    >>> finder.nbest(bigram_measures.pmi, 10)

I know that BigramCollocationFinder and TrigramCollocationFinder inherit from …
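A runnable sketch of the bigram and trigram cases from that documentation (the frequency filter of 3 is an illustrative choice, and the genesis corpus must be downloaded first):

    import nltk
    from nltk.collocations import (BigramAssocMeasures, BigramCollocationFinder,
                                   TrigramAssocMeasures, TrigramCollocationFinder)

    nltk.download('genesis')  # corpus used in the NLTK docs example
    words = nltk.corpus.genesis.words('english-web.txt')

    bigram_measures = BigramAssocMeasures()
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigram_finder.apply_freq_filter(3)  # drop pairs seen fewer than 3 times
    print(bigram_finder.nbest(bigram_measures.pmi, 10))

    trigram_measures = TrigramAssocMeasures()
    trigram_finder = TrigramCollocationFinder.from_words(words)
    trigram_finder.apply_freq_filter(3)
    print(trigram_finder.nbest(trigram_measures.pmi, 10))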

Natural Language Processing in Java (NLP) [duplicate]

爷,独闯天下 submitted on 2019-12-20 15:29:13

Question: This question already has answers here (closed 7 years ago). Possible duplicate of: "Java: Is there a good natural language processing library". Can anybody tell me about a library for NLP in Java? It would really be nice if it is properly documented too. I have tried to work with LingPipe but I am not able to understand it completely.

Answer 1: You should try Stanford NLP. It has many utilities and libraries for NLP, such as the part-of-speech tagger, all of which are great to use and easy to …

Using my own corpus for category classification in Python NLTK

主宰稳场 submitted on 2019-12-20 14:09:36

Question: I'm an NLTK/Python beginner and managed to load my own corpus using CategorizedPlaintextCorpusReader, but how do I actually train and use the data for classification of text?

    >>> from nltk.corpus.reader import CategorizedPlaintextCorpusReader
    >>> reader = CategorizedPlaintextCorpusReader('/ebs/category', r'.*\.txt', cat_pattern=r'(.*)\.txt')
    >>> len(reader.categories())
    234

Answer 1: Assuming you want a naive Bayes classifier with bag-of-words features:

    from nltk import FreqDist
    from nltk.classify …
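The answer above is cut off; a minimal self-contained sketch of a bag-of-words naive Bayes workflow over such a corpus (the 80/20 split, lower-casing, and example sentence are illustrative choices, not from the original answer):

    import random
    from nltk.corpus.reader import CategorizedPlaintextCorpusReader
    from nltk.classify import NaiveBayesClassifier
    from nltk.classify.util import accuracy

    reader = CategorizedPlaintextCorpusReader('/ebs/category', r'.*\.txt',
                                              cat_pattern=r'(.*)\.txt')

    def bag_of_words(words):
        # Simplest possible feature set: each (lower-cased) word is a boolean feature
        return {word.lower(): True for word in words}

    labeled = [(bag_of_words(reader.words(fileid)), category)
               for category in reader.categories()
               for fileid in reader.fileids(category)]
    random.shuffle(labeled)

    cutoff = int(len(labeled) * 0.8)
    train_set, test_set = labeled[:cutoff], labeled[cutoff:]

    classifier = NaiveBayesClassifier.train(train_set)
    print(accuracy(classifier, test_set))
    print(classifier.classify(bag_of_words("some new text to classify".split())))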

How to load sentences into Python gensim?

萝らか妹 submitted on 2019-12-20 12:37:24

Question: I am trying to use the word2vec module from the gensim natural language processing library in Python. The docs say to initialize the model like this:

    from gensim.models import Word2Vec
    model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)

What format does gensim expect for the input sentences? I have raw text such as

    "the quick brown fox jumps over the lazy dogs"
    "Then a cop quizzed Mick Jagger's ex-wives briefly."

and so on. What additional processing do I need before passing it into word2vec? UPDATE: Here is …
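Gensim expects an iterable in which every sentence is already a list of string tokens. A minimal sketch (the lower-casing, whitespace split, and min_count=1 are illustrative choices; size=100 matches the gensim 3.x API used in the question, while gensim 4.x renames the argument to vector_size):

    from gensim.models import Word2Vec

    raw_texts = [
        "the quick brown fox jumps over the lazy dogs",
        "Then a cop quizzed Mick Jagger's ex-wives briefly.",
    ]

    # Each sentence becomes a list of tokens; a naive whitespace split is enough to illustrate
    sentences = [text.lower().split() for text in raw_texts]

    model = Word2Vec(sentences, size=100, window=5, min_count=1, workers=4)
    print(model.wv['fox'])  # the learned 100-dimensional vector for "fox"

For real corpora, a generator that yields one tokenized sentence at a time (for example, reading a file line by line) avoids holding all the text in memory.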

Python re.split() vs nltk word_tokenize and sent_tokenize

我只是一个虾纸丫 submitted on 2019-12-20 12:36:02

Question: I was going through this question and am just wondering whether NLTK would be faster than regex at word/sentence tokenization.

Answer 1: The default nltk.word_tokenize() uses the TreebankWordTokenizer, which emulates the tokenizer of the Penn Treebank. Note that str.split() does not produce tokens in the linguistic sense, e.g.:

    >>> sent = "This is a foo, bar sentence."
    >>> sent.split()
    ['This', 'is', 'a', 'foo,', 'bar', 'sentence.']
    >>> from nltk import word_tokenize
    >>> word_tokenize …
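A small sketch comparing the two on the answer's example sentence, with a rough speed comparison (the timeit repetition count is an arbitrary choice, and word_tokenize may require nltk.download('punkt') on first use):

    import timeit
    from nltk import word_tokenize

    sent = "This is a foo, bar sentence."

    print(sent.split())
    # ['This', 'is', 'a', 'foo,', 'bar', 'sentence.']  -- punctuation stays glued to words

    print(word_tokenize(sent))
    # ['This', 'is', 'a', 'foo', ',', 'bar', 'sentence', '.']  -- punctuation split off

    # str.split() is far faster, but it is also doing far less linguistic work
    print(timeit.timeit(lambda: sent.split(), number=100000))
    print(timeit.timeit(lambda: word_tokenize(sent), number=100000))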

Parser for Wikipedia

 ̄綄美尐妖づ submitted on 2019-12-20 12:14:07

Question: I downloaded a Wikipedia dump and I want to convert the wiki format into my own object format. Is there a wiki parser available that converts the wiki markup into XML?

Answer 1: See java-wikipedia-parser. I have never used it, but according to the docs: "The parser comes with an HTML generator. You can however control the output that is being generated by passing your own implementation of the be.devijver.wikipedia.Visitor interface."

Answer 2: I do not know exactly what the XML format of the Wikipedia dump looks like. But, if …
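The answers point at Java tooling; as a hedged Python-side alternative that is not mentioned in the answers, the third-party mwparserfromhell library can parse raw wikitext into a tree that is easy to map onto your own objects (the sample markup below is made up for illustration):

    import mwparserfromhell  # pip install mwparserfromhell; an assumption, not from the answers

    wikitext = "'''Python''' is a [[programming language]] created by [[Guido van Rossum]]."
    code = mwparserfromhell.parse(wikitext)

    print(code.strip_code())        # plain text with markup removed
    print(code.filter_wikilinks())  # the [[...]] links as objects
    print(code.filter_templates())  # any {{...}} templates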

Tools for getting intent from Twitter statuses?

牧云@^-^@ submitted on 2019-12-20 11:35:27

Question: I am considering a project in which a publication's content is augmented by relevant, publicly available tweets from people in the area. But how could I programmatically find the relevant tweets? I know that generating a structure representing the meaning of natural language is pretty much the holy grail of NLP, but perhaps there is some tool I can use to at least narrow things down a bit? Alternatively, I could just use hashtags, but that requires more work on behalf of the users. I'm not super …

Determining tense of a sentence Python

这一生的挚爱 submitted on 2019-12-20 10:55:48

Question: Following several other posts [e.g. "Detect English verb tenses using NLTK", "Identifying verb tenses in python", "Python NLTK figure out tense"], I wrote the following code to determine the tense of a sentence in Python using POS tagging:

    from nltk import word_tokenize, pos_tag

    def determine_tense_input(sentence):
        text = word_tokenize(sentence)
        tagged = pos_tag(text)
        tense = {}
        tense["future"] = len([word for word in tagged if word[1] == "MD"])
        tense["present"] = len([word for word in tagged if word …
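The excerpt above is cut off; a minimal runnable sketch of the same counting approach (the exact Penn Treebank tag sets used for "present" and "past" here are a common convention and an assumption, not necessarily the original poster's):

    from nltk import word_tokenize, pos_tag

    def determine_tense_input(sentence):
        tagged = pos_tag(word_tokenize(sentence))
        return {
            "future": len([w for w, t in tagged if t == "MD"]),
            "present": len([w for w, t in tagged if t in ("VBP", "VBZ", "VBG")]),
            "past": len([w for w, t in tagged if t in ("VBD", "VBN")]),
        }

    print(determine_tense_input("I will walk to the store tomorrow."))
    # {'future': 1, 'present': 0, 'past': 0}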