nlp

NLTK Stanford POS tagger error: Java command failed

北战南征 submitted on 2019-12-30 04:22:08
Question: I'm trying to use the nltk.tag.stanford module to tag a sentence (like the wiki's example), but I keep getting the following error:
Traceback (most recent call last):
  File "test.py", line 28, in <module>
    print st.tag(word_tokenize('What is the airspeed of an unladen swallow ?'))
  File "/usr/local/lib/python2.7/dist-packages/nltk/tag/stanford.py", line 59, in tag
    return self.tag_sents([tokens])[0]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tag/stanford.py", line 81, in tag_sents
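This error usually means NLTK could not run the Java subprocess at all: Java is not found, the jar/model paths are wrong, or the JVM heap is too small. Below is a minimal sketch of the usual setup, assuming a local Stanford POS Tagger download; the JAVAHOME value, the jar/model paths, and the heap size passed via java_options are placeholders, not the asker's actual configuration.

import os
from nltk import word_tokenize
from nltk.tag.stanford import StanfordPOSTagger

# Point NLTK at the Java runtime explicitly (this path is an assumption).
os.environ['JAVAHOME'] = '/usr/lib/jvm/java-8-openjdk-amd64'

# Model and jar paths below are placeholders for a local Stanford tagger install.
st = StanfordPOSTagger(
    '/path/to/stanford-postagger/models/english-bidirectional-distsim.tagger',
    '/path/to/stanford-postagger/stanford-postagger.jar',
    java_options='-mx2048m')  # a too-small heap is a common cause of "Java command failed"

print(st.tag(word_tokenize('What is the airspeed of an unladen swallow ?')))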

A Viable Solution for Word Splitting Khmer?

狂风中的少年 submitted on 2019-12-30 03:55:13
Question: I am working on a solution to split long lines of Khmer (the Cambodian language) into individual words (in UTF-8). Khmer does not use spaces between words. There are a few solutions out there, but they are far from adequate (here and here), and those projects have fallen by the wayside. Here is a sample line of Khmer that needs to be split (they can be longer than this): ចូរសរសើរដល់ទ្រង់ដែលទ្រង់បានប្រទានការទាំងអស់នោះមកដល់រូបអ្នកដោយព្រោះអង្គព្រះយេស៊ូវ
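For context, one common baseline for unspaced scripts is dictionary-driven maximal matching. The sketch below assumes a Khmer word list is available; the lexicon is just a Python set loaded from an unspecified source, and real systems usually add statistics on top of this.

def longest_match_segment(text, lexicon, max_word_len=20):
    """Greedy longest-match segmentation over an unspaced string."""
    words = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate substring first.
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in lexicon:
                match = text[i:j]
                break
        if match is None:
            match = text[i]  # unknown character: emit it on its own
        words.append(match)
        i += len(match)
    return words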

Is it possible to keep spaCy in memory to reduce the load time? [closed]

若如初见. submitted on 2019-12-30 03:27:13
Question: I want to use spaCy for NLP in an online service. Each time a user makes a request I call the script "my_script.py", which starts with:
from spacy.en import English
nlp = English()
The problem I'm having is that those two lines take over 10 seconds; is it possible to keep English() in RAM or
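The usual answer is to load the model once in a long-running process and reuse it across requests, rather than re-importing it on every invocation. A minimal sketch, keeping the spaCy 1.x import from the question and reading one text per line from stdin; a real service would put the same idea behind a web framework or a job queue.

import sys
from spacy.en import English  # spaCy 1.x import, as in the question

nlp = English()  # the ~10 s load is paid once, at process start-up

for line in sys.stdin:
    text = line.strip()
    # Decode to unicode under Python 2; Python 3 strings pass through unchanged.
    doc = nlp(text.decode('utf-8') if isinstance(text, bytes) else text)
    print(' '.join(token.pos_ for token in doc))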

What is the connection or difference between lemma and synset in WordNet?

限于喜欢 submitted on 2019-12-30 02:12:13
Question: I am a complete beginner to NLP and NLTK. I was not able to understand the exact difference between lemmas and synsets in WordNet, because both produce nearly the same output. For example, for the word cake it produces this output:
lemmas: [Lemma('cake.n.01.cake'), Lemma('patty.n.01.cake'), Lemma('cake.n.03.cake'), Lemma('coat.v.03.cake')]
synsets: [Synset('cake.n.01'), Synset('patty.n.01'), Synset('cake.n.03'), Synset('coat.v.03')]
Please help me to understand this concept. Thank you.
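A short NLTK session makes the relationship concrete: a Synset is one sense (a group of synonymous words), while a Lemma is a single word form inside that sense, so wn.lemmas('cake') returns one Lemma per synset that contains the form 'cake'. A minimal sketch using NLTK's WordNet interface:

from nltk.corpus import wordnet as wn

# Each synset (sense) groups several lemmas; 'cake' is just one member of e.g. patty.n.01.
for syn in wn.synsets('cake'):
    print(syn.name(), '->', [lem.name() for lem in syn.lemmas()])

# wn.lemmas('cake') returns only the entries whose surface form is 'cake',
# one for each synset that form belongs to.
print(wn.lemmas('cake'))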

Methods for Geotagging or Geolabelling Text Content

℡╲_俬逩灬. submitted on 2019-12-30 00:38:07
Question: What are some good algorithms for automatically labeling text with the city/region of origin? That is, if a blog is about New York, how can I tell programmatically? Are there packages/papers that claim to do this with any degree of certainty? I have looked at some tf-idf based approaches and proper-noun intersections, but so far no spectacular successes, and I'd appreciate ideas! The more general question is about assigning texts to topics, given some list of topics. Simple/naive approaches
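One naive baseline worth stating explicitly is to run a named-entity recognizer and vote over the place names it finds. A minimal sketch with spaCy's NER; the en_core_web_sm model name and the reliance on the GPE label are assumptions, and a real system would add a gazetteer and disambiguation on top.

from collections import Counter
import spacy

nlp = spacy.load('en_core_web_sm')  # model name is an assumption

def guess_location(text):
    """Return the most frequently mentioned geopolitical entity, if any."""
    doc = nlp(text)
    places = Counter(ent.text for ent in doc.ents if ent.label_ == 'GPE')
    return places.most_common(1)[0][0] if places else None

print(guess_location('The L train was down again, so half of Brooklyn walked to Manhattan.'))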

SpaCy: how to load Google News word2vec vectors?

落花浮王杯 submitted on 2019-12-30 00:07:04
Question: I've tried several methods of loading the Google News word2vec vectors (https://code.google.com/archive/p/word2vec/):
en_nlp = spacy.load('en', vector=False)
en_nlp.vocab.load_vectors_from_bin_loc('GoogleNews-vectors-negative300.bin')
The above gives: MemoryError: Error assigning 18446744072820359357 bytes. I've also tried with the .gz packed vectors, or by loading and saving them with gensim to a new format:
from gensim.models.word2vec import Word2Vec
model = Word2Vec.load_word2vec_format(
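For context, the route that usually works is to let gensim read the binary file and then copy the vectors into spaCy's vocab. The sketch below is an outline, not the asker's fix: set_vector is the spaCy 2.x API, index2word is the pre-gensim-4 attribute (index_to_key in gensim 4), the model name is an assumption, and loading all three million vectors takes several GB of RAM.

import spacy
from gensim.models import KeyedVectors

# Let gensim handle the binary word2vec format.
vecs = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

nlp = spacy.load('en_core_web_sm')   # model name is an assumption
for word in vecs.index2word:         # index_to_key in gensim >= 4
    nlp.vocab.set_vector(word, vecs[word])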

Efficient Context-Free Grammar parser, preferably Python-friendly

浪子不回头ぞ submitted on 2019-12-29 14:21:59
Question: I need to parse a small subset of English for one of my projects, described as a context-free grammar with (1-level) feature structures (example), and I need to do it efficiently. Right now I'm using NLTK's parser, which produces the right output but is very slow. For my grammar of ~450 fairly ambiguous non-lexicon rules and half a million lexical entries, parsing simple sentences can take anywhere from 2 to 30 seconds, depending, it seems, on the number of resulting trees. Lexical
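For reference, the NLTK baseline being described looks roughly like the sketch below; the toy grammar stands in for the asker's ~450 rules and half-million lexical entries, and it is the chart parser's behaviour at that scale that is the problem.

from nltk.grammar import FeatureGrammar
from nltk.parse import FeatureChartParser

# Tiny illustrative feature grammar with 1-level feature structures (NUM agreement).
grammar = FeatureGrammar.fromstring("""
% start S
S -> NP[NUM=?n] VP[NUM=?n]
NP[NUM=?n] -> Det[NUM=?n] N[NUM=?n]
VP[NUM=?n] -> V[NUM=?n]
Det[NUM=sg] -> 'a'
N[NUM=sg] -> 'dog'
V[NUM=sg] -> 'barks'
""")

parser = FeatureChartParser(grammar)
for tree in parser.parse('a dog barks'.split()):
    print(tree)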

Feature Selection and Reduction for Text Classification

匆匆过客 submitted on 2019-12-29 10:06:31
Question: I am currently working on a project, a simple sentiment analyzer, with 2 and 3 classes in separate cases. I am using a corpus that is pretty rich in terms of unique words (around 200,000). I used the bag-of-words method for feature selection, and to reduce the number of unique features, terms are eliminated based on a frequency-of-occurrence threshold. The final set of features includes around 20,000 features, which is actually a 90% decrease, but not enough for
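Beyond a frequency threshold, a standard next step is to score each bag-of-words feature against the class labels and keep only the top-scoring ones, for example with a chi-squared test. A minimal sketch with scikit-learn; the toy corpus, labels, and k value are placeholders.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

texts = ['great movie', 'terrible plot', 'loved it', 'waste of time']  # placeholder corpus
labels = [1, 0, 1, 0]                                                  # placeholder classes

X = CountVectorizer().fit_transform(texts)
# Keep the k features with the highest chi-squared score; k=2 only because the toy corpus is tiny.
X_reduced = SelectKBest(chi2, k=2).fit_transform(X, labels)
print(X_reduced.shape)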

PTB treebank from CoNLL-X

ⅰ亾dé卋堺 submitted on 2019-12-29 08:02:09
Question: I have a CoNLL-X format treebank and the corresponding binary parse tree for each sentence, and I want to convert it into PTB format. Are there any converters, or can anyone shed light on the PTB format? Answer 1: There have been a number of efforts to convert from dependencies (representable in CoNLL-X format) to constituents (representable in Penn Treebank, or PTB, format). Two recent papers and their code: Transforming Dependencies into Phrase Structures (Kong, Rush, and Smith, NAACL 2015). Code.
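Whatever converter is used, the first step it needs is reading the CoNLL-X columns back into per-sentence dependency records. The sketch below covers only that step (field indices follow the CoNLL-X column order: ID, FORM, ..., HEAD, DEPREL); it is not a dependency-to-constituency converter itself.

import io

def read_conllx(path):
    """Yield one sentence at a time as a list of token dicts."""
    sentence = []
    with io.open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:            # blank line ends a sentence
                if sentence:
                    yield sentence
                    sentence = []
                continue
            cols = line.split('\t')
            sentence.append({'id': int(cols[0]), 'form': cols[1],
                             'head': int(cols[6]), 'deprel': cols[7]})
    if sentence:
        yield sentence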