nlp

Is it possible to use spacy with already tokenized input?

落爺英雄遲暮 submitted on 2019-12-22 05:16:58
Question: I have a sentence that has already been tokenized into words, and I want the part-of-speech tag for each word. The spaCy documentation starts from the raw sentence, which I want to avoid because in that case spaCy might end up with a different tokenization. So is it possible to use spaCy with a list of words rather than a string? Here is an example of my question: # I know that it does the following
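For reference, a minimal sketch of the usual approach: spaCy's Doc can be constructed directly from a word list, bypassing the tokenizer, and the remaining pipeline components can then be applied to it. This assumes the en_core_web_sm model is installed; the word list is illustrative.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")

words = ["I", "know", "that", "spaCy", "handles", "pretokenized", "input", "."]
# Build a Doc directly from the word list, skipping spaCy's own tokenizer.
doc = Doc(nlp.vocab, words=words)

# Run the remaining pipeline components (tagger, parser, ...) on the Doc.
for name, proc in nlp.pipeline:
    doc = proc(doc)

for token in doc:
    print(token.text, token.pos_, token.tag_)
```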

What's the difference between WordNet 3.1 and WordNet 3.0?

送分小仙女□ submitted on 2019-12-22 04:51:58
Question: There doesn't seem to be a changelog or anything of that sort available at wordnet.princeton.edu. Answer 1: To add to @abarisone's answer, the synset IDs themselves can differ between WordNet 3.0 and WordNet 3.1 :( For example, in WordNet 3.1 a chair is 103005231-n, whereas in WordNet 3.0 it was 103001627-n. You cannot look that up at http://wordnet-rdf.princeton.edu/wn31/103001627-n or http://wordnet-rdf.princeton.edu/wn30/103001627-n; instead you need to use http://wordnet-rdf
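A small sketch of how such an ID can be resolved with NLTK, which bundles WordNet 3.0, so the 3.0 offset from the answer should resolve directly (synset_from_pos_and_offset requires a reasonably recent NLTK; the leading "1" in the ID is the noun POS prefix, the rest is the byte offset):

```python
from nltk.corpus import wordnet as wn  # NLTK ships WordNet 3.0 data

# 103001627-n -> noun synset at WordNet 3.0 offset 03001627.
chair = wn.synset_from_pos_and_offset('n', 3001627)
print(chair)            # e.g. Synset('chair.n.01')
print(chair.offset())   # 3001627
```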

“ImportError: cannot import name StanfordNERTagger” in NLTK

﹥>﹥吖頭↗ submitted on 2019-12-22 04:29:26
Question: I'm unable to import the Stanford NER tagger in NLTK. This is what I have done: downloaded the Java code from here and added an environment variable STANFORD_MODELS with the path to the folder where the Java code is stored. That should be sufficient according to the information provided on the NLTK site, which says: "Tagger models need to be downloaded from http://nlp.stanford.edu/software and the STANFORD_MODELS environment variable set (a colon-separated list of paths)." Would anybody
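For what it's worth, a sketch of an import that works in recent NLTK versions, where the class lives in nltk.tag.stanford and the jar is located via CLASSPATH while the model is found via STANFORD_MODELS. The paths below are hypothetical placeholders; adjust them to wherever the Stanford NER release was unpacked.

```python
import os
from nltk.tag.stanford import StanfordNERTagger  # location in recent NLTK versions

# Hypothetical paths -- point these at your own Stanford NER download.
os.environ['CLASSPATH'] = '/opt/stanford-ner/stanford-ner.jar'
os.environ['STANFORD_MODELS'] = '/opt/stanford-ner/classifiers'

tagger = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
print(tagger.tag('Barack Obama was born in Hawaii'.split()))
```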

How to print the parse tree of Stanford JavaNLP

谁说我不能喝 submitted on 2019-12-22 01:44:41
Question: I am trying to get all the noun phrases using the edu.stanford.nlp.* package. I got all the subtrees with label value "NP", but I cannot get back the plain original string (as opposed to the Penn Treebank bracket format). E.g., subtree.toString() gives (NP (ND all) (NSS times)), but I want the string "all times". Can anyone please help me? Thanks in advance. Answer 1: I believe what you want is something like: final StringBuilder sb = new StringBuilder(); for ( final Tree t : tree.getLeaves() ) { sb.append(t
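Since the other examples on this page are in Python, here is an equivalent sketch using nltk.Tree rather than the asker's Java API; the idea is the same as the truncated answer above: the surface string is just the subtree's leaves joined with spaces. The bracket strings are illustrative.

```python
from nltk import Tree

# Parse the Penn-bracket string that subtree.toString() produced.
subtree = Tree.fromstring("(NP (ND all) (NSS times))")

# The surface string is just the leaves joined with spaces.
print(" ".join(subtree.leaves()))   # -> "all times"

# The same idea recovers every noun phrase from a full parse tree:
parse = Tree.fromstring("(S (NP (DT the) (NN dog)) (VP (VBZ barks)))")
for np in parse.subtrees(lambda t: t.label() == "NP"):
    print(" ".join(np.leaves()))    # -> "the dog"
```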

Working with large text files in R to create n-grams

走远了吗. submitted on 2019-12-22 01:21:46
Question: I am trying to create trigrams and bigrams from a large (1 GB) text file using the 'quanteda' package in the R programming environment. If I try to run my code in one go (as below), R just hangs on the third line (myCorpus <- toLower(...)). I used the code successfully on a small dataset (<1 MB), so I guess the file is too large. I can see I perhaps need to load the text in 'chunks' and combine the resulting bigram and trigram frequencies afterwards, but I cannot work out how to load and
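As a language-agnostic illustration of the chunk-and-combine idea (a plain Python sketch, not quanteda: it streams the file line by line and merges the counts with Counter, so the whole corpus never sits in memory at once; the file name is hypothetical):

```python
from collections import Counter

def ngrams(tokens, n):
    # Yield n-grams as tuples, e.g. ("as", "i") for a bigram.
    return zip(*(tokens[i:] for i in range(n)))

bigrams, trigrams = Counter(), Counter()
with open("big_corpus.txt", encoding="utf-8") as fh:  # hypothetical file
    for line in fh:                 # one "chunk" at a time
        tokens = line.lower().split()
        bigrams.update(ngrams(tokens, 2))
        trigrams.update(ngrams(tokens, 3))

print(bigrams.most_common(10))
print(trigrams.most_common(10))
```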

How to translate words in the NLTK swadesh corpus regardless of case - python

偶尔善良 submitted on 2019-12-21 23:02:14
Question: I'm new to Python and natural language processing, and I'm trying to learn from the NLTK book. I'm doing the exercises at the end of Chapter 2, and there is a question I'm stuck on: "In the discussion of comparative wordlists, we created an object called translate which you could look up using words in both German and Italian in order to get corresponding words in English. What problem might arise with this approach? Can you suggest a way to avoid this problem?" The book had me use the
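A sketch of one common fix for the case issue: normalize case when building the lookup table and again at lookup time, so 'Hund', 'hund', and 'HUND' all resolve to the same entry (assumes the NLTK swadesh corpus has been downloaded; 'Hund' is just an illustrative German entry):

```python
from nltk.corpus import swadesh  # requires nltk.download('swadesh')

de2en = swadesh.entries(['de', 'en'])   # German  -> English pairs
it2en = swadesh.entries(['it', 'en'])   # Italian -> English pairs

# Lowercase the keys when building the dict, and again on lookup.
translate = {de.lower(): en for de, en in de2en}
translate.update({it.lower(): en for it, en in it2en})

def lookup(word):
    return translate.get(word.lower())

print(lookup('Hund'))   # -> 'dog'
```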

How to use Mallet for NER [closed]

南笙酒味 submitted on 2019-12-21 22:40:38
Question: [Closed 6 years ago as ambiguous, vague, incomplete, or overly broad.] I'm new to NLP and have been asked to perform named entity recognition (NER) using Mallet. I have a text, and I provide a feature vector for each word in it. I would like to train a model which later on I
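Mallet's SimpleTagger expects training data as one token per line, features first and the label last, with a blank line between sequences. Below is a hedged sketch that writes such a file; the features, labels, and file names are all illustrative, and the command in the trailing comment follows Mallet's sequence-tagging tutorial.

```python
# Each line: "feature feature ... LABEL"; blank lines separate sequences.
sentences = [
    [(["CAP", "word=bill"], "PERSON"),
     (["word=lives"], "O"),
     (["word=in"], "O"),
     (["CAP", "word=boston"], "LOCATION")],
]

with open("train.txt", "w") as out:
    for sent in sentences:
        for feats, label in sent:
            out.write(" ".join(feats + [label]) + "\n")
        out.write("\n")  # blank line ends the sequence

# Then train (per Mallet's sequence-tagging tutorial; paths illustrative):
#   java -cp "mallet.jar:mallet-deps.jar" cc.mallet.fst.SimpleTagger \
#        --train true --model-file ner.model train.txt
```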

Fuzzy sentence search algorithms

牧云@^-^@ submitted on 2019-12-21 20:15:14
Question: Suppose I have a set of about 10,000 phrases, each 7-20 words long on average, in which I want to find a given phrase. The phrase I am looking for could contain errors: it might be missing one or two words, have some words misplaced, or include some random words. For example, my database contains "As I was riding my red bike, I saw Christine", and I want it to match "As I was riding my blue bike, saw Christine" or "I was riding my bike, I saw Christine and Marion". What could be some good
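One simple baseline worth sketching: compare token sequences rather than raw strings, for example with Python's standard-library difflib, which tolerates missing, misplaced, and extra words. The phrase set below is a tiny stand-in for the real database; at 10,000 phrases a linear scan like this is still cheap.

```python
import difflib

# A tiny stand-in for the ~10,000-phrase database.
phrases = [
    "As I was riding my red bike, I saw Christine",
    "The weather was terrible on Sunday",
]

def similarity(query, candidate):
    # Comparing token lists (not characters) is forgiving about
    # dropped, misplaced, or inserted words.
    matcher = difflib.SequenceMatcher(
        None, query.lower().split(), candidate.lower().split())
    return matcher.ratio()

query = "As I was riding my blue bike, saw Christine"
best = max(phrases, key=lambda p: similarity(query, p))
print(best, round(similarity(query, best), 2))
```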

Figuring out where to add punctuation in bad user-generated content?

独自空忆成欢 submitted on 2019-12-21 20:05:48
Question: Is there a way to use NLP or an existing library to add missing punctuation to bad user-generated content? For example, this string: Today is Tuesday I went to work on Monday Friday was off would become: Today is Tuesday. I went to work on Monday. Friday was off. Answer 1: I think this problem falls under sentence boundary disambiguation (http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation). I have used the OpenNLP variant and was satisfied with the results. Answer 2: I've played briefly with this

Stanford Universal Dependencies on Python NLTK

流过昼夜 submitted on 2019-12-21 19:48:36
Question: Is there any way I can get the Universal Dependencies using Python or NLTK? I can only produce the parse tree. Example: Input sentence: My dog also likes eating sausage. Output: Universal dependencies nmod:poss(dog-2, My-1) nsubj(likes-4, dog-2) advmod(likes-4, also-3) root(ROOT-0, likes-4) xcomp(likes-4, eating-5) dobj(eating-5, sausage-6) Answer 1: Wordseer's stanford-corenlp-python fork is a good start as it works with the recent CoreNLP release (3.5.2). However it will give you raw output,
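For readers on current NLTK, a sketch using nltk.parse.corenlp, an API that postdates the wordseer fork mentioned in the answer. It assumes a CoreNLP server is already running locally on port 9000; the start command in the comment is the standard one from the CoreNLP distribution.

```python
# Start a CoreNLP server first, from the CoreNLP distribution directory:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
from nltk.parse.corenlp import CoreNLPDependencyParser

parser = CoreNLPDependencyParser(url='http://localhost:9000')
parse, = parser.raw_parse('My dog also likes eating sausage.')

# Each triple is ((governor, tag), relation, (dependent, tag)).
for governor, relation, dependent in parse.triples():
    print(relation, governor, dependent)
```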