nlp

Classification using movie review corpus in NLTK/Python

Submitted by 大憨熊 on 2019-12-17 02:37:25
Question: I'm looking to do some classification in the vein of NLTK Chapter 6. The book seems to skip a step in creating the categories, and I'm not sure what I'm doing wrong. I have my script here with the response following. My issues primarily stem from the first part: category creation based upon directory names. Some other questions here have used filenames (e.g. pos_1.txt and neg_1.txt), but I would prefer to create directories I could dump files into. from nltk.corpus import movie_reviews
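
A minimal sketch of the directory-based category setup, assuming a corpus root ./reviews with pos/ and neg/ subdirectories (the paths and names are illustrative, not from the original question):

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# The first path component of each file id becomes its category, so files
# dumped into reviews/pos/ and reviews/neg/ are labeled automatically.
reader = CategorizedPlaintextCorpusReader(
    'reviews',                # corpus root (assumed layout)
    r'.*\.txt',               # every .txt file in any subdirectory
    cat_pattern=r'(\w+)/.*')  # capture the directory name as the category

print(reader.categories())        # e.g. ['neg', 'pos']
print(reader.fileids('pos')[:3])  # files under reviews/pos/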

Code Golf: Number to Words

Submitted by 早过忘川 on 2019-12-17 01:39:34
Question: The code golf series seems to be fairly popular. I ran across some code that converts a number to its word representation. Some examples (powers of 2, for programming fun): 2 -> Two, 1024 -> One Thousand Twenty Four, 1048576 -> One Million Forty Eight Thousand Five Hundred Seventy Six. The algorithm my co-worker
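
A minimal, non-golfed Python sketch of the conversion, matching the spelling in the examples above (handles 0 through 999,999,999; no hyphens or "and", as in the sample output):

ONES = ['', 'One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven', 'Eight',
        'Nine', 'Ten', 'Eleven', 'Twelve', 'Thirteen', 'Fourteen', 'Fifteen',
        'Sixteen', 'Seventeen', 'Eighteen', 'Nineteen']
TENS = ['', '', 'Twenty', 'Thirty', 'Forty', 'Fifty', 'Sixty', 'Seventy',
        'Eighty', 'Ninety']

def three_digits(n):
    # Spell out 0 <= n < 1000 as a list of words.
    words = []
    if n >= 100:
        words += [ONES[n // 100], 'Hundred']
        n %= 100
    if n >= 20:
        words.append(TENS[n // 10])
        n %= 10
    if n:
        words.append(ONES[n])
    return words

def to_words(n):
    if n == 0:
        return 'Zero'
    words = []
    for group, name in ((n // 1000000 % 1000, 'Million'),
                        (n // 1000 % 1000, 'Thousand'),
                        (n % 1000, '')):
        if group:
            words += three_digits(group) + ([name] if name else [])
    return ' '.join(words)

print(to_words(1048576))  # One Million Forty Eight Thousand Five Hundred Seventy Six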

Stemmers vs Lemmatizers

Submitted by 大城市里の小女人 on 2019-12-17 01:39:12
Question: Natural Language Processing (NLP), especially for English, has reached a stage where stemming would become an archaic technology if "perfect" lemmatizers existed, because stemmers change the surface form of a word/token into meaningless stems. Then again, the definition of a "perfect" lemmatizer is questionable, because different NLP tasks require different levels of lemmatization, e.g. converting words between verb/noun/adjective forms. Stemmers [in]: having [out]: hav
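
The contrast is easy to reproduce in NLTK; a quick sketch (the Lancaster stemmer is one stemmer that yields the meaningless stem above, and the WordNet lemmatizer needs nltk.download('wordnet')):

from nltk.stem import LancasterStemmer, WordNetLemmatizer

print(LancasterStemmer().stem('having'))             # hav  (meaningless stem)
print(WordNetLemmatizer().lemmatize('having', 'v'))  # have (pos 'v' = verb)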

NLTK Lemmatizer, Extract meaningful words

Submitted by 帅比萌擦擦* on 2019-12-14 04:04:05
Question: I am going to create machine-learning-based code that automatically maps categories, and I am going to do natural language processing before that. There are several word lists. sent = 'The laughs you two heard were triggered by memories of his own high j-flying moist moisture moisturize moisturizing '.lower().split() I made the following code, referencing this URL: NLTK: lemmatizer and pos_tag. from nltk.tag import pos_tag from nltk.tokenize import word_tokenize from nltk.stem import
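
A minimal sketch of the pos_tag-then-lemmatize pipeline the question builds on, assuming the 'averaged_perceptron_tagger' and 'wordnet' NLTK data packages are installed:

from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def penn_to_wordnet(tag):
    # Map a Penn Treebank tag to the part of speech WordNet expects.
    return {'J': wordnet.ADJ, 'V': wordnet.VERB,
            'R': wordnet.ADV}.get(tag[0], wordnet.NOUN)

lemmatizer = WordNetLemmatizer()
sent = 'The laughs you two heard were triggered by memories'.lower().split()
print([lemmatizer.lemmatize(w, penn_to_wordnet(t)) for w, t in pos_tag(sent)])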

Getting coreferences with the Stanford CoreNLP package

Submitted by 馋奶兔 on 2019-12-14 03:48:53
Question: I'm trying to get coreferences in a text. I'm new to the CoreNLP package. I tried the code below, which doesn't work, but I'm open to other methods as well. /* * To change this template, choose Tools | Templates * and open the template in the editor. */ package corenlp; import edu.stanford.nlp.ling.CoreAnnotations.CollapsedCCProcessedDependenciesAnnotation; import edu.stanford.nlp.ling.CoreAnnotations.CorefGraphAnnotation; import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
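
One route from Python is the stanza client for a local CoreNLP server; a hedged sketch (assumes CoreNLP is installed and CORENLP_HOME points at it; the corefChain/mention field names come from CoreNLP's protobuf document and should be checked against your version):

from stanza.server import CoreNLPClient

text = 'Barack Obama was born in Hawaii. He was the 44th president.'
annotators = ['tokenize', 'ssplit', 'pos', 'lemma', 'ner', 'parse', 'coref']

with CoreNLPClient(annotators=annotators, memory='4G', be_quiet=True) as client:
    ann = client.annotate(text)    # a protobuf Document
    for chain in ann.corefChain:   # one chain per coreferring entity
        # Each mention records which sentence and token span it covers.
        print([(m.sentenceIndex, m.beginIndex, m.endIndex)
               for m in chain.mention])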

Mapping words to numbers with respect to definition

Submitted by 江枫思渺然 on 2019-12-14 03:48:49
Question: As part of a larger project, I need to read in text and represent each word as a number. For example, if the program reads in "Every good boy deserves fruit", then I would get a table that converts 'every' to '1742', 'good' to '977513', etc. Now, obviously I can just use a hashing algorithm to get these numbers. However, it would be more useful if words with similar meanings had numerical values close to each other, so that 'good' becomes '6827' and 'great' becomes '6835',
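
What the question describes is essentially a word embedding; a toy sketch with gensim's Word2Vec (gensim >= 4 assumed; real use would train on a large corpus or load pretrained vectors, and "distance" is between vectors rather than between single integers):

from gensim.models import Word2Vec

sentences = [['every', 'good', 'boy', 'deserves', 'fruit'],
             ['every', 'great', 'boy', 'deserves', 'fruit']]

model = Word2Vec(sentences, vector_size=16, min_count=1, seed=0)
print(model.wv['good'])                      # a 16-dim vector, not one int
print(model.wv.similarity('good', 'great'))  # cosine similarity in [-1, 1]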

Train spaCy's existing POS tagger with my own training examples

Submitted by 烂漫一生 on 2019-12-14 03:44:25
Question: I am trying to train the existing POS tagger on my own lexicon, not starting from scratch (I do not want to create an "empty model"). spaCy's documentation says "Load the model you want to start with", and the next step is "Add the tag map to the tagger using the add_label method". However, when I try to load the English small model and add the tag map, it throws this error: ValueError: [T003] Resizing pre-trained Tagger models is not currently supported. I was wondering how it can be
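
A hedged sketch of updating the pretrained tagger under the spaCy v2 API the error message comes from; since T003 means new labels cannot be added to a pretrained tagger, the training examples here reuse tags the English model already knows:

import random
import spacy

nlp = spacy.load('en_core_web_sm')
TRAIN_DATA = [('I like green eggs', {'tags': ['PRP', 'VBP', 'JJ', 'NNS']})]

# Update only the tagger; keep the other pipeline components frozen.
other_pipes = [p for p in nlp.pipe_names if p != 'tagger']
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.resume_training()  # continue from pretrained weights
    for _ in range(10):
        random.shuffle(TRAIN_DATA)
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer)

print([(t.text, t.tag_) for t in nlp('I like green eggs')])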

How do I use non-integer string labels with SVM from scikit-learn? Python

Submitted by 风流意气都作罢 on 2019-12-14 03:42:16
Question: Scikit-learn has fairly user-friendly Python modules for machine learning. I am trying to train an SVM tagger for natural language processing (NLP), where my labels and input data are words and annotations, e.g. part-of-speech tagging. Rather than using double/integer data as input tuples [[1,2], [2,0]], my tuples will look like this: [['word','NOUN'], ['young', 'adjective']]. Can anyone give an example of how I can use the SVM with string tuples? The tutorial/documentation given here is for
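
A minimal sketch of the usual answer: scikit-learn accepts string labels directly, but features must be numeric, so string features go through a vectorizer first (the feature names here are illustrative):

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

X_raw = [{'word': 'word'}, {'word': 'young'}]  # string features as dicts
y = ['NOUN', 'adjective']                      # string labels work as-is

vec = DictVectorizer()
X = vec.fit_transform(X_raw)                   # one-hot numeric matrix

clf = LinearSVC().fit(X, y)
print(clf.predict(vec.transform([{'word': 'young'}])))  # ['adjective']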

Weka - Classifier returns the same distribution for any input

Submitted by 情到浓时终转凉″ on 2019-12-14 03:30:04
Question: I'm trying to build a naive Bayes classifier for classifying text between two classes. Everything works great in the GUI explorer, but when I try to recreate it in code, I get the same output no matter what input I try to classify. Within the code, I get the same evaluation metrics I get within the GUI (81% accuracy), but whenever I try to create a new instance and classify it, I get the same distribution for both classes no matter what input I use. Below is my code; it's in Scala, but is
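
A frequent cause of this symptom is building new instances against a different dictionary than the one the classifier was trained on; a sketch of the pitfall in scikit-learn terms (the Weka analogue is applying the same trained StringToWordVector filter to new instances):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ['good movie great fun', 'terrible boring awful film']
vec = CountVectorizer().fit(train_texts)  # dictionary is fixed here
clf = MultinomialNB().fit(vec.transform(train_texts), ['pos', 'neg'])

# New input must be transformed with the *fitted* vectorizer; rebuilding a
# fresh dictionary misaligns the features, and every input then maps to the
# same (mostly empty) vector and hence the same class distribution.
print(clf.predict_proba(vec.transform(['great fun'])))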

How to extract quotations from text using NLTK [duplicate]

Submitted by 怎甘沉沦 on 2019-12-14 03:29:31
Question: This question already has answers here: RegEx: Grabbing values between quotation marks (19 answers). Closed 3 years ago. I have a project wherein I need to extract quotations from a huge set of articles. Here, by quotations I mean things said by people, e.g.: Alen said "text to be extracted." I'm using NLTK for my other NLP-related tasks, so any solution using NLTK or any kind of Python library would be quite useful. Thanks. Answer 1: As Mayur mentioned, you can do a regex to pick up
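
Following the regex approach the answer points to, a minimal sketch (straight double quotes only; curly or nested quotes would need extra patterns):

import re

text = 'Alen said "text to be extracted." Bob replied "me too."'
print(re.findall(r'"([^"]*)"', text))  # ['text to be extracted.', 'me too.']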