nlp

Converting list of strings with u'…' to a list of normal strings [duplicate]

Question: This question already has answers here: What's the u prefix in a Python string? (6 answers). I'm a newbie in Python, and apologies for a very basic question. I'm working with the Python pattern.en library and trying to get the synonyms of a word. This is my code, and it works fine:

    from pattern.en import wordnet
    a = wordnet.synsets('human')
    print a[0].synonyms

This is the output I get:

    [u'homo', u'man', u'human being', u'human']

but for my program I need to insert …
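
A minimal sketch of the usual answer: in Python 2 the u'' prefix just marks a unicode object, and encoding each element yields plain byte strings. The list literal below stands in for the pattern.en output.

    # Python 2: u'...' marks unicode objects; encode to get plain str values.
    synonyms = [u'homo', u'man', u'human being', u'human']
    plain = [s.encode('utf-8') for s in synonyms]
    print plain  # ['homo', 'man', 'human being', 'human']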

Add categorical variable(gender) to Sparse Matrix for Multiclass Classification using sklearn

Question: I am building a multiclass classification model using sklearn. I convert my tweets into a 571x1815 sparse matrix with 34737 stored elements in Compressed Sparse Row format. I am trying to predict age groups based on tweet history, but I want to add an exogenous categorical variable (gender) to my sparse matrix and then use either a Decision Tree or a Random Forest to do my prediction. How do I add a vector to a sparse matrix?

    def vectorize(df):
        bow_transformer = CountVectorizer …
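
A sketch of the standard approach using scipy.sparse.hstack, which appends the gender column without densifying the matrix; the variable names and random data below are placeholders for the real features.

    import numpy as np
    from scipy.sparse import csr_matrix, hstack

    # Placeholder stand-ins for the real tweet features and gender column.
    X = csr_matrix(np.random.rand(571, 1815))    # 571x1815 CSR matrix
    gender = np.random.randint(0, 2, size=571)   # 0/1 encoded gender

    # Stack gender on as one extra column, keeping everything sparse.
    X_plus = hstack([X, csr_matrix(gender).T]).tocsr()
    print(X_plus.shape)  # (571, 1816)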

Extract Word from Synset using Wordnet in NLTK 3.0

Question: Some time ago, someone on SO asked how to retrieve a list of words for a given synset using NLTK's wordnet wrapper. Here is one of the suggested responses:

    for synset in wn.synsets('dog'):
        print synset.lemmas[0].name

Running this code with NLTK 3.0 yields TypeError: 'instancemethod' object is not subscriptable. I tried each of the previously proposed solutions (each of the solutions described on the page linked above), but each throws an error. I therefore wanted to ask: Is it possible to …
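
In NLTK 3.0, Synset.lemmas and Lemma.name became methods rather than attributes, so the fix is to call them. A minimal sketch:

    from nltk.corpus import wordnet as wn

    # NLTK 3.0: lemmas and name are methods, hence the parentheses.
    for synset in wn.synsets('dog'):
        print(synset.lemmas()[0].name())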

POS tagging using nltk takes time

Question: I am trying to get POS tags using nltk. I think it should take less than or around 1 second to process a small text, but for 2-3 sentences it takes 20-25 seconds.

    import nltk, re, time

    def findPos(text):
        start_time = time.time()
        try:
            tokens = nltk.word_tokenize(text)
            pos_tags = nltk.pos_tag(tokens)
            print [x[0] for x in pos_tags if x[1] == "NN" or "NNP"]
        except Exception:
            import traceback
            traceback.format_exc()
        print("--- %s seconds ---" % (time.time() - start_time))

    findPos(raw_input())

Any …
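
Two likely issues, sketched below on the assumption that the slowness comes from model loading: the first nltk.pos_tag call loads the tagger model from disk, so timing should exclude that warm-up; and the filter x[1] == "NN" or "NNP" is always truthy, because the non-empty string "NNP" evaluates as True.

    import nltk, time

    # Warm up once: the first pos_tag call loads the tagger model from disk.
    nltk.pos_tag(nltk.word_tokenize("warm up"))

    start = time.time()
    tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog.")
    pos_tags = nltk.pos_tag(tokens)
    # Membership test instead of the always-true `== "NN" or "NNP"`.
    print([w for w, tag in pos_tags if tag in ("NN", "NNP")])
    print("--- %s seconds ---" % (time.time() - start))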

NLP - Sentence Segmentation

Question: I am a newbie trying my hand at sentence segmentation in NLP. I am aware that tokenizers are available for this in NLTK, but I want to build my own sentence segmenter using a machine learning algorithm such as a Decision Tree. However, I am not able to gather training data for it. What should the data look like? How should it be labelled, since I want to try supervised learning first? Is any sample data already available? Any help will be useful. I searched the net for nearly a week and am now posting the …
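
One common framing, sketched here as an assumption rather than a prescription: treat every '.', '?' and '!' in a hand-labelled text as a candidate boundary, extract features around it, and label it 1 (sentence end) or 0 (not). The feature names below are invented for illustration; the (features, label) pairs can then feed any decision-tree learner.

    # Each candidate boundary character becomes one (features, label) example.
    def boundary_features(text, i):
        before = text[:i].split()
        after = text[i + 1:].split()
        prev_word = before[-1] if before else ""
        next_word = after[0] if after else ""
        return {
            "prev_word": prev_word.lower(),
            "prev_len": len(prev_word),  # short words like "Dr" hint at abbreviations
            "next_capitalized": next_word[:1].isupper(),
            "char": text[i],
        }

    text = "Dr. Smith went home. He was tired."
    labels = {2: 0, 19: 1, 33: 1}  # hand labels: char index -> boundary or not
    train = [(boundary_features(text, i), labels[i])
             for i, ch in enumerate(text) if ch in ".?!"]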

Failure using CRF++ 0.58 to train an NE model

Question: When I use CRF++ 0.58 to train an NE model, the program has a problem: "reading training data: tagger.cpp(399) [feature_index_->buildFeatures(this)] 0.00s". Development environment: Red Hat Linux 6.5, gcc 5.0, CRF++ 0.58. Feature template file: template. Dataset: Boson_train.txt and Boson_test.txt; the first column is the word, the second column is the POS tag, the third column is the NER tag. The problem: when I want to train the NER model, I type "crf_learn -f 3 -c 4.0 template Boson_train crf_model", and …
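
This buildFeatures error often points at a template/data mismatch, for example an empty or malformed template file, or %x column indices that exceed the number of data columns. For reference, a minimal CRF++ unigram template for a 3-column word/POS/tag corpus might look like the sketch below; the U-ids are arbitrary labels.

    # Minimal CRF++ template sketch for columns word(0), POS(1), NER tag.
    # %x[row,col] = token `row` lines away, 0-based column `col`; the tag
    # column itself is never referenced in the template.
    U00:%x[-1,0]
    U01:%x[0,0]
    U02:%x[1,0]
    U03:%x[0,1]
    U04:%x[-1,1]/%x[0,1]

    # B turns on bigram (tag-transition) features.
    B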

NLTK - Chunk grammar doesn't read commas

Question:

    import nltk
    from nltk.chunk.util import tagstr2tree
    from nltk import word_tokenize, pos_tag

    text = "John Rose Center is very beautiful place and i want to go there with Barbara Palvin. Also there are stores like Adidas ,Nike ,Reebok Center."
    tagged_text = pos_tag(text.split())
    grammar = "NP:{<NNP>+}"
    cp = nltk.RegexpParser(grammar)
    result = cp.parse(tagged_text)
    print(result)

Output:

    (S (NP John/NNP Rose/NNP Center/NNP) is/VBZ very/RB beautiful/JJ place/NN and/CC i/NN want/VBP to/TO go/VB there/RB with …
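
The usual diagnosis is that str.split leaves the commas glued to the following words (",Nike"), so they get tagged as part of an NNP token; tokenizing with word_tokenize gives the commas their own ,/, tags. A sketch of that fix:

    import nltk
    from nltk import word_tokenize, pos_tag

    # word_tokenize separates punctuation, so commas become ,/, tokens
    # instead of staying attached to names like ",Nike".
    text = ("John Rose Center is very beautiful place and i want to go there "
            "with Barbara Palvin. Also there are stores like Adidas ,Nike ,Reebok Center.")
    tagged_text = pos_tag(word_tokenize(text))
    cp = nltk.RegexpParser("NP: {<NNP>+}")
    print(cp.parse(tagged_text))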

Using DocumentTermMatrix on a Vector of First and Last Names

Question: I have a column in my data frame (df) as follows:

    > people = df$people
    > people[1:3]
    [1] "Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner"
    [2] "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden"
    [3] "John Travolta, Uma Thurman, Vince Vaughn, Cedric the Entertainer"

The column has 4k+ unique first/last/nick names, with each row holding a list of full names as shown above. I would like to create a DocumentTermMatrix for this column where full-name matches are found and only the names that …
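
The question itself is about R's tm package, but the core idea, tokenizing on commas so each full name becomes a single term, is language-agnostic. Here it is sketched with sklearn's CountVectorizer, consistent with the Python used elsewhere on this page; the swap of library is mine, not the asker's.

    from sklearn.feature_extraction.text import CountVectorizer

    rows = [
        "Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner",
        "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden",
        "John Travolta, Uma Thurman, Vince Vaughn, Cedric the Entertainer",
    ]

    # Split on commas so "Tara Reid" is one term, not two word tokens.
    vec = CountVectorizer(tokenizer=lambda s: [n.strip() for n in s.split(",")],
                          token_pattern=None, lowercase=False)
    dtm = vec.fit_transform(rows)  # 3 documents x one column per full name
    print(vec.get_feature_names_out())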

Proposed NLP algorithm for text tagging

Question: I was looking for an open-source tool that can identify tags for any user post on social media and classify comments on that post as on-topic, off-topic, or spam. Even after looking for an entire day, I could not find any suitable tool/library. Here I have proposed my own algorithm for tagging user posts into 7 categories (jobs, discussion, events, articles, services, buy/sell, talents). Initially, when a user makes a post, he tags his post. Tags can be like marketing, suggestion, …

How to search a corpus to find frequency of a string?

Question: I'm working on an NLP project and I'd like to search through a corpus of text to find the frequency of a given verb-object pair. The aim is to find which verb-object pair is most likely when given a few different possibilities. For example, given the strings "Swing the stick" and "Eat the stick", I would hope the corpus would show that it's much more likely for someone to swing a stick than to eat one. I've been reading about n-grams and corpus linguistics but I'm struggling to …
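
A crude but workable sketch, assuming NLTK's Brown corpus is available (nltk.download('brown')): count how often the verb appears within a few tokens before the object noun. Prefix matching stands in for real lemmatization, and a dependency parser would be more precise, but window counts already give a comparative signal between candidate pairs.

    from nltk.corpus import brown  # assumes nltk.download('brown') has been run

    def pair_count(verb, obj, window=3):
        """Count sentences where a `verb`-prefixed token occurs within `window` tokens before `obj`."""
        count = 0
        for sent in brown.sents():
            words = [w.lower() for w in sent]
            for i, w in enumerate(words):
                if w.startswith(verb) and obj in words[i + 1:i + 1 + window]:
                    count += 1
        return count

    print(pair_count("swing", "stick"))  # compare against:
    print(pair_count("eat", "stick"))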