nlp

Converting list of strings with u'…' to a list of normal strings [duplicate]

Question: This question already has answers here: What's the u prefix in a Python string? (6 answers). I'm a newbie in Python, and apologies for a very basic question. I'm working with the Python pattern.en library and trying to get the synonyms of a word. This is my code, and it works fine:

    from pattern.en import wordnet
    a = wordnet.synsets('human')
    print a[0].synonyms

This is the output I get:

    [u'homo', u'man', u'human being', u'human']

but for my program I need to insert …
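
A minimal sketch of the usual answer: in Python 2 the u'' prefix just marks a unicode object, and encoding each element yields plain byte strings. The list literal below stands in for the pattern.en output.

    # Python 2: u'...' marks unicode objects; encode to get plain str values.
    synonyms = [u'homo', u'man', u'human being', u'human']
    plain = [s.encode('utf-8') for s in synonyms]
    print plain  # ['homo', 'man', 'human being', 'human']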

Add categorical variable(gender) to Sparse Matrix for Multiclass Classification using sklearn

Question: I am building a multiclass classification model using sklearn. I convert my tweets into a 571x1815 sparse matrix with 34737 stored elements in Compressed Sparse Row format. I am trying to predict age groups based on tweet history, but I want to add an exogenous categorical variable (gender) to my sparse matrix and then use either a Decision Tree or a Random Forest to do my prediction. How do I add a vector to a sparse matrix?

    def vectorize(df):
        bow_transformer = CountVectorizer …
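
A sketch of the standard approach using scipy.sparse.hstack, which appends the gender column without densifying the matrix; the variable names and random data below are placeholders for the real features.

    import numpy as np
    from scipy.sparse import csr_matrix, hstack

    # Placeholder stand-ins for the real tweet features and gender column.
    X = csr_matrix(np.random.rand(571, 1815))    # 571x1815 CSR matrix
    gender = np.random.randint(0, 2, size=571)   # 0/1 encoded gender

    # Stack gender on as one extra column, keeping everything sparse.
    X_plus = hstack([X, csr_matrix(gender).T]).tocsr()
    print(X_plus.shape)  # (571, 1816)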

Extract Word from Synset using Wordnet in NLTK 3.0

Question: Some time ago, someone on SO asked how to retrieve a list of words for a given synset using NLTK's wordnet wrapper. Here is one of the suggested responses:

    for synset in wn.synsets('dog'):
        print synset.lemmas[0].name

Running this code with NLTK 3.0 yields TypeError: 'instancemethod' object is not subscriptable. I tried each of the previously proposed solutions (each of the solutions described on the page linked above), but each throws an error. I therefore wanted to ask: Is it possible to …
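
In NLTK 3.0, Synset.lemmas and Lemma.name became methods rather than attributes, so the fix is to call them. A minimal sketch:

    from nltk.corpus import wordnet as wn

    # NLTK 3.0: lemmas and name are methods, hence the parentheses.
    for synset in wn.synsets('dog'):
        print(synset.lemmas()[0].name())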

POS tagging using nltk takes time

Question: I am trying to get POS tags using nltk. I think it should take less than or around 1 second to process a small text, but for 2-3 sentences it takes 20-25 seconds.

    import nltk, re, time

    def findPos(text):
        start_time = time.time()
        try:
            tokens = nltk.word_tokenize(text)
            pos_tags = nltk.pos_tag(tokens)
            print [x[0] for x in pos_tags if x[1] == "NN" or "NNP"]
        except Exception:
            import traceback
            traceback.format_exc()
        print("--- %s seconds ---" % (time.time() - start_time))

    findPos(raw_input())

Any …
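
Two likely issues, sketched below on the assumption that the slowness comes from model loading: the first nltk.pos_tag call loads the tagger model from disk, so timing should exclude that warm-up; and the filter x[1] == "NN" or "NNP" is always truthy, because the non-empty string "NNP" evaluates as True.

    import nltk, time

    # Warm up once: the first pos_tag call loads the tagger model from disk.
    nltk.pos_tag(nltk.word_tokenize("warm up"))

    start = time.time()
    tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog.")
    pos_tags = nltk.pos_tag(tokens)
    # Membership test instead of the always-true `== "NN" or "NNP"`.
    print([w for w, tag in pos_tags if tag in ("NN", "NNP")])
    print("--- %s seconds ---" % (time.time() - start))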

NLP - Sentence Segmentation

Question: I am a newbie trying my hand at sentence segmentation in NLP. I am aware that tokenizers are available for this in NLTK, but I want to build my own sentence segmenter using a machine learning algorithm such as a Decision Tree. However, I am not able to gather training data for it. What should the data look like? How should it be labelled, since I want to try supervised learning first? Is any sample data already available? Any help will be useful. I searched the net for nearly a week and am now posting the …
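
One common framing, sketched here as an assumption rather than a prescription: treat every '.', '?' and '!' in a hand-labelled text as a candidate boundary, extract features around it, and label it 1 (sentence end) or 0 (not). The feature names below are invented for illustration; the (features, label) pairs can then feed any decision-tree learner.

    # Each candidate boundary character becomes one (features, label) example.
    def boundary_features(text, i):
        before = text[:i].split()
        after = text[i + 1:].split()
        prev_word = before[-1] if before else ""
        next_word = after[0] if after else ""
        return {
            "prev_word": prev_word.lower(),
            "prev_len": len(prev_word),  # short words like "Dr" hint at abbreviations
            "next_capitalized": next_word[:1].isupper(),
            "char": text[i],
        }

    text = "Dr. Smith went home. He was tired."
    labels = {2: 0, 19: 1, 33: 1}  # hand labels: char index -> boundary or not
    train = [(boundary_features(text, i), labels[i])
             for i, ch in enumerate(text) if ch in ".?!"]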

Failure using CRF++ 0.58 to train an NE model

Question: When I use CRF++ 0.58 to train an NE model, the program has a problem: "reading training data: tagger.cpp(399) [feature_index_->buildFeatures(this)] 0.00s". Development environment: Red Hat Linux 6.5, gcc 5.0, CRF++ 0.58. Feature template file: template. Dataset: Boson_train.txt and Boson_test.txt; the first column is the word, the second column is the POS tag, the third column is the NER tag. The problem: when I want to train the NER model, I type "crf_learn -f 3 -c 4.0 template Boson_train crf_model", and …
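
This buildFeatures error often points at a template/data mismatch, for example an empty or malformed template file, or %x column indices that exceed the number of data columns. For reference, a minimal CRF++ unigram template for a 3-column word/POS/tag corpus might look like the sketch below; the U-ids are arbitrary labels.

    # Minimal CRF++ template sketch for columns word(0), POS(1), NER tag.
    # %x[row,col] = token `row` lines away, 0-based column `col`; the tag
    # column itself is never referenced in the template.
    U00:%x[-1,0]
    U01:%x[0,0]
    U02:%x[1,0]
    U03:%x[0,1]
    U04:%x[-1,1]/%x[0,1]

    # B turns on bigram (tag-transition) features.
    B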

NLTK - Chunk grammar doesn't read commas

Question:

    import nltk
    from nltk.chunk.util import tagstr2tree
    from nltk import word_tokenize, pos_tag

    text = "John Rose Center is very beautiful place and i want to go there with Barbara Palvin. Also there are stores like Adidas ,Nike ,Reebok Center."
    tagged_text = pos_tag(text.split())
    grammar = "NP:{<NNP>+}"
    cp = nltk.RegexpParser(grammar)
    result = cp.parse(tagged_text)
    print(result)

Output:

    (S (NP John/NNP Rose/NNP Center/NNP) is/VBZ very/RB beautiful/JJ place/NN and/CC i/NN want/VBP to/TO go/VB there/RB with …
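
The usual diagnosis is that str.split leaves the commas glued to the following words (",Nike"), so they get tagged as part of an NNP token; tokenizing with word_tokenize gives the commas their own ,/, tags. A sketch of that fix:

    import nltk
    from nltk import word_tokenize, pos_tag

    # word_tokenize separates punctuation, so commas become ,/, tokens
    # instead of staying attached to names like ",Nike".
    text = ("John Rose Center is very beautiful place and i want to go there "
            "with Barbara Palvin. Also there are stores like Adidas ,Nike ,Reebok Center.")
    tagged_text = pos_tag(word_tokenize(text))
    cp = nltk.RegexpParser("NP: {<NNP>+}")
    print(cp.parse(tagged_text))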

Using DocumentTermMatrix on a Vector of First and Last Names

Question: I have a column in my data frame (df) as follows:

    > people = df$people
    > people[1:3]
    [1] "Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner"
    [2] "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden"
    [3] "John Travolta, Uma Thurman, Vince Vaughn, Cedric the Entertainer"

The column has 4k+ unique first/last/nick names, with each row holding a list of full names as shown above. I would like to create a DocumentTermMatrix for this column where full-name matches are found and only the names that …
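
The question itself is about R's tm package, but the core idea, tokenizing on commas so each full name becomes a single term, is language-agnostic. Here it is sketched with sklearn's CountVectorizer, consistent with the Python used elsewhere on this page; the swap of library is mine, not the asker's.

    from sklearn.feature_extraction.text import CountVectorizer

    rows = [
        "Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner",
        "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden",
        "John Travolta, Uma Thurman, Vince Vaughn, Cedric the Entertainer",
    ]

    # Split on commas so "Tara Reid" is one term, not two word tokens.
    vec = CountVectorizer(tokenizer=lambda s: [n.strip() for n in s.split(",")],
                          token_pattern=None, lowercase=False)
    dtm = vec.fit_transform(rows)  # 3 documents x one column per full name
    print(vec.get_feature_names_out())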

Proposed NLP algorithm for text tagging

Question: I was looking for an open-source tool that can identify tags for any user post on social media and classify comments on that post as on-topic, off-topic, or spam. Even after looking for an entire day, I could not find any suitable tool/library. Here I have proposed my own algorithm for tagging user posts into 7 categories (jobs, discussion, events, articles, services, buy/sell, talents). Initially, when a user makes a post, he tags his post. Tags can be like marketing, suggestion, …

How to search a corpus to find frequency of a string?

Question: I'm working on an NLP project and I'd like to search through a corpus of text to find the frequency of a given verb-object pair. The aim is to find which verb-object pair is most likely when given a few different possibilities. For example, given the strings "Swing the stick" and "Eat the stick", I would hope the corpus would show that it's much more likely for someone to swing a stick than to eat one. I've been reading about n-grams and corpus linguistics but I'm struggling to …
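
A crude but workable sketch, assuming NLTK's Brown corpus is available (nltk.download('brown')): count how often the verb appears within a few tokens before the object noun. Prefix matching stands in for real lemmatization, and a dependency parser would be more precise, but window counts already give a comparative signal between candidate pairs.

    from nltk.corpus import brown  # assumes nltk.download('brown') has been run

    def pair_count(verb, obj, window=3):
        """Count sentences where a `verb`-prefixed token occurs within `window` tokens before `obj`."""
        count = 0
        for sent in brown.sents():
            words = [w.lower() for w in sent]
            for i, w in enumerate(words):
                if w.startswith(verb) and obj in words[i + 1:i + 1 + window]:
                    count += 1
        return count

    print(pair_count("swing", "stick"))  # compare against:
    print(pair_count("eat", "stick"))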