nlp

How to write code and run Python files using spaCy? (on Windows)

Submitted by 天涯浪子 on 2019-12-11 15:27:41
Question: I want to implement a new language model for spaCy. I have installed spaCy (following the guide on the official website) on my Windows OS, but I haven't understood where and how I can write and run my future files. Help me, thanks. Answer 1: I hope I understand your question correctly: if you only want to use spaCy, you can simply create a Python file, import spacy and run it. However, if you want to add things to the spaCy source – for example, to add new language data that doesn't yet exist – you …
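A minimal sketch of the first case (just using spaCy from your own file): save the code below as, say, `demo.py` and run `python demo.py` from a command prompt. `spacy.blank("en")` builds a tokenizer-only pipeline, so no model download is needed; the filename and the sample sentence are illustrations, not anything from the original question.

```python
# demo.py -- run from a command prompt with: python demo.py
import spacy

# blank English pipeline: tokenizer only, no trained model required
nlp = spacy.blank("en")

doc = nlp("Hello spaCy on Windows!")
tokens = [t.text for t in doc]
print(tokens)
```

For a trained pipeline you would instead install one (`python -m spacy download en_core_web_sm`) and load it with `spacy.load("en_core_web_sm")`.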

How to use a bigrams + trigrams + word-marks vocabulary in CountVectorizer?

Submitted by 流过昼夜 on 2019-12-11 15:17:10
Question: I'm using text classification with naive Bayes and CountVectorizer to classify dialects. I read a research paper in which the author used a combination of bigrams + trigrams + a word-marks vocabulary. By word-marks he means words that are specific to a certain dialect. How can I tweak those parameters in CountVectorizer? Those are examples of word marks, but they aren't what I have, because mine are Arabic, so I translated them. word_marks=['love', 'funny', 'happy', …
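One way to combine the two feature sets, sketched with scikit-learn's `FeatureUnion`: one `CountVectorizer` learns bigrams + trigrams from the corpus, and a second one is restricted to the word marks via its `vocabulary` parameter. The toy documents and the three translated word marks below are stand-ins, not the question's actual Arabic data.

```python
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical stand-ins for the translated word marks
word_marks = ['love', 'funny', 'happy']

combined = FeatureUnion([
    # bigrams + trigrams, vocabulary learned from the corpus
    ('ngrams', CountVectorizer(ngram_range=(2, 3))),
    # unigram counts restricted to the word-mark list only
    ('marks', CountVectorizer(vocabulary=word_marks)),
])

docs = ['love this funny happy movie', 'not funny at all']  # toy corpus
X = combined.fit_transform(docs)  # ngram columns first, then the 3 mark columns
```

The resulting sparse matrix can be fed directly to `MultinomialNB`; the last `len(word_marks)` columns are the word-mark counts, in the order given in the list.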

Removing commas and unlisting a dataframe

Submitted by 随声附和 on 2019-12-11 15:08:22
Question: Background: I have the following sample df: import pandas as pd df = pd.DataFrame({'Before' : [['there, are, many, different'], ['i, like, a, lot, of, sports '], ['the, middle, east, has, many']], 'After' : [['in, the, bright, blue, box'], ['because, they, go, really, fast'], ['to, ride, and, have, fun'] ], 'P_ID': [1,2,3], 'Word' : ['crayons', 'cars', 'camels'], 'N_ID' : ['A1', 'A2', 'A3'] }) Output: After Before N_ID P_ID Word 0 [in, the, bright, blue, box] [there, are, many, different] A1 1 …
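A sketch of one way to clean these columns, assuming each cell holds a one-element list containing a single comma-separated string (as in the sample df): split on commas and strip whitespace.

```python
import pandas as pd

df = pd.DataFrame({
    'Before': [['there, are, many, different'],
               ['i, like, a, lot, of, sports '],
               ['the, middle, east, has, many']],
    'After': [['in, the, bright, blue, box'],
              ['because, they, go, really, fast'],
              ['to, ride, and, have, fun']],
    'P_ID': [1, 2, 3],
    'Word': ['crayons', 'cars', 'camels'],
    'N_ID': ['A1', 'A2', 'A3']})

def unlist(cell):
    # each cell is a one-element list holding one comma-separated string
    return [w.strip() for w in cell[0].split(',')]

for col in ['Before', 'After']:
    df[col] = df[col].apply(unlist)
```

After this, `df['Before'][0]` is a proper list of words, `['there', 'are', 'many', 'different']`, with the stray trailing space in `'sports '` removed as well.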

How to get similar words related to one word?

Submitted by 橙三吉。 on 2019-12-11 15:03:19
Question: I am trying to solve an NLP problem where I have a dict of words like: list_1={'phone': 'android', 'chair': 'netflit', 'charger': 'macbook', 'laptop': 'sony'} Now, if the input is 'phone', I can easily use the 'in' operator to get the description of phone and its data by key, but the problem is if the input is something like 'phones' or 'Phones'. I want: if I input 'phone', then I also get matches for 'Phones', 'phones', 'Phone', "Phone's", "phone's". I don't know which word2vec I can use and which NLP module can …
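Word2vec similarity is likely overkill here: the listed variants ('Phones', "phone's", 'Phone') differ only in case, plural, and possessive, so normalizing the query before the dict lookup already covers them. A sketch (note: the original dict literal is invalid Python, so `'laptop': 'sony'` below is an assumed pairing of the last two items, and the plural stripping is deliberately naive):

```python
# 'laptop': 'sony' is an assumed pairing; the original literal was invalid
list_1 = {'phone': 'android', 'chair': 'netflit',
          'charger': 'macbook', 'laptop': 'sony'}

def normalize(word):
    """Reduce surface variants ('Phones', "phone's") to a base key."""
    w = word.lower()
    for suffix in ("'s", "\u2019s"):      # straight and curly apostrophes
        if w.endswith(suffix):
            w = w[:-len(suffix)]
    if w.endswith('s') and not w.endswith('ss'):
        w = w[:-1]                        # naive plural stripping
    return w

def lookup(word):
    return list_1.get(normalize(word))
```

For irregular plurals or richer morphology, a real lemmatizer (e.g. NLTK's WordNet lemmatizer or spaCy's `token.lemma_`) would replace `normalize`.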

Extracting n-grams from tweets in Python

Submitted by 主宰稳场 on 2019-12-11 14:56:15
Question: Say that I have 100 tweets. From those tweets, I need to extract: 1) food names, and 2) beverage names. Example of a tweet: "Yesterday I had a coca cola, and a hot dog for lunch, and some bana split for desert. I liked the coke, but the banana in the banana split dessert was ripe" I have at my disposal two lexicons: one with food names, and one with beverage names. Example entries in the food-names lexicon: "hot dog", "banana", "banana split". Example entries in the beverage-names lexicon: "coke", "cola", "coca cola". What I …
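A lexicon lookup with greedy longest-match over token windows handles the overlaps in the example, so "banana split" beats "banana" and "coca cola" beats "cola". A sketch (exact matching only, so the misspelled "bana" in the tweet is still missed; catching it would need fuzzy matching on top):

```python
food_lexicon = {'hot dog', 'banana', 'banana split'}
beverage_lexicon = {'coke', 'cola', 'coca cola'}

def extract(text, lexicon):
    """Greedy longest-match of lexicon phrases over the token stream."""
    tokens = text.lower().replace(',', ' ').replace('.', ' ').split()
    max_len = max(len(phrase.split()) for phrase in lexicon)
    found, i = [], 0
    while i < len(tokens):
        # try the longest window first so 'coca cola' beats 'cola'
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = ' '.join(tokens[i:i + n])
            if candidate in lexicon:
                found.append(candidate)
                i += n
                break
        else:
            i += 1
    return found

tweet = ("Yesterday I had a coca cola, and a hot dog for lunch, and some bana "
         "split for desert. I liked the coke, but the banana in the banana "
         "split dessert was ripe")
```

On the example tweet this yields `['coca cola', 'coke']` for beverages and `['hot dog', 'banana', 'banana split']` for foods.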

spaCy rule matcher on a unit of measure before or after a digit

Submitted by 霸气de小男生 on 2019-12-11 14:32:00
Question: I am new to spaCy and I am trying to match some measurements in text. My problem is that the unit of measure sometimes comes before and sometimes after the value; in some other cases it has a different name. Here is some code: nlp = spacy.load('en_core_web_sm') # case 1: text = "the surface is 31 sq" # case 2: # text = "the surface is sq 31" # case 3: # text = "the surface is square meters 31" # case 4: # text = "the surface is 31 square meters" # case 5: # text = "the surface is about 31 …
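A sketch of cases 1-4 with spaCy's rule-based `Matcher`: an `"OP": "?"` (optional) token covers the alternative unit name "square meters", and each pattern is written twice, value-then-unit and unit-then-value. A blank pipeline is enough because `LIKE_NUM` and `LOWER` are lexical attributes; the unit spellings are just the ones from the examples.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only; LIKE_NUM needs no trained model
matcher = Matcher(nlp.vocab)

num = {"LIKE_NUM": True}
unit = {"LOWER": {"IN": ["sq", "square"]}}
meters = {"LOWER": "meters", "OP": "?"}      # optional second unit token

matcher.add("VALUE_UNIT", [[num, unit, meters]])  # "31 sq", "31 square meters"
matcher.add("UNIT_VALUE", [[unit, meters, num]])  # "sq 31", "square meters 31"

def find_measurements(text):
    doc = nlp(text)
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    # drop overlapping sub-matches, keeping the longest span
    return [s.text for s in spacy.util.filter_spans(spans)]
```

Case 5 ("about 31 ...") would just need one more optional token such as `{"LOWER": "about", "OP": "?"}` in front of the number.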

How to extract sentence containing a particular word from millions of paragraphs

Submitted by 安稳与你 on 2019-12-11 13:52:42
Question: I scraped millions of newspaper articles using Python Scrapy. Now I want to extract the sentences containing a given word. Below is my implementation: import nltk tokenizer = nltk.data.load('tokenizers/punkt/english.pickle') for a in articles: article_sentence = tokenizer.tokenize(a) for s in article_sentence: for w in words: if ' '+w+' ' in s: sentences[w].append(s) I have around ~1000 words. The above code is not efficient and takes a lot of time. Also, a sentence can contain a root word in …
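The inner loop over ~1000 words per sentence is the bottleneck. One compiled regex alternation scans each sentence once instead, and `\b` word boundaries avoid the fragile `' '+w+' '` trick (which misses words at sentence edges or next to punctuation). A sketch with a hypothetical three-word list and a naive sentence splitter standing in for the punkt tokenizer:

```python
import re
from collections import defaultdict

# hypothetical target words; in practice this is the ~1000-word list
words = {'banana', 'election', 'climate'}

# one compiled alternation scans each sentence once, instead of
# looping over every word for every sentence
word_re = re.compile(r'\b(' + '|'.join(map(re.escape, sorted(words))) + r')\b')

def index_sentences(articles):
    sentences = defaultdict(list)
    for article in articles:
        # naive sentence split; nltk's punkt tokenizer is more robust
        for sent in re.split(r'(?<=[.!?])\s+', article):
            for w in set(word_re.findall(sent.lower())):
                sentences[w].append(sent)
    return sentences
```

For root-word matching (the truncated last point), the same approach works after stemming both the word list and the sentence tokens with the same stemmer.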

IR and QA - Beginner Project Scope

Submitted by 五迷三道 on 2019-12-11 13:31:13
Question: I have been brainstorming for an undergraduate project in the question-answering domain, a project that has components of both IR and NLP. The first thing that popped up was, of course, factoid question answering, but that seemed to be an already-conquered problem (IBM Watson!). Non-factoid QA seems interesting, so I took it up. Now we are in the scope-it-out phase of the project description. So, from the ambitious goal of answering any question put up by the user, I need to narrow our project down. So I …

Detecting a POS tag pattern along with specified words

Submitted by 家住魔仙堡 on 2019-12-11 12:52:43
Question: I need to identify certain POS tags before/after certain specified words. For example, the following tagged sentence: [('This', 'DT'), ('feature', 'NN'), ('would', 'MD'), ('be', 'VB'), ('nice', 'JJ'), ('to', 'TO'), ('have', 'VB')] can be abstracted to the form "would be" + adjective. Similarly: [('I', 'PRP'), ('am', 'VBP'), ('able', 'JJ'), ('to', 'TO'), ('delete', 'VB'), ('the', 'DT'), ('group', 'NN'), ('functionality', 'NN')] is of the form "am able to" + verb. How can I go about checking for …
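Both abstractions are the same check: a fixed word sequence immediately followed by a token with a given tag. A sketch over the `(word, tag)` lists from the question (the function name and parameters are illustrative):

```python
def match_pattern(tagged, lexical, tags_after):
    """True if the tokens in `lexical` appear consecutively and the
    token right after them carries a POS tag in `tags_after`."""
    words = [w.lower() for w, _ in tagged]
    n = len(lexical)
    for i in range(len(tagged) - n):
        if words[i:i + n] == lexical and tagged[i + n][1] in tags_after:
            return True
    return False

sent1 = [('This', 'DT'), ('feature', 'NN'), ('would', 'MD'), ('be', 'VB'),
         ('nice', 'JJ'), ('to', 'TO'), ('have', 'VB')]
sent2 = [('I', 'PRP'), ('am', 'VBP'), ('able', 'JJ'), ('to', 'TO'),
         ('delete', 'VB'), ('the', 'DT'), ('group', 'NN'),
         ('functionality', 'NN')]
```

Here `match_pattern(sent1, ['would', 'be'], {'JJ'})` captures the '"would be" + adjective' form, and `match_pattern(sent2, ['am', 'able', 'to'], {'VB'})` the '"am able to" + verb' form.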

NLTK identifies verbs as nouns in imperatives

Submitted by 安稳与你 on 2019-12-11 12:07:49
Question: I am using the NLTK POS tagger as below: sent1='get me now' sent2='run fast' tags=pos_tag(word_tokenize(sent2)) print tags [('run', 'NN'), ('fast', 'VBD')] I found a similar post, NLTK Thinks that Imperatives are Nouns, which suggests adding the word to a dictionary as a verb. The problem is that I have too many such unknown words. But one clue I have: they always appear at the start of a phrase, e.g. 'Download now', 'Book it now', 'Sign up'. How can I correctly assist NLTK to produce the correct result? Answer 1: …
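One pragmatic workaround that exploits the clue in the question (the problem words always start the phrase): post-correct the tagger's output with a rule instead of extending a dictionary word by word. A sketch of such a heuristic; it will over-correct phrases that genuinely start with a noun, so treat it as a baseline, not a fix to the tagger itself.

```python
def fix_imperative(tagged):
    """If the first token of a phrase was tagged as a noun, assume the
    phrase is imperative and retag that token as a base-form verb (VB)."""
    if tagged and tagged[0][1].startswith('NN'):
        return [(tagged[0][0], 'VB')] + tagged[1:]
    return tagged
```

Applied to the `pos_tag` output from the question, `fix_imperative([('run', 'NN'), ('fast', 'VBD')])` retags 'run' as VB and leaves already-correct phrases untouched.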