nltk

How to extract countries from a text?

你离开我真会死。 submitted on 2020-07-20 08:21:12
Question: I use Python 3 (I also have Python 2 installed) and I want to extract countries or cities from a short text. For example, text = "I live in Spain" or text = "United States (New York), United Kingdom (London)". The expected answers for countries are Spain and [United States, United Kingdom], respectively. I tried to install geography, but pip install geography fails with this error: Collecting geography Could not find a version that satisfies the requirement geography (from versions: ) No matching
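One dependency-free way to approach this is a plain gazetteer lookup. The sketch below is only an illustration, not what the geography package does: the COUNTRIES list is a hypothetical, deliberately tiny sample, and extract_countries is a name made up here.

```python
import re

# Hypothetical, deliberately tiny gazetteer; a real one would list every country.
COUNTRIES = ["United States", "United Kingdom", "Spain", "France"]

# Longest names first, so a longer name is preferred over a shorter overlapping one.
_ALTERNATIVES = "|".join(
    re.escape(name) for name in sorted(COUNTRIES, key=len, reverse=True)
)
_PATTERN = re.compile(r"\b(" + _ALTERNATIVES + r")\b", re.IGNORECASE)

# Map lower-cased matches back to the spelling used in the gazetteer.
_CANONICAL = {name.lower(): name for name in COUNTRIES}


def extract_countries(text):
    """Return the gazetteer countries found in text, in order of appearance."""
    return [_CANONICAL[m.group(1).lower()] for m in _PATTERN.finditer(text)]


print(extract_countries("I live in Spain"))
print(extract_countries("United States (New York), United Kingdom (London)"))
```

A lookup like this only finds names that are in the list, so aliases such as "USA" or "UK" would need their own entries.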

Extract entities from Multiple Subject passive sentence by Spacy

坚强是说给别人听的谎言 submitted on 2020-06-27 04:33:20
Question: Using Python spaCy, I am trying to extract entities from a passive-voice sentence with multiple subjects. Sentence = "John and Jenny were accused of crimes by David" My intention is to extract both "John and Jenny" from the sentence as nsubjpass and .ent_ . However, I am only able to extract "John" as nsubjpass. How can I extract both of them? Notice that while John is found as an entity in .ents, Jenny is considered conj instead of nsubjpass. How can I improve this? code each_sentence3 = "John and Jenny
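One possible direction, sketched under assumptions: spaCy tokens expose a conjuncts attribute holding the tokens coordinated with them, so the conj-labelled "Jenny" can be reached from the nsubjpass token "John". The helper name passive_subjects is made up here, and the example runs on simple stand-in objects that carry only the attributes used (text, dep_, conjuncts), so it does not need spaCy or a model installed. With a real parse you would pass the Doc instead.

```python
from types import SimpleNamespace


def passive_subjects(tokens):
    """Collect every nsubjpass token plus the tokens coordinated with it."""
    subjects = []
    for token in tokens:
        if token.dep_ == "nsubjpass":
            subjects.append(token)
            # In spaCy, Token.conjuncts holds the coordinated tokens
            # (here: "Jenny", which the parser labels conj).
            subjects.extend(token.conjuncts)
    return subjects


# Stand-in tokens mimicking the parse described in the question (assumption:
# only the attributes used above are modelled).
jenny = SimpleNamespace(text="Jenny", dep_="conj", conjuncts=())
john = SimpleNamespace(text="John", dep_="nsubjpass", conjuncts=(jenny,))
and_ = SimpleNamespace(text="and", dep_="cc", conjuncts=())
accused = SimpleNamespace(text="accused", dep_="ROOT", conjuncts=())

names = [t.text for t in passive_subjects([john, and_, jenny, accused])]
print(names)
```

Whether the parser actually attaches "Jenny" to "John" as a conjunct depends on the model, so the result is worth checking on real output.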

Extracting Key-Phrases from text based on the Topic with Python

冷暖自知 submitted on 2020-06-24 14:57:09
Question: I have a large dataset with 3 columns: text, phrase and topic. I want to find a way to extract key-phrases (the phrase column) based on the topic. A key-phrase can be part of the text value or the whole text value. import pandas as pd text = ["great game with a lot of amazing goals from both teams", "goalkeepers from both teams made misteke", "he won all four grand slam championchips", "the best player from three-point line", "Novak Djokovic is the best player of all time", "amazing

Generating dictionaries to categorize tweets into pre-defined categories using NLTK

给你一囗甜甜゛ submitted on 2020-06-24 12:21:19
Question: I have a list of Twitter users (screen_names) and I need to categorise them into 7 pre-defined categories (Education, Art, Sports, Business, Politics, Automobiles, Technology) based on their area of interest. I have extracted the last 100 tweets of each user in Python and created a corpus for each user after cleaning the tweets. As mentioned here Tweet classification into multiple categories on (Unsupervised data/tweets) : I am trying to generate dictionaries of common words under each category so

NLTK: set proxy server

折月煮酒 submitted on 2020-06-24 05:10:19
Question: I'm trying to learn NLTK (the Natural Language Toolkit, written in Python) and I want to install a sample data set to run some examples. My web connection uses a proxy server, and I'm trying to specify the proxy address as follows: >>> nltk.set_proxy('http://proxy.example.com:3128' ('USERNAME', 'PASSWORD')) >>> nltk.download() But I get an error: Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: 'str' object is not callable I decided to set up a ProxyBasicAuthHandler
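The TypeError in the traceback most likely comes from the missing comma: a string followed directly by parentheses is parsed as a call on the string. The sketch below reproduces that without touching the network. set_proxy_stub is a hypothetical stand-in that, to my understanding, mirrors the parameters of nltk.set_proxy in recent NLTK 3 releases (proxy, user, password); check the signature in your installed version.

```python
def set_proxy_stub(proxy, user=None, password=""):
    """Hypothetical stand-in for nltk.set_proxy; it only echoes its arguments."""
    return proxy, user, password


proxy_url = "http://proxy.example.com:3128"

# Without a comma, Python tries to call the string itself.
try:
    set_proxy_stub(proxy_url("USERNAME", "PASSWORD"))
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# With the arguments separated by commas, the call goes through.
# Assumed real-world equivalent:
#   nltk.set_proxy(proxy_url, "USERNAME", "PASSWORD")
result = set_proxy_stub(proxy_url, "USERNAME", "PASSWORD")
print(result)
```

Older NLTK releases took the credentials as a single (user, password) tuple, so the exact form of the second argument depends on the version.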

Parsing city of origin / destination city from a string

*爱你&永不变心* submitted on 2020-06-22 06:54:13
Question: I have a pandas dataframe where one column is a bunch of strings with certain travel details. My goal is to parse each string to extract the city of origin and the destination city (I would ultimately like to have two new columns titled 'origin' and 'destination'). The data: df_col = [ 'new york to venice, italy for usd271', 'return flights from brussels to bangkok with etihad from €407', 'from los angeles to guadalajara, mexico for usd191', 'fly to australia new zealand from paris from €422
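For strings this regular, two regular expressions already go a long way. The following is a rough sketch that covers only the four sample strings visible above. The names parse_route, TO_FROM and FROM_TO are made up for illustration, and the patterns assume lower-case place names made of letters and spaces.

```python
import re

# A place name ends at a comma, at " for"/" with"/" from", or at end of string.
_STOP = r"(?=,|\s+(?:for|with|from)\b|$)"
_PLACE = r"[a-z ]+?"

# "... to <destination> from <origin> ..."
TO_FROM = re.compile(
    rf"\bto\s+(?P<destination>{_PLACE})\s+from\s+(?P<origin>{_PLACE}){_STOP}"
)
# "[... from] <origin> to <destination> ..."
FROM_TO = re.compile(
    rf"^(?:.*?\bfrom\s+)?(?P<origin>{_PLACE})\s+to\s+(?P<destination>{_PLACE}){_STOP}"
)


def parse_route(text):
    """Return (origin, destination), or (None, None) if no pattern matches."""
    lowered = text.lower()
    for pattern in (TO_FROM, FROM_TO):
        match = pattern.search(lowered)
        if match:
            return match.group("origin"), match.group("destination")
    return None, None


samples = [
    "new york to venice, italy for usd271",
    "return flights from brussels to bangkok with etihad from €407",
    "from los angeles to guadalajara, mexico for usd191",
    "fly to australia new zealand from paris from €422",
]
for sample in samples:
    print(parse_route(sample))
```

With pandas, the two columns could then be filled by applying parse_route to the text column. Free-form phrasing beyond these patterns would probably need a gazetteer or a named-entity recognizer instead.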

Inverse Document Frequency Formula

谁说胖子不能爱 submitted on 2020-06-15 07:25:38
Question: I'm having trouble manually calculating the values for tf-idf. Python's scikit-learn keeps producing different values than I'd expect. I keep reading that idf(term) = log(# of docs / # of docs with term). If so, won't you get a divide-by-zero error if there are no docs with the term? To solve that problem, I read that you use log(# of docs / (# of docs with term + 1)). But then if the term is in every document, you get log(n / (n + 1)), which is negative, which doesn't really make sense to me. What am
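The three variants can be compared side by side. The sketch below uses made-up function names. The smoothed form, ln((1 + n) / (1 + df)) + 1, is the one scikit-learn's TfidfTransformer documents for its default smooth_idf=True. scikit-learn also L2-normalizes each row by default, which is another common reason hand-computed values differ.

```python
import math


def idf_textbook(n_docs, doc_freq):
    """ln(n / df): fails with ZeroDivisionError when doc_freq is 0."""
    return math.log(n_docs / doc_freq)


def idf_plus_one_denominator(n_docs, doc_freq):
    """ln(n / (df + 1)): avoids division by zero but goes negative when df == n."""
    return math.log(n_docs / (doc_freq + 1))


def idf_sklearn_smooth(n_docs, doc_freq):
    """ln((1 + n) / (1 + df)) + 1: never negative, defined for df == 0."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


n = 4
for df in (0, 1, 4):
    print(df, idf_plus_one_denominator(n, df), idf_sklearn_smooth(n, df))
```

Adding 1 to both numerator and denominator acts like one extra document containing every term, and the trailing + 1 keeps terms that occur in all documents from being zeroed out.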

Split Sentences at Bullets and Numbering?

坚强是说给别人听的谎言 submitted on 2020-06-09 03:42:24
Question: I am trying to input text into my word processor so that it is split into sentences first and then into words. An example paragraph: When the blow was repeated,together with an admonition in childish sentences, he turned over upon his back, and held his paws in a peculiar manner. 1) This a numbered sentence 2) This is the second numbered sentence At the same time with his ears and his eyes he offered a small prayer to the child. Below are the examples - This an example of bullet point sentence - This
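One simple option is to pre-split the text at list markers before handing each piece to a sentence tokenizer. A minimal sketch, assuming markers look like "1) " or a free-standing "- " as in the sample paragraph; MARKER and split_at_markers are illustrative names.

```python
import re

# Split on whitespace that is immediately followed by "<digits>) " or "- ".
# The marker itself is kept because it sits inside a lookahead.
MARKER = re.compile(r"\s+(?=(?:\d+\)|-)\s)")


def split_at_markers(text):
    """Split text into chunks that each start at a numbering or bullet marker."""
    return [part.strip() for part in MARKER.split(text) if part.strip()]


numbered = (
    "held his paws in a peculiar manner. "
    "1) This a numbered sentence 2) This is the second numbered sentence"
)
bulleted = "Below are the examples - This an example of bullet point sentence - This is also"

print(split_at_markers(numbered))
print(split_at_markers(bulleted))
```

Hyphens inside words are left alone because they have no surrounding whitespace. The end of an unpunctuated list item, however, cannot be detected this way and would need a line break or punctuation in the input.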