nltk | 易学教程

How to extract countries from a text?

Question: I use Python 3 (I also have Python 2 installed) and I want to extract countries or cities from a short text. For example, text = "I live in Spain" or text = "United States (New York), United Kingdom (London)". The expected answers for countries are Spain and [United States, United Kingdom]. I tried to install geography, but pip install geography fails with this error: Collecting geography Could not find a version that satisfies the requirement geography (from versions: ) No matching
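For the two sample texts in the question, a dictionary lookup is enough to illustrate the idea. The sketch below is a minimal, standard-library-only illustration; the `COUNTRIES` list and the `extract_countries` helper are hypothetical names introduced here, and a real solution would need a complete country list from a maintained source.

```python
import re

# Hypothetical, deliberately tiny gazetteer (an assumption for illustration only).
COUNTRIES = ["Spain", "United States", "United Kingdom", "France"]

# Longest names first, so a multi-word name is tried before any shorter one.
_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(c) for c in sorted(COUNTRIES, key=len, reverse=True))
    + r")\b"
)


def extract_countries(text):
    """Return every known country name found in text, in order of appearance."""
    return _PATTERN.findall(text)


print(extract_countries("I live in Spain"))
print(extract_countries("United States (New York), United Kingdom (London)"))
```

This matches exact, case-sensitive names only; abbreviations such as "USA" or misspellings would need extra entries or fuzzy matching.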

Extract entities from Multiple Subject passive sentence by Spacy

Question: Using Python spaCy, I am trying to extract entities from a passive-voice sentence with multiple subjects. Sentence = "John and Jenny were accused of crimes by David" My intention is to extract both "John and Jenny" from the sentence as nsubjpass and .ent_ . However, I am only able to extract "John" as nsubjpass. How can I extract both of them? Notice that while John is found as an entity in .ents, Jenny is considered as conj instead of nsubjpass. How can I improve this? code each_sentence3 = "John and Jenny

Extracting Key-Phrases from text based on the Topic with Python

Question: I have a large dataset with 3 columns: text, phrase and topic. I want to find a way to extract key-phrases (the phrase column) based on the topic. A key-phrase can be part of the text value or the whole text value. import pandas as pd text = ["great game with a lot of amazing goals from both teams", "goalkeepers from both teams made misteke", "he won all four grand slam championchips", "the best player from three-point line", "Novak Djokovic is the best player of all time", "amazing

Generating dictionaries to categorize tweets into pre-defined categories using NLTK

Question: I have a list of Twitter users (screen_names) and I need to categorise them into 7 pre-defined categories (Education, Art, Sports, Business, Politics, Automobiles, Technology) based on their interest area. I have extracted the last 100 tweets of each user in Python and created a corpus for each user after cleaning the tweets. As mentioned here Tweet classification into multiple categories on (Unsupervised data/tweets) : I am trying to generate dictionaries of common words under each category so

NLTK: set proxy server

Question: I'm trying to learn NLTK (Natural Language Toolkit, written in Python) and I want to install a sample data set to run some examples. My web connection uses a proxy server, and I'm trying to specify the proxy address as follows: >>> nltk.set_proxy('http://proxy.example.com:3128' ('USERNAME', 'PASSWORD')) >>> nltk.download() But I get an error: Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: 'str' object is not callable I decided to set up a ProxyBasicAuthHandler
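The traceback is consistent with a missing comma: a string literal followed directly by a parenthesised tuple is parsed as a call on the string, hence "'str' object is not callable". Separating the arguments with a comma removes that error; how the credentials should then be passed depends on the `set_proxy` signature in the installed NLTK version, so check it there. The question also mentions ProxyBasicAuthHandler, so here is a minimal standard-library sketch of that route. The helper name `build_proxy_opener` is hypothetical, and the URL and credentials are the placeholders from the question.

```python
import urllib.request


def build_proxy_opener(proxy_url, user=None, password=""):
    """Build a urllib opener that routes HTTP(S) traffic through proxy_url."""
    handlers = [
        urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    ]
    if user is not None:
        # Register the credentials for the proxy, whatever realm it reports.
        manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
        manager.add_password(None, proxy_url, user, password)
        handlers.append(urllib.request.ProxyBasicAuthHandler(manager))
    return urllib.request.build_opener(*handlers)


opener = build_proxy_opener(
    "http://proxy.example.com:3128", "USERNAME", "PASSWORD"
)
# Uncomment to make this opener the process-wide default for urllib:
# urllib.request.install_opener(opener)
```

Building the opener does not touch the network, so the sketch can be run as-is; whether a later download succeeds depends on the proxy itself.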

Parsing city of origin / destination city from a string

Question: I have a pandas dataframe where one column is a bunch of strings with certain travel details. My goal is to parse each string to extract the city of origin and destination city (I would like to ultimately have two new columns titled 'origin' and 'destination'). The data: df_col = [ 'new york to venice, italy for usd271', 'return flights from brussels to bangkok with etihad from €407', 'from los angeles to guadalajara, mexico for usd191', 'fly to australia new zealand from paris from €422

Inverse Document Frequency Formula

Question: I'm having trouble with manually calculating the values for tf-idf. Python scikit keeps spitting out different values than I'd expect. I keep reading that idf(term) = log(# of docs / # of docs with term). If so, won't you get a divide by zero error if there are no docs with the term? To solve that problem, I read that you do log(# of docs / (# of docs with term + 1)). But then if the term is in every document, you get log(n / (n + 1)), which is negative, which doesn't really make sense to me. What am
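The three variants can be compared numerically. The sketch below uses only the standard library; the function names are hypothetical. The third formula, ln((1 + n) / (1 + df)) + 1, is the one scikit-learn documents for its tf-idf transformer when `smooth_idf=True` (the default), which is one likely reason hand-computed values differ. scikit-learn also L2-normalises each row by default, which this sketch does not reproduce.

```python
import math


def idf_textbook(n_docs, doc_freq):
    """log(n / df); undefined (division by zero) when doc_freq is 0."""
    return math.log(n_docs / doc_freq)


def idf_plus_one_denominator(n_docs, doc_freq):
    """log(n / (df + 1)); negative when the term is in every document."""
    return math.log(n_docs / (doc_freq + 1))


def idf_smoothed(n_docs, doc_freq):
    """ln((1 + n) / (1 + df)) + 1; never divides by zero, never below 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


n = 4
for df in (0, 1, 4):
    print(df, idf_plus_one_denominator(n, df), idf_smoothed(n, df))
```

With n = 4 and df = 4 the "+1 in the denominator" variant gives log(4/5), which is negative, while the smoothed variant gives exactly 1.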

Split Sentences at Bullets and Numbering?

Question: I am trying to input text into my word processor to be split into sentences first and then into words. An example paragraph: When the blow was repeated,together with an admonition in childish sentences, he turned over upon his back, and held his paws in a peculiar manner. 1) This a numbered sentence 2) This is the second numbered sentence At the same time with his ears and his eyes he offered a small prayer to the child. Below are the examples - This an example of bullet point sentence - This
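One possible approach, sketched here with the standard `re` module rather than NLTK, is to break the text before each list marker first and only then split on sentence-ending punctuation. The names `MARKER`, `SENT_END` and `split_sentences` are hypothetical, and the marker pattern assumes items start with a number plus ")" or with a hyphen surrounded by spaces.

```python
import re

# Whitespace that is immediately followed by a list marker such as "1) " or "- ".
MARKER = re.compile(r"\s+(?=(?:\d+\)|-)\s)")
# Whitespace that follows sentence-ending punctuation.
SENT_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split at list markers first, then at sentence-ending punctuation."""
    pieces = []
    for chunk in MARKER.split(text):
        pieces.extend(SENT_END.split(chunk))
    return [p.strip() for p in pieces if p.strip()]


sample = "He turned over. 1) First item 2) Second item - A bullet - Another bullet"
for sentence in split_sentences(sample):
    print(sentence)
```

The sketch cannot tell where an unpunctuated list item ends and ordinary prose resumes (as in "...second numbered sentence At the same time..."), and a spaced hyphen used as a dash in running text would also trigger a split.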