nltk

Connecting to a database from Python

僤鯓⒐⒋嵵緔 submitted on 2020-01-03 11:12:51
Page 31:

from bs4 import BeautifulSoup
from collections import Counter
from nltk.corpus import stopwords
from nltk import LancasterStemmer
import nltk              # needed for nltk.word_tokenize below
import urllib.request

URL = input("Enter a website")
with urllib.request.urlopen(URL) as infile:
    soup = BeautifulSoup(infile)
words = nltk.word_tokenize(soup.text)
text = [w.lower() for w in words]
words = [LancasterStemmer().stem(w) for w in text
         if w not in stopwords.words("english") and w.isalnum()]
freqs = Counter(words)
print(freqs.most_common(10))

Page 139:

import nltk, pymysql
conn = pymysql.
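The excerpt breaks off right at the pymysql connection. For reference, a minimal sketch of how such a connection is typically opened and used; the host, user, password, and database below are placeholders, not values from the excerpt:

import pymysql

# Placeholder connection details; substitute your own server and credentials.
conn = pymysql.connect(host='localhost', user='user',
                       password='secret', database='test')
try:
    with conn.cursor() as cur:
        cur.execute("SELECT VERSION()")   # simple query to confirm the connection works
        print(cur.fetchone())
finally:
    conn.close()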

Code for counting the number of sentences, words, and characters in an input file

删除回忆录丶 submitted on 2020-01-03 05:46:08
Question: I have written the following code to count the number of sentences, words, and characters in the input file sample.txt, which contains a paragraph of text. It works fine for the number of sentences and words, but does not give the precise and correct number of characters (without whitespace and punctuation marks).

lines, blanklines, sentences, words = 0, 0, 0, 0
num_chars = 0
print '-'*50
try:
    filename = 'sample.txt'
    textf = open(filename, 'r')
except IOError:
    print 'cannot open file %s for
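A minimal sketch of one way to get a character count that excludes whitespace and punctuation (written for Python 3, unlike the Python 2 snippet above, and assuming NLTK's punkt data is installed):

import string
from nltk import sent_tokenize, word_tokenize

with open('sample.txt', 'r') as f:
    text = f.read()

num_sentences = len(sent_tokenize(text))
num_words = len(word_tokenize(text))
# Count only characters that are neither whitespace nor ASCII punctuation.
num_chars = sum(1 for ch in text
                if not ch.isspace() and ch not in string.punctuation)

print(num_sentences, num_words, num_chars)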

Unable to use Pandas and NLTK to train Naive Bayes (machine learning) in Python

生来就可爱ヽ(ⅴ<●) submitted on 2020-01-03 04:46:07
Question: Here is what I am trying to do. I have a CSV file where column 1 holds people's names (e.g. "Michael Jordan", "Anderson Silva", "Muhammad Ali") and column 2 holds their ethnicity (e.g. English, French, Chinese). In my code, I create a pandas data frame from all the data, then create additional data frames: one with only Chinese names and another with only non-Chinese names. I then create separate lists. The three_split function extracts the features of each name by splitting them
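For context, a rough sketch of how name features from a CSV can be fed to NLTK's Naive Bayes classifier via pandas. The file name, column names, and the feature function below are assumptions standing in for the question's own three_split function, which is not shown in full:

import nltk
import pandas as pd

# Hypothetical CSV layout: column 1 = name, column 2 = ethnicity.
df = pd.read_csv('names.csv', names=['name', 'ethnicity'])

def name_features(name):
    # Stand-in for the question's three_split feature extractor:
    # use simple prefix/suffix substrings of the full name.
    name = name.lower().replace(' ', '')
    return {'first3': name[:3], 'last3': name[-3:]}

labeled = [(name_features(n), label)
           for n, label in zip(df['name'], df['ethnicity'])]
train_set, test_set = labeled[100:], labeled[:100]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)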

Python: problems with sentence segmenter, word tokenizer, and part-of-speech tagger

两盒软妹~` submitted on 2020-01-03 04:45:16
Question: I am trying to read a text file into Python and then run a sentence segmenter, word tokenizer, and part-of-speech tagger on it. This is my code:

file = open('C:/temp/1.txt', 'r')
sentences = nltk.sent_tokenize(file)
sentences = [nltk.word_tokenize(sent) for sent in sentences]
sentences = [nltk.pos_tag(sent) for sent in sentences]

When I try just the second command, it displays an error:

Traceback (most recent call last):
  File "<pyshell#26>", line 1, in <module>
    sentences = nltk.sent_tokenize(file)
  File "D:
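The traceback is cut off, but the usual cause here is that nltk.sent_tokenize expects a string, not a file object. A sketch of the fix, assuming the same file path and that the punkt and tagger models are downloaded:

import nltk

# Read the file contents into a string before tokenizing.
with open('C:/temp/1.txt', 'r') as f:
    text = f.read()

sentences = nltk.sent_tokenize(text)
tokenized = [nltk.word_tokenize(sent) for sent in sentences]
tagged = [nltk.pos_tag(sent) for sent in tokenized]
print(tagged[:1])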

How to lemmatize a list of sentences

[亡魂溺海] submitted on 2020-01-03 02:40:07
Question: How can I lemmatize a list of sentences in Python?

from nltk.stem.wordnet import WordNetLemmatizer
a = ['i like cars', 'cats are the best']
lmtzr = WordNetLemmatizer()
lemmatized = [lmtzr.lemmatize(word) for word in a]
print(lemmatized)

This is what I've tried, but it gives me back the same sentences. Do I need to tokenize the words first for it to work properly?

Answer 1: TL;DR: pip3 install -U pywsd Then:

>>> from pywsd.utils import lemmatize_sentence
>>> text = 'i like cars'
>>> lemmatize_sentence(text)
[
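As an alternative to pywsd, a sketch using only NLTK (assuming the punkt, tagger, and WordNet data are downloaded): tokenize and POS-tag each sentence first, so the lemmatizer knows which part of speech to use.

from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer

def to_wordnet_pos(tag):
    # Map Penn Treebank tags to the WordNet POS constants the lemmatizer expects.
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lmtzr = WordNetLemmatizer()
sentences = ['i like cars', 'cats are the best']
lemmatized = [[lmtzr.lemmatize(w, to_wordnet_pos(t))
               for w, t in pos_tag(word_tokenize(s))]
              for s in sentences]
print(lemmatized)   # e.g. [['i', 'like', 'car'], ['cat', 'be', 'the', 'best']]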

Using NLTK tokenizer with utf8 [duplicate]

巧了我就是萌 submitted on 2020-01-02 19:30:29
Question: This question already has answers here: Reading a UTF8 CSV file with Python (9 answers). Closed 3 years ago. I am a fairly new user of Python and I work mainly with imported text files, especially CSVs, which give me headaches to process. I tried to read docs like this one: https://docs.python.org/2/howto/unicode.html but I can't make much sense of what is being said. I just want a straight, down-to-earth explanation. For instance, I want to tokenize a large number of verbatims
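A short sketch of one straightforward approach in Python 3, where opening the file with an explicit encoding avoids most of the Unicode headaches (the file name and column index are assumptions):

import csv
import nltk

# Open the CSV as UTF-8 so the tokenizer receives properly decoded strings.
with open('verbatims.csv', encoding='utf-8', newline='') as f:
    reader = csv.reader(f)
    tokens = [nltk.word_tokenize(row[0]) for row in reader]

print(tokens[:3])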

Unable to install nltk using pip

☆樱花仙子☆ submitted on 2020-01-02 04:42:09
Question: Hi, I am unable to install nltk. I have already installed Python.

C:\Users>pip install nltk
Downloading/unpacking nltk
Cannot fetch index base URL https://pypi.python.org/simple/
Could not find any downloads that satisfy the requirement nltk
Cleaning up...
No distributions at all found for nltk
Storing debug log for failure in C:\Users\pinnapav\pip\pip.log

Answer 1: Try the command py -m pip install --upgrade nltk! This worked on my computer, with the same basic Python installation. Now you
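Once the install succeeds, a quick way to confirm that the package is importable and that corpus downloads work (the punkt model here is just an example):

import nltk

print(nltk.__version__)    # confirms nltk is importable
nltk.download('punkt')     # fetches a tokenizer model to confirm downloads work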

nltk function to count occurrences of certain words

牧云@^-^@ submitted on 2020-01-02 03:45:08
Question: In the NLTK book there is the exercise "Read in the texts of the State of the Union addresses, using the state_union corpus reader. Count occurrences of men, women, and people in each document. What has happened to the usage of these words over time?" I thought I could use a function like state_union('1945-Truman.txt').count('men'). However, there are over 60 texts in this State of the Union corpus and I feel like there has to be an easier way to see the count of these words for each one instead of
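One common approach, in the style of the NLTK book, is a conditional frequency distribution keyed by word and year (assuming the state_union corpus has been downloaded):

import nltk
from nltk.corpus import state_union

targets = ['men', 'women', 'people']
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])                 # condition = word, sample = year of the address
    for fileid in state_union.fileids()
    for w in state_union.words(fileid)
    for target in targets
    if w.lower() == target)

cfd.tabulate()    # or cfd.plot() to see how usage changes over time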

How do I use NLTK's default tokenizer to get spans instead of strings?

北城以北 submitted on 2020-01-02 00:56:09
Question: NLTK's default tokenizer, nltk.word_tokenize, chains two tokenizers: a sentence tokenizer and then a word tokenizer that operates on sentences. It does a pretty good job out of the box.

>>> nltk.word_tokenize("(Dr. Edwards is my friend.)")
['(', 'Dr.', 'Edwards', 'is', 'my', 'friend', '.', ')']

I'd like to use this same algorithm, except have it return tuples of offsets into the original string instead of string tokens. By offset I mean 2-tuples that can serve as indexes into the original
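A sketch of one way to get offsets, assuming an NLTK version in which TreebankWordTokenizer provides span_tokenize (this approximates, but may not exactly match, what word_tokenize does):

from nltk.tokenize import PunktSentenceTokenizer, TreebankWordTokenizer

text = "(Dr. Edwards is my friend.)"
sent_tokenizer = PunktSentenceTokenizer()
word_tokenizer = TreebankWordTokenizer()

spans = []
for sent_start, sent_end in sent_tokenizer.span_tokenize(text):
    sentence = text[sent_start:sent_end]
    for start, end in word_tokenizer.span_tokenize(sentence):
        # Shift word offsets so they index into the original string.
        spans.append((sent_start + start, sent_start + end))

print(spans)
print([text[s:e] for s, e in spans])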

Storing conditional frequency distribution using NLTK

二次信任 submitted on 2020-01-01 22:12:11
Question: I'm writing a script for text prediction using NLTK's conditional frequency distribution. I want to store the distribution in an SQL database for later use, serialized as JSON. Is that even possible? If yes, how do I dump the ConditionalFreqDist to JSON? Or maybe there is some other nifty way of storing it?

cfd = ConditionalFreqDist()
prev_words = None
cnt = 0
for word in words:
    if cnt > 1:
        prev_words = words[cnt-2] + ' ' + words[cnt-1]
        cfd[prev_words].inc(word)
    cnt += 1

Answer 1: You could use
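The answer text breaks off here; a sketch of one possible approach, assuming NLTK 3, where a FreqDist behaves like a Counter (the question's .inc() call is the older NLTK 2 interface):

import json
from nltk.probability import ConditionalFreqDist

def cfd_to_json(cfd):
    # A ConditionalFreqDist is essentially a dict of FreqDists,
    # so a nested plain dict serializes cleanly to JSON.
    return json.dumps({cond: dict(cfd[cond]) for cond in cfd.conditions()})

def cfd_from_json(serialized):
    # Rebuild the distribution from the nested dict.
    cfd = ConditionalFreqDist()
    for cond, counts in json.loads(serialized).items():
        for word, n in counts.items():
            cfd[cond][word] += n
    return cfd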