NLP

real word count in NLTK

心不动则不痛 submitted on 2019-12-18 06:49:21
Question: The NLTK book has a couple of examples of word counts, but in reality they are not word counts but token counts. For instance, Chapter 1, "Counting Vocabulary", says that the following gives a word count:

    text = nltk.Text(tokens)
    len(text)

However, it doesn't: it gives a combined word and punctuation count. How can you get a real word count (ignoring punctuation)? Similarly, how can you get the average number of characters in a word? The obvious answer is:

    word_average_length = len(string_of_text) / len(text)
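One way to do this (a minimal sketch, assuming Python 3 and that NLTK's punkt tokenizer data is installed; the sample sentence is only an illustration) is to keep just the tokens that contain at least one alphabetic character, so punctuation-only tokens are dropped:

    import nltk

    sentence = "The NLTK book has examples, but they count tokens, not words!"
    tokens = nltk.word_tokenize(sentence)

    # Keep only tokens containing at least one letter; pure punctuation is dropped.
    words = [t for t in tokens if any(c.isalpha() for c in t)]

    word_count = len(words)
    average_word_length = sum(len(w) for w in words) / word_count
    print(word_count, average_word_length)

Filtering before counting also keeps the denominator of the average honest, since commas and periods no longer dilute it.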

Finding Tense of A sentence using stanford nlp

半腔热情 submitted on 2019-12-18 05:54:00
Question: Q1. I am trying to get the tense of a complete sentence, but I don't know how to do it using NLP. Any help is appreciated. Q2. What information can be extracted from a sentence using NLP? Currently I can get: 1. the voice of the sentence, 2. subject/object/verb, 3. POS tags. If any more information can be extracted, please let me know.

Answer 1: The Penn Treebank defines VBD and VBN as the past tense and the past participle of a verb, respectively. In many sentences, simply getting the POS tags and checking for the presence of these tags is enough to identify past tense.
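A minimal sketch of that idea, using NLTK's pos_tag rather than the Stanford tools (it emits the same Penn Treebank tag set; the helper name is my own, and the tagger's model data must be downloaded):

    import nltk

    def is_past_tense(sentence):
        # Tag the sentence and look for VBD (past tense) or VBN (past participle).
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        return any(tag in ("VBD", "VBN") for _, tag in tagged)

    print(is_past_tense("She walked to the store."))  # True
    print(is_past_tense("She walks to the store."))   # False

This is a heuristic: sentences mixing tenses, or using VBN in passive constructions, need more careful handling than a single boolean.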

Training Tagger with Custom Tags in NLTK

无人久伴 submitted on 2019-12-18 05:03:16
Question: I have a document with tagged data in the format: Hi here's my [KEYWORD phone number], let me know when you wanna hangout: [PHONE 7802708523]. I live in a [PROP_TYPE condo] in [CITY New York]. I want to train a model on a set of tagged documents of this type, and then use the model to tag new documents. Is this possible in NLTK? I have looked at chunking and the NLTK-Trainer scripts, but these have a restricted set of tags and corpora, while my dataset has custom tags.

Answer 1: As
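The answer above is cut off, so as a sketch of my own (not necessarily the answer's approach; the helper name and regex are illustrative), one possible first step is converting the bracketed format into token-level (word, tag) pairs, the shape NLTK's trainable taggers and chunkers expect:

    import re

    # Convert "[TAG some words]" annotations into (word, tag) pairs,
    # with 'O' for words outside any bracketed span.
    def parse_annotated(text):
        pairs = []
        pattern = re.compile(r"\[(\w+) ([^\]]+)\]|(\S+)")
        for m in pattern.finditer(text):
            if m.group(1):  # bracketed span: tag every word inside it
                for word in m.group(2).split():
                    pairs.append((word, m.group(1)))
            else:
                pairs.append((m.group(3), "O"))
        return pairs

    doc = "I live in a [PROP_TYPE condo] in [CITY New York] ."
    print(parse_annotated(doc))
    # [('I', 'O'), ('live', 'O'), ('in', 'O'), ('a', 'O'),
    #  ('condo', 'PROP_TYPE'), ('in', 'O'),
    #  ('New', 'CITY'), ('York', 'CITY'), ('.', 'O')]

From pairs like these you can build IOB-style training data for any custom tag set, since nothing in NLTK's tagger-training interfaces restricts which tag strings you use.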

How to extract nouns using NLTK pos_tag()?

本小妞迷上赌 submitted on 2019-12-18 04:21:29
Question: I am fairly new to Python and am not able to figure out the bug. I want to extract nouns using NLTK. I have written the following code:

    import nltk
    sentence = "At eight o'clock on Thursday film morning word line test best beautiful Ram Aaron design"
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    length = len(tagged) - 1
    a = list()
    for i in (0, length):
        log = (tagged[i][1][0] == 'N')
        if log == True:
            a.append(tagged[i][0])

When I run this, 'a' only has one element: ['detail'
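The bug is the loop header: for i in (0, length) iterates over the two-element tuple (0, length), so only the first and last tokens are ever checked; range(length + 1) was intended. A minimal sketch of a fix (assuming the standard NLTK tokenizer and tagger data are installed), using a list comprehension over every tagged token instead of index arithmetic:

    import nltk

    sentence = ("At eight o'clock on Thursday film morning word line test "
                "best beautiful Ram Aaron design")
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

    # Visit every (word, tag) pair and keep the words whose tag starts with 'N'.
    nouns = [word for word, tag in tagged if tag.startswith('N')]
    print(nouns)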

How to avoid NLTK's sentence tokenizer splitting on abbreviations?

[亡魂溺海] submitted on 2019-12-18 04:01:50
Question: I'm currently using NLTK for language processing, but I have encountered a problem with sentence tokenizing. Here's the problem: assume I have the sentence "Fig. 2 shows a U.S.A. map." When I use the punkt tokenizer, my code looks like this:

    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
    punkt_param = PunktParameters()
    abbreviation = ['U.S.A', 'fig']
    punkt_param.abbrev_types = set(abbreviation)
    tokenizer = PunktSentenceTokenizer(punkt_param)
    tokenizer.tokenize('Fig. 2 shows a U.S.A. map.')
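A sketch of what, as far as I can tell, makes the abbreviation list take effect: Punkt stores abbreviations lowercased and without the trailing period, so 'U.S.A' in the set above never matches anything.

    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

    punkt_param = PunktParameters()
    # Punkt matches abbreviations in lowercase, with the final period stripped.
    punkt_param.abbrev_types = set(['u.s.a', 'fig'])
    tokenizer = PunktSentenceTokenizer(punkt_param)
    print(tokenizer.tokenize('Fig. 2 shows a U.S.A. map.'))
    # Expected: ['Fig. 2 shows a U.S.A. map.']  (one sentence, no bad splits)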

Stanford Dependency Parser Setup and NLTK

烈酒焚心 submitted on 2019-12-18 03:43:29
Question: So I got the "standard" Stanford Parser to work thanks to danger89's answer to a previous post, Stanford Parser and NLTK. However, I am now trying to get the dependency parser to work, and it seems the method highlighted in that link no longer works. Here is my code:

    import nltk
    import os
    java_path = "C:\\Program Files\\Java\\jre1.8.0_51\\bin\\java.exe"
    os.environ['JAVAHOME'] = java_path
    from nltk.parse import stanford
    os.environ['STANFORD_PARSER'] = 'path/jar'
    os.environ['STANFORD
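For reference, a hedged sketch of the dependency-parser setup that followed the same environment-variable pattern under NLTK 3.x (the jar and model paths are placeholders for a local download, and this interface has since been deprecated in favor of nltk.parse.corenlp):

    import os
    from nltk.parse.stanford import StanfordDependencyParser

    # Placeholder paths: point these at your local Stanford parser jars.
    os.environ['STANFORD_PARSER'] = 'path/to/stanford-parser.jar'
    os.environ['STANFORD_MODELS'] = 'path/to/stanford-parser-models.jar'

    dep_parser = StanfordDependencyParser(
        model_path='edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz')

    # raw_parse yields DependencyGraph objects; triples() gives
    # (governor, relation, dependent) tuples.
    parse = next(dep_parser.raw_parse('I shot an elephant in my sleep'))
    print(list(parse.triples()))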

nltk StanfordNERTagger : NoClassDefFoundError: org/slf4j/LoggerFactory (In Windows)

為{幸葍}努か submitted on 2019-12-18 02:47:47
Question: NOTE: I am using Python 2.7 as part of the Anaconda distribution. I hope this is not a problem for nltk 3.1. I am trying to use nltk for NER as follows:

    import nltk
    from nltk.tag.stanford import StanfordNERTagger
    #st = StanfordNERTagger('stanford-ner/all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
    st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
    print st.tag(str)

but I get:

    Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/LoggerFactory at edu.stanford.nlp
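The missing class lives in slf4j, which newer Stanford NER releases depend on, so one common workaround is to put an slf4j jar on the Java classpath alongside stanford-ner.jar; NLTK consults the CLASSPATH environment variable when locating jars. A sketch assuming hypothetical Windows paths (adjust to your own install):

    import os
    from nltk.tag.stanford import StanfordNERTagger

    # Hypothetical paths: the key point is that an slf4j jar sits on the
    # classpath next to stanford-ner.jar, which resolves NoClassDefFoundError.
    os.environ['CLASSPATH'] = (
        'C:\\stanford-ner\\stanford-ner.jar;'
        'C:\\stanford-ner\\lib\\slf4j-api.jar'
    )
    st = StanfordNERTagger(
        'C:\\stanford-ner\\classifiers\\english.all.3class.distsim.crf.ser.gz')
    print(st.tag('Rami Eid is studying at Stony Brook University'.split()))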