nltk

Split Sentences at Bullets and Numbering?

℡╲_俬逩灬. Submitted on 2020-06-09 03:42:08
Question: I am trying to input text into my word processor to be split into sentences first and then into words. An example paragraph: When the blow was repeated, together with an admonition in childish sentences, he turned over upon his back, and held his paws in a peculiar manner. 1) This a numbered sentence 2) This is the second numbered sentence At the same time with his ears and his eyes he offered a small prayer to the child. Below are the examples - This an example of bullet point sentence - This
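The excerpt is cut off before any answer. A minimal sketch of one way to approach it, assuming plain nltk plus a regex pre-split on "1)"-style numbering and "- " bullets (the marker patterns are assumptions taken from the example text, not from an accepted answer):

```python
import re
from nltk.tokenize import sent_tokenize, word_tokenize

def split_with_bullets(paragraph):
    """Split into sentences, additionally breaking at '1)'-style numbering
    and '- ' bullet markers, then tokenize each sentence into words."""
    # Pre-split where a numbered item or bullet begins, then let the normal
    # sentence splitter handle ordinary punctuation inside each chunk.
    chunks = re.split(r'\s+(?=\d+\)\s|- )', paragraph)
    sentences = []
    for chunk in chunks:
        if chunk:
            sentences.extend(sent_tokenize(chunk))
    return [word_tokenize(s) for s in sentences]

example = ("When the blow was repeated, together with an admonition in childish "
           "sentences, he turned over upon his back. 1) This a numbered sentence "
           "2) This is the second numbered sentence - This an example of bullet point sentence")
for tokens in split_with_bullets(example):
    print(tokens)
```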

PySpark RDD word count

心已入冬 Submitted on 2020-05-28 11:53:25
Question: I have a dataframe with text and category. I want to count the words which are common in these categories. I am using nltk to remove the stop words and to tokenize, but I am not able to include the category in the process. Below is my sample code of the problem. from pyspark import SparkConf, SparkContext from pyspark.sql import SparkSession,Row import nltk spark_conf = SparkConf()\ .setAppName("test") sc=SparkContext.getOrCreate(spark_conf) def wordTokenize(x): words = [word for line in x for
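The sample code in the excerpt is truncated. Since the data is already a DataFrame, one way to keep the category while counting words is to explode the tokens with Spark SQL functions rather than flattening to a plain RDD. A rough sketch (the sample rows and column names below are assumptions, not from the question):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from nltk.corpus import stopwords

spark = SparkSession.builder.appName("test").getOrCreate()

# Hypothetical rows with the two columns described in the question.
df = spark.createDataFrame(
    [("sports", "the team won the big game"),
     ("politics", "the house debated the big bill")],
    ["category", "text"])

sw = stopwords.words("english")

# Split each text into words, drop stop words, and count per (category, word).
word_counts = (df
    .withColumn("word", F.explode(F.split(F.lower(F.col("text")), r"\s+")))
    .filter(~F.col("word").isin(sw))
    .groupBy("category", "word")
    .count())

word_counts.show()
```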

NLTK was unable to find the java file! for Stanford POS Tagger

徘徊边缘 Submitted on 2020-05-26 05:06:19
Question: I have been stuck trying to get the Stanford POS Tagger to work for a while. From an old SO post I found the following (slightly modified) code: stanford_dir = 'C:/Users/.../stanford-postagger-2017-06-09/' from nltk.tag import StanfordPOSTagger #from nltk.tag.stanford import StanfordPOSTagger # I tried it both ways from nltk import word_tokenize # Add the jar and model via their path (instead of setting environment variables): jar = stanford_dir + 'stanford-postagger.jar' model = stanford_dir
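The "unable to find the java file" message usually refers to the Java executable, not the tagger jar. A sketch of the usual fix, pointing the JAVAHOME environment variable at a Java installation before constructing the tagger; the JDK path and the model filename below are placeholders/assumptions, and the elided stanford_dir path is kept as in the question:

```python
import os
from nltk.tag import StanfordPOSTagger
from nltk import word_tokenize

# Tell NLTK where Java lives; this JDK path is a placeholder, adjust to yours.
os.environ['JAVAHOME'] = r'C:\Program Files\Java\jdk1.8.0_121\bin'

stanford_dir = 'C:/Users/.../stanford-postagger-2017-06-09/'  # path elided as in the question
jar = stanford_dir + 'stanford-postagger.jar'
model = stanford_dir + 'models/english-bidirectional-distsim.tagger'  # assumed model file

tagger = StanfordPOSTagger(model, path_to_jar=jar)
print(tagger.tag(word_tokenize('This is a test sentence.')))
```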

Using Pyinstaller with NLTK results in error: can't find nltk_data

≡放荡痞女 Submitted on 2020-05-14 03:46:13
Question: I am attempting to export a simple GUI that uses NLTK as an exe with Python 3.6 and Windows 10. When I run PyInstaller to freeze my simple program as an exe, I get the error: Unable to find "c:\users\usr\nltk_data" when adding binary and data files. Even when I copied the nltk_data folder there, I get an error about a different nltk.data.path path, "c:\users\usr\appdata\local\programs\python\python36\nltk_data". import tkinter as tk from nltk.corpus import stopwords sw = stopwords.words('english'
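The excerpt stops before any solution. The usual pattern is to bundle nltk_data into the build and then tell NLTK where to look at runtime, since a frozen exe no longer has the interpreter's default search paths. A sketch under those assumptions:

```python
import os
import sys
import nltk

# In a PyInstaller build, bundled data files live under sys._MEIPASS (one-file
# mode) or next to the executable (one-dir mode), so append that location to
# NLTK's search path before importing any corpus.
if getattr(sys, 'frozen', False):
    bundle_dir = getattr(sys, '_MEIPASS', os.path.dirname(sys.executable))
    nltk.data.path.append(os.path.join(bundle_dir, 'nltk_data'))

import tkinter as tk
from nltk.corpus import stopwords

sw = stopwords.words('english')
```

The matching build step would be something like `pyinstaller --add-data "C:\users\usr\nltk_data;nltk_data" app.py`, with the bundled folder name matching the one appended above (on Windows the --add-data separator is a semicolon).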

wordnet lemmatizer in NLTK is not working for adverbs [duplicate]

牧云@^-^@ Submitted on 2020-05-13 14:42:06
Question: This question already has answers here: Getting adjective from an adverb in nltk or other NLP library (2 answers). Closed 5 years ago. from nltk.stem import WordNetLemmatizer x = WordNetLemmatizer() x.lemmatize("angrily", pos='r') Out[41]: 'angrily' Here is the reference documentation for POS tags in NLTK WordNet: http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html I may be missing some basic things. Please let me know. Answer 1: Try: >>> from nltk.corpus import wordnet as wn >>> wn.synset(
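The answer is cut off at wn.synset(, but the approach it is starting looks like the WordNet pertainym lookup: the lemmatizer only strips inflection, so 'angrily' is already its own lemma, and the related adjective has to be reached through the adverb lemma's pertainym pointer. A sketch of that idea (the exact synset name and the printed result are assumptions):

```python
from nltk.corpus import wordnet as wn

# The lemmatizer only removes inflection, so 'angrily' comes back unchanged.
# To reach the adjective it derives from, follow the adverb lemma's pertainyms.
lemma = wn.synset('angrily.r.01').lemmas()[0]
pertainyms = lemma.pertainyms()
print(pertainyms)                                   # expected: [Lemma('angry.a.01.angry')]
print(pertainyms[0].name() if pertainyms else 'angrily')
```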

How to remove dates from a list in Python

痞子三分冷 Submitted on 2020-05-12 20:35:59
Question: I have a list of tokenized text (list_of_words) that looks something like this: list_of_words = ['08/20/2014', '10:04:27', 'pm', 'complet', 'vendor', 'per', 'mfg/recommend', '08/20/2014', '10:04:27', 'pm', 'complet', ...] and I'm trying to strip out all the instances of dates and times from this list. I've tried using the .remove() function, to no avail. I've tried passing wildcard characters, such as '../../....', to a list of stopwords I was sorting with, but that didn't work. I finally
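The excerpt ends mid-sentence, but since the unwanted tokens all share a fixed shape, one straightforward option is to filter with a date/time regex instead of .remove(). The patterns below only cover the MM/DD/YYYY and HH:MM:SS forms shown in the sample:

```python
import re

list_of_words = ['08/20/2014', '10:04:27', 'pm', 'complet', 'vendor', 'per',
                 'mfg/recommend', '08/20/2014', '10:04:27', 'pm', 'complet']

# list.remove() deletes one exact element per call, so a filtering
# comprehension with a date/time regex is a better fit here.
date_or_time = re.compile(r'^\d{2}/\d{2}/\d{4}$|^\d{2}:\d{2}:\d{2}$')
cleaned = [w for w in list_of_words if not date_or_time.match(w)]
print(cleaned)
# ['pm', 'complet', 'vendor', 'per', 'mfg/recommend', 'pm', 'complet']
```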

Negation handling in NLP

我只是一个虾纸丫 Submitted on 2020-05-10 03:26:50
Question: I'm currently working on a project where I want to extract emotion from text. As I'm using conceptnet5 (a semantic network), however, I can't simply prefix words in a sentence that contains a negation word, as those words would simply not show up in conceptnet5's API. Here's an example: The movie wasn't that good. Hence, I figured that I could use WordNet's lemma functionality to replace adjectives in sentences that contain negation words like (not, ...). In the previous example, the
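The question is truncated before its concrete ask, but the idea it describes, rewriting a negated adjective so the negation word can be dropped before querying conceptnet5, can be sketched with NLTK's POS tagger and WordNet antonyms. This is only a rough illustration of that idea, not a complete negation-scope handler:

```python
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet as wn

NEGATIONS = {"not", "n't", "never", "no"}

def replace_negated_adjectives(sentence):
    """If a negation word precedes an adjective, drop the negation and swap
    the adjective for its first WordNet antonym (very rough heuristic)."""
    out, negate = [], False
    for word, tag in pos_tag(word_tokenize(sentence)):
        if word.lower() in NEGATIONS:
            negate = True          # remember the negation, drop the word itself
            continue
        if negate and tag.startswith('JJ'):
            antonyms = [ant for syn in wn.synsets(word, pos=wn.ADJ)
                            for lem in syn.lemmas()
                            for ant in lem.antonyms()]
            if antonyms:
                word = antonyms[0].name()
            negate = False
        out.append(word)
    return ' '.join(out)

print(replace_negated_adjectives("The movie wasn't that good."))
```

On the example sentence this should produce roughly "The movie was that bad .", which no longer contains a negation word when queried against conceptnet5.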

How to get a parse tree using Python NLTK?

一世执手 Submitted on 2020-05-09 18:36:27
Question: Given the following sentence: The old oak tree from India fell down. How can I get the following parse tree representation of the sentence using Python NLTK? (ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down))))) I need a complete example, which I couldn't find on the web! Edit: I have gone through this book chapter to learn about parsing using NLTK, but the problem is, I need a grammar to parse sentences or phrases which I do not
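NLTK itself does not ship a broad-coverage constituency grammar, which is why the book chapter's toy grammars do not help here; the usual route is to let NLTK call an external parser such as Stanford CoreNLP. A sketch assuming a CoreNLP server is already running on localhost:9000:

```python
# Assumes a Stanford CoreNLP server has been started locally, e.g. with:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
from nltk.parse.corenlp import CoreNLPParser

parser = CoreNLPParser(url='http://localhost:9000')
tree = next(parser.raw_parse('The old oak tree from India fell down.'))
print(tree)          # bracketed (ROOT (S ...)) string, as in the question
tree.pretty_print()  # ASCII drawing of the same tree
```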

Python NLP Text Tokenization based on custom regex

时光毁灭记忆、已成空白 Submitted on 2020-05-09 16:02:28
Question: I am processing a large amount of text for custom Named Entity Recognition (NER) using spaCy. For text pre-processing I am using nltk for tokenization, etc. I am able to process one of my custom entities, which is based on simple strings. But the other custom entity is a combination of a number and certain text ("20 BBLs", for example). The word_tokenize method from nltk.tokenize tokenizes '20' and 'BBLs' each as a separate token. What I want is to treat them (the number and the 'BBLs'
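The excerpt is cut off, but for the part it does describe, keeping "20 BBLs" as one token, a custom pattern with nltk's RegexpTokenizer (or an equivalent tokenizer exception on the spaCy side) is one option. The sample sentence below is made up for illustration:

```python
from nltk.tokenize import RegexpTokenizer

# The '<number> BBLs' branch is tried before the generic \w+ branch, so the
# quantity and its unit stay together as one token.  'BBLs' is just the unit
# from the question; add more units to the alternation as needed.
tokenizer = RegexpTokenizer(r'\d+\s*BBLs|\w+|[^\w\s]')
print(tokenizer.tokenize("Delivered 20 BBLs to the site on 08/20/2014."))
# ['Delivered', '20 BBLs', 'to', 'the', 'site', 'on', '08', '/', '20', '/', '2014', '.']
```

Because the alternation is evaluated left to right, the unit pattern has to come before the generic word branch, otherwise \w+ would consume the digits first.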