nltk

Split Sentences at Bullets and Numbering?

℡╲_俬逩灬. Submitted on 2020-06-09 03:42:08
Question: I am trying to input text into my word processor to be split into sentences first and then into words. An example paragraph: When the blow was repeated, together with an admonition in childish sentences, he turned over upon his back, and held his paws in a peculiar manner. 1) This a numbered sentence 2) This is the second numbered sentence At the same time with his ears and his eyes he offered a small prayer to the child. Below are the examples - This an example of bullet point sentence - This
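The excerpt is cut off before any answer. A minimal sketch of one way to approach it, assuming plain nltk plus a regex pre-split on "1)"-style numbering and "- " bullets (the marker patterns are assumptions taken from the example text, not from an accepted answer):

```python
import re
from nltk.tokenize import sent_tokenize, word_tokenize

def split_with_bullets(paragraph):
    """Split into sentences, additionally breaking at '1)'-style numbering
    and '- ' bullet markers, then tokenize each sentence into words."""
    # Pre-split where a numbered item or bullet begins, then let the normal
    # sentence splitter handle ordinary punctuation inside each chunk.
    chunks = re.split(r'\s+(?=\d+\)\s|- )', paragraph)
    sentences = []
    for chunk in chunks:
        if chunk:
            sentences.extend(sent_tokenize(chunk))
    return [word_tokenize(s) for s in sentences]

example = ("When the blow was repeated, together with an admonition in childish "
           "sentences, he turned over upon his back. 1) This a numbered sentence "
           "2) This is the second numbered sentence - This an example of bullet point sentence")
for tokens in split_with_bullets(example):
    print(tokens)
```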

PySpark RDD word count

心已入冬 Submitted on 2020-05-28 11:53:25
Question: I have a dataframe with text and category. I want to count the words which are common in these categories. I am using nltk to remove the stop words and to tokenize, but I am not able to include the category in the process. Below is my sample code of the problem. from pyspark import SparkConf, SparkContext from pyspark.sql import SparkSession,Row import nltk spark_conf = SparkConf()\ .setAppName("test") sc=SparkContext.getOrCreate(spark_conf) def wordTokenize(x): words = [word for line in x for
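The sample code in the excerpt is truncated. Since the data is already a DataFrame, one way to keep the category while counting words is to explode the tokens with Spark SQL functions rather than flattening to a plain RDD. A rough sketch (the sample rows and column names below are assumptions, not from the question):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from nltk.corpus import stopwords

spark = SparkSession.builder.appName("test").getOrCreate()

# Hypothetical rows with the two columns described in the question.
df = spark.createDataFrame(
    [("sports", "the team won the big game"),
     ("politics", "the house debated the big bill")],
    ["category", "text"])

sw = stopwords.words("english")

# Split each text into words, drop stop words, and count per (category, word).
word_counts = (df
    .withColumn("word", F.explode(F.split(F.lower(F.col("text")), r"\s+")))
    .filter(~F.col("word").isin(sw))
    .groupBy("category", "word")
    .count())

word_counts.show()
```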

NLTK was unable to find the java file! for Stanford POS Tagger

徘徊边缘 Submitted on 2020-05-26 05:06:19
Question: I have been stuck trying to get the Stanford POS Tagger to work for a while. From an old SO post I found the following (slightly modified) code: stanford_dir = 'C:/Users/.../stanford-postagger-2017-06-09/' from nltk.tag import StanfordPOSTagger #from nltk.tag.stanford import StanfordPOSTagger # I tried it both ways from nltk import word_tokenize # Add the jar and model via their path (instead of setting environment variables): jar = stanford_dir + 'stanford-postagger.jar' model = stanford_dir
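The "unable to find the java file" message usually refers to the Java executable, not the tagger jar. A sketch of the usual fix, pointing the JAVAHOME environment variable at a Java installation before constructing the tagger; the JDK path and the model filename below are placeholders/assumptions, and the elided stanford_dir path is kept as in the question:

```python
import os
from nltk.tag import StanfordPOSTagger
from nltk import word_tokenize

# Tell NLTK where Java lives; this JDK path is a placeholder, adjust to yours.
os.environ['JAVAHOME'] = r'C:\Program Files\Java\jdk1.8.0_121\bin'

stanford_dir = 'C:/Users/.../stanford-postagger-2017-06-09/'  # path elided as in the question
jar = stanford_dir + 'stanford-postagger.jar'
model = stanford_dir + 'models/english-bidirectional-distsim.tagger'  # assumed model file

tagger = StanfordPOSTagger(model, path_to_jar=jar)
print(tagger.tag(word_tokenize('This is a test sentence.')))
```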

Using Pyinstaller with NLTK results in error: can't find nltk_data

≡放荡痞女 Submitted on 2020-05-14 03:46:13
Question: I am attempting to export a simple GUI that uses NLTK as an exe with Python 3.6 and Windows 10. When I run PyInstaller to freeze my simple program as an exe, I get the error: Unable to find "c:\users\usr\nltk_data" when adding binary and data files. Even when I copied the nltk_data folder there, I get an error about a different nltk.data.path path, "c:\users\usr\appdata\local\programs\python\python36\nltk_data". import tkinter as tk from nltk.corpus import stopwords sw = stopwords.words('english'
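The excerpt stops before any solution. The usual pattern is to bundle nltk_data into the build and then tell NLTK where to look at runtime, since a frozen exe no longer has the interpreter's default search paths. A sketch under those assumptions:

```python
import os
import sys
import nltk

# In a PyInstaller build, bundled data files live under sys._MEIPASS (one-file
# mode) or next to the executable (one-dir mode), so append that location to
# NLTK's search path before importing any corpus.
if getattr(sys, 'frozen', False):
    bundle_dir = getattr(sys, '_MEIPASS', os.path.dirname(sys.executable))
    nltk.data.path.append(os.path.join(bundle_dir, 'nltk_data'))

import tkinter as tk
from nltk.corpus import stopwords

sw = stopwords.words('english')
```

The matching build step would be something like `pyinstaller --add-data "C:\users\usr\nltk_data;nltk_data" app.py`, with the bundled folder name matching the one appended above (on Windows the --add-data separator is a semicolon).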

wordnet lemmatizer in NLTK is not working for adverbs [duplicate]

牧云@^-^@ Submitted on 2020-05-13 14:42:06
Question: This question already has answers here: Getting adjective from an adverb in nltk or other NLP library (2 answers). Closed 5 years ago. from nltk.stem import WordNetLemmatizer x = WordNetLemmatizer() x.lemmatize("angrily", pos='r') Out[41]: 'angrily' Here is the reference documentation for POS tags in NLTK WordNet: http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html I may be missing some basic things. Please let me know. Answer 1: Try: >>> from nltk.corpus import wordnet as wn >>> wn.synset(
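The answer is cut off at wn.synset(, but the approach it is starting looks like the WordNet pertainym lookup: the lemmatizer only strips inflection, so 'angrily' is already its own lemma, and the related adjective has to be reached through the adverb lemma's pertainym pointer. A sketch of that idea (the exact synset name and the printed result are assumptions):

```python
from nltk.corpus import wordnet as wn

# The lemmatizer only removes inflection, so 'angrily' comes back unchanged.
# To reach the adjective it derives from, follow the adverb lemma's pertainyms.
lemma = wn.synset('angrily.r.01').lemmas()[0]
pertainyms = lemma.pertainyms()
print(pertainyms)                                   # expected: [Lemma('angry.a.01.angry')]
print(pertainyms[0].name() if pertainyms else 'angrily')
```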

How to remove dates from a list in Python

痞子三分冷 Submitted on 2020-05-12 20:35:59
Question: I have a list of tokenized text (list_of_words) that looks something like this: list_of_words = ['08/20/2014', '10:04:27', 'pm', 'complet', 'vendor', 'per', 'mfg/recommend', '08/20/2014', '10:04:27', 'pm', 'complet', ...] and I'm trying to strip out all the instances of dates and times from this list. I've tried using the .remove() function, to no avail. I've tried passing wildcard characters, such as '../../....', to a list of stopwords I was sorting with, but that didn't work. I finally
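The excerpt ends mid-sentence, but since the unwanted tokens all share a fixed shape, one straightforward option is to filter with a date/time regex instead of .remove(). The patterns below only cover the MM/DD/YYYY and HH:MM:SS forms shown in the sample:

```python
import re

list_of_words = ['08/20/2014', '10:04:27', 'pm', 'complet', 'vendor', 'per',
                 'mfg/recommend', '08/20/2014', '10:04:27', 'pm', 'complet']

# list.remove() deletes one exact element per call, so a filtering
# comprehension with a date/time regex is a better fit here.
date_or_time = re.compile(r'^\d{2}/\d{2}/\d{4}$|^\d{2}:\d{2}:\d{2}$')
cleaned = [w for w in list_of_words if not date_or_time.match(w)]
print(cleaned)
# ['pm', 'complet', 'vendor', 'per', 'mfg/recommend', 'pm', 'complet']
```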

Negation handling in NLP

我只是一个虾纸丫 Submitted on 2020-05-10 03:26:50
Question: I'm currently working on a project where I want to extract emotion from text. As I'm using conceptnet5 (a semantic network), however, I can't simply prefix words in a sentence that contains a negation word, as those words would simply not show up in conceptnet5's API. Here's an example: The movie wasn't that good. Hence, I figured that I could use WordNet's lemma functionality to replace adjectives in sentences that contain negation words like (not, ...). In the previous example, the
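The question is truncated before its concrete ask, but the idea it describes, rewriting a negated adjective so the negation word can be dropped before querying conceptnet5, can be sketched with NLTK's POS tagger and WordNet antonyms. This is only a rough illustration of that idea, not a complete negation-scope handler:

```python
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet as wn

NEGATIONS = {"not", "n't", "never", "no"}

def replace_negated_adjectives(sentence):
    """If a negation word precedes an adjective, drop the negation and swap
    the adjective for its first WordNet antonym (very rough heuristic)."""
    out, negate = [], False
    for word, tag in pos_tag(word_tokenize(sentence)):
        if word.lower() in NEGATIONS:
            negate = True          # remember the negation, drop the word itself
            continue
        if negate and tag.startswith('JJ'):
            antonyms = [ant for syn in wn.synsets(word, pos=wn.ADJ)
                            for lem in syn.lemmas()
                            for ant in lem.antonyms()]
            if antonyms:
                word = antonyms[0].name()
            negate = False
        out.append(word)
    return ' '.join(out)

print(replace_negated_adjectives("The movie wasn't that good."))
```

On the example sentence this should produce roughly "The movie was that bad .", which no longer contains a negation word when queried against conceptnet5.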

How to get a parse tree using Python NLTK?

一世执手 Submitted on 2020-05-09 18:36:27
Question: Given the following sentence: The old oak tree from India fell down. How can I get the following parse tree representation of the sentence using Python NLTK? (ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down))))) I need a complete example, which I couldn't find on the web! Edit: I have gone through this book chapter to learn about parsing using NLTK, but the problem is, I need a grammar to parse sentences or phrases which I do not
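NLTK itself does not ship a broad-coverage constituency grammar, which is why the book chapter's toy grammars do not help here; the usual route is to let NLTK call an external parser such as Stanford CoreNLP. A sketch assuming a CoreNLP server is already running on localhost:9000:

```python
# Assumes a Stanford CoreNLP server has been started locally, e.g. with:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
from nltk.parse.corenlp import CoreNLPParser

parser = CoreNLPParser(url='http://localhost:9000')
tree = next(parser.raw_parse('The old oak tree from India fell down.'))
print(tree)          # bracketed (ROOT (S ...)) string, as in the question
tree.pretty_print()  # ASCII drawing of the same tree
```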

Python NLP Text Tokenization based on custom regex

时光毁灭记忆、已成空白 Submitted on 2020-05-09 16:02:28
Question: I am processing a large amount of text for custom Named Entity Recognition (NER) using spaCy. For text pre-processing I am using nltk for tokenization, etc. I am able to process one of my custom entities, which is based on simple strings. But the other custom entity is a combination of a number and certain text ("20 BBLs", for example). The word_tokenize method from nltk.tokenize tokenizes '20' and 'BBLs' each as a separate token. What I want is to treat them (the number and the 'BBLs'
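The excerpt is cut off, but for the part it does describe, keeping "20 BBLs" as one token, a custom pattern with nltk's RegexpTokenizer (or an equivalent tokenizer exception on the spaCy side) is one option. The sample sentence below is made up for illustration:

```python
from nltk.tokenize import RegexpTokenizer

# The '<number> BBLs' branch is tried before the generic \w+ branch, so the
# quantity and its unit stay together as one token.  'BBLs' is just the unit
# from the question; add more units to the alternation as needed.
tokenizer = RegexpTokenizer(r'\d+\s*BBLs|\w+|[^\w\s]')
print(tokenizer.tokenize("Delivered 20 BBLs to the site on 08/20/2014."))
# ['Delivered', '20 BBLs', 'to', 'the', 'site', 'on', '08', '/', '20', '/', '2014', '.']
```

Because the alternation is evaluated left to right, the unit pattern has to come before the generic word branch, otherwise \w+ would consume the digits first.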