stemming | 易学教程

How to provide (or generate) tags for nltk lemmatizers

Question: I have a set of documents, and I would like to transform them into a form that lets me compute TF-IDF for the words in those documents (so that each document is represented by a vector of TF-IDF values). I thought it was enough to call WordNetLemmatizer.lemmatize(word) and then PorterStemmer, but 'have', 'has', 'had', etc. are not transformed to 'have' by the lemmatizer, and the same goes for other words. Then I read that I am supposed to provide a hint
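The hint in question is a part-of-speech tag: `WordNetLemmatizer.lemmatize` takes a `pos` argument and treats every word as a noun when it is omitted. The sketch below is one possible way to generate that argument from `nltk.pos_tag` output. The helper names `penn_to_wordnet` and `lemmatize_tokens` are made up for illustration, and the NLTK part assumes the WordNet and tagger data have already been downloaded.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as produced by nltk.pos_tag) to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective (same value as nltk.corpus.wordnet.ADJ)
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default


def lemmatize_tokens(tokens):
    """Tag the tokens, then lemmatize each one with its own POS hint."""
    # Imported lazily so the mapping above can be used without NLTK installed.
    from nltk import pos_tag
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    return [
        lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
        for word, tag in pos_tag(tokens)
    ]
```

Calling `lemmatize_tokens(["she", "had", "cats"])` would tag each token first, so verb forms reach the lemmatizer with `pos="v"` instead of the noun default.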

How to find all the related keywords for a root word?

Question: I am trying to find all the keywords that come from the same root word (in a sense, the opposite of stemming). I currently code in R, but I am open to switching to a different language if it helps. For instance, given the root word "rent", I would like to find "renting", "renter", "rental", "rents", and so on.

Answer 1: Try this code in Python:

    from pattern.en import lexeme
    print(lexeme("rent"))

The output generated is:
Installation: pip

Is there a way to reverse stem in python nltk?

Question: I have a list of stems in NLTK/Python and want to get the possible words that produce each stem. Is there a way to take a stem and get a list of words that will stem to it in Python?

Answer 1: To the best of my knowledge, the answer is no. Depending on the stemmer, it might be difficult to come up with an exhaustive search that reverts the effect of the stemming rules, and the results would mostly be invalid words by any standard. E.g., for the Porter stemmer: from nltk.stem.porter import * stemmer =
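A common workaround, which is not taken from the answer above, is to avoid inverting the rules altogether: stem a known vocabulary once and record which words land on each stem. The sketch below assumes such a vocabulary is available. `build_stem_index` and `toy_stem` are hypothetical names, and `toy_stem` is only a stand-in for a real stemmer (for example `PorterStemmer().stem` from NLTK could be passed instead).

```python
from collections import defaultdict


def build_stem_index(vocabulary, stem):
    """Group the words of a known vocabulary by the stem they reduce to."""
    index = defaultdict(set)
    for word in vocabulary:
        index[stem(word)].add(word)
    return index


def toy_stem(word):
    """Hypothetical stand-in stemmer: strips the first matching suffix."""
    for suffix in ("ing", "er", "al", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


vocabulary = ["rent", "renting", "renter", "rental", "rents", "network"]
index = build_stem_index(vocabulary, toy_stem)
print(sorted(index["rent"]))
# ['rent', 'rental', 'renter', 'renting', 'rents']
```

The lookup can only return words that were in the vocabulary, so coverage depends entirely on the word list used to build the index.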

SQL word root matching

Question: I'm wondering whether the major SQL engines (MS SQL, Oracle, MySQL) can understand that two words are related because they share the same root. Matching "networking" when searching for "network" is easy, because the latter is a substring of the former. But do SQL engines have functions that can match "network" when searching for "networking"? Thanks a lot.

Answer 1: This functionality is called a stemmer: an algorithm that can deduce a stem from any form of the word.
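To illustrate the idea of comparing stems rather than raw strings, here is a minimal sketch using SQLite through Python's `sqlite3` module, which lets a Python function be registered as an SQL function. SQLite is not one of the engines named in the question, and how those engines expose stemming is not covered here. The table name `terms`, the SQL function name `stem`, and `toy_stem` itself are assumptions made for the example, with `toy_stem` standing in for a real stemming algorithm.

```python
import sqlite3


def toy_stem(word):
    """Hypothetical stand-in stemmer: lowercases and strips 'ing' or 's'."""
    word = word.lower()
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the name stem(), taking one argument.
conn.create_function("stem", 1, toy_stem)

conn.execute("CREATE TABLE terms (word TEXT)")
conn.executemany(
    "INSERT INTO terms VALUES (?)",
    [("network",), ("networking",), ("networks",), ("database",)],
)

# Both sides are stemmed, so a search for "networking" also finds "network".
rows = conn.execute(
    "SELECT word FROM terms WHERE stem(word) = stem(?) ORDER BY word",
    ("networking",),
).fetchall()
matches = [row[0] for row in rows]
print(matches)
# ['network', 'networking', 'networks']
```

Applying the function to every row at query time prevents index use on `word`. Storing the stem in its own column when rows are inserted would be the usual alternative.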

How to stem a pandas DataFrame using nltk? The output should be a stemmed DataFrame

Question: I'm trying to pre-process a dataset that contains text data. I have created a pandas DataFrame from that dataset. How can I apply stemming to the DataFrame and get a stemmed DataFrame as output?

Answer 1: Given a pandas df, you can stem its contents by tokenizing the words and then applying a stemming function to the whole df. As an example, I used the Snowball stemmer from nltk. from nltk.stem.snowball import SnowballStemmer englishStemmer=SnowballStemmer("english

Why is the number of stems output by the NLTK stemmer different from the expected output?

Question: I have to perform stemming on a text. The tasks are as follows:

1. Tokenize all the words given in tc. A word should contain letters, digits, or underscores. Store the tokenized list of words in tw.
2. Convert all the words to lowercase. Store the result in the variable tw.
3. Remove all the stop words from the unique set of tw. Store the result in the variable fw.
4. Stem each word present in fw with PorterStemmer, and store the result in the list psw.

Below is my code: import re import
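The four tasks can be sketched as follows. This is not the asker's code, which is cut off above, but one possible reading of the task list. The stop-word set and the stemmer are passed in as parameters; the tiny values used in the demo are hypothetical stand-ins for NLTK's English stop-word list and `PorterStemmer().stem`. Sorting the unique set is an added choice to make the output order reproducible.

```python
import re


def preprocess(tc, stop_words, stem):
    """Tokenize, lowercase, drop stop words from the unique set, then stem."""
    # 1. \w+ matches runs of letters, digits, and underscores.
    tw = re.findall(r"\w+", tc)
    # 2. Lowercase every token.
    tw = [word.lower() for word in tw]
    # 3. Deduplicate first, then filter; sorted() only fixes the order.
    fw = [word for word in sorted(set(tw)) if word not in stop_words]
    # 4. Stem each remaining word; the result is a list, not a set.
    psw = [stem(word) for word in fw]
    return tw, fw, psw


def strip_plural(word):
    """Hypothetical stand-in stemmer: drops a trailing 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


tc = "The cats are chasing the cat_toy 2 times"
tw, fw, psw = preprocess(tc, {"the", "are"}, strip_plural)
print(len(tw), len(fw), len(psw))
# 8 5 5
```

Because psw is built from fw one word at a time, it always has the same length as fw, and it may contain repeated stems when two different words reduce to the same stem. Counts will differ if deduplication is applied at a different step.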