lemmatization

How to provide (or generate) tags for nltk lemmatizers

Posted 2021-02-19 06:14:24
Question: I have a set of documents, and I would like to transform them into a form that lets me compute tf-idf for the words in those documents (so that each document is represented by a vector of tf-idf numbers). I thought it would be enough to call WordNetLemmatizer.lemmatize(word) and then PorterStemmer, but 'have', 'has', 'had', etc. are not transformed to 'have' by the lemmatizer, and the same goes for other words. Then I read that I am supposed to provide a hint
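A minimal sketch of that POS-hint approach, assuming NLTK with the punkt, wordnet, and averaged_perceptron_tagger data already downloaded; the penn_to_wordnet helper is my own illustration, not part of NLTK:

    # Sketch: pass a WordNet POS hint to the lemmatizer so that
    # verb forms like 'has'/'had' reduce to 'have'.
    from nltk import pos_tag, word_tokenize
    from nltk.corpus import wordnet
    from nltk.stem import WordNetLemmatizer

    def penn_to_wordnet(tag):
        # Map Penn Treebank tags (from pos_tag) to WordNet POS constants.
        if tag.startswith('J'):
            return wordnet.ADJ
        if tag.startswith('V'):
            return wordnet.VERB
        if tag.startswith('R'):
            return wordnet.ADV
        return wordnet.NOUN  # noun is the lemmatizer's default anyway

    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize("She has had several meetings")
    lemmas = [lemmatizer.lemmatize(tok, penn_to_wordnet(tag))
              for tok, tag in pos_tag(tokens)]
    print(lemmas)  # expected: ['She', 'have', 'have', 'several', 'meeting']

Without the POS argument, lemmatize() assumes every word is a noun, which is why 'has' and 'had' pass through unchanged.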

Spacy lemmatizer issue/consistency

Posted 2021-02-11 18:21:09
Question: I'm currently using spaCy for NLP purposes (mainly lemmatization and tokenization), with the en-core-web-sm model (2.1.0). The following code is run to retrieve a list of words "cleansed" from a query:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp(query)
    list_words = []
    for token in doc:
        if token.text != ' ':
            list_words.append(token.lemma_)

However, I face a major issue when running this code. For example, when the query is "processing of tea leaves", the result stored in list
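A small diagnostic sketch, assuming en_core_web_sm is installed: printing each token's POS tag next to its lemma shows how the tagger's part-of-speech choice drives the lemma spaCy returns (exact output varies by model version):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    # Show surface form, POS tag, and lemma side by side;
    # 'leaves' lemmatizes differently as a noun ('leaf') than as a verb ('leave').
    for token in nlp("processing of tea leaves"):
        print(token.text, token.pos_, token.lemma_)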

Strange lemmatization result in r, textstem package

Posted 2021-01-29 09:30:29
Question: I would like to get the lemma "dive" from all possible forms of the word using the textstem package in R. But when I use textstem, the base form comes out strange:

    library(textstem)
    words <- c("dived", "diving", "dive")
    lemmatize_strings(words, dictionary = lexicon::hash_lemmas)
    [1] "dive" "dive" "diva"

Here I do not want "diva" as the result for the word "dive"; instead I need "dive" to lemmatize to "dive", so it can be counted as the same word as its other forms
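The mechanism behind the surprise is that hash_lemmas is a form-to-lemma lookup table, and if the table maps the surface form "dive" to "diva", the lemmatizer faithfully returns "diva". A toy Python sketch of that lookup style, with a hand-built table standing in for lexicon::hash_lemmas (hypothetical data, not the real lexicon):

    # Toy form->lemma table; the 'dive' -> 'diva' entry mimics the
    # collision reported above.
    lemma_table = {"dived": "dive", "diving": "dive", "dive": "diva"}

    def lemmatize(word, table):
        return table.get(word, word)  # fall back to the word itself

    words = ["dived", "diving", "dive"]
    print([lemmatize(w, lemma_table) for w in words])  # ['dive', 'dive', 'diva']

    # Overriding the offending entry restores the expected behaviour.
    lemma_table["dive"] = "dive"
    print([lemmatize(w, lemma_table) for w in words])  # ['dive', 'dive', 'dive']

The same idea applies in R: supplying a corrected dictionary to lemmatize_strings overrides the offending mapping.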

SQL word root matching

Posted 2021-01-27 07:41:50
Question: I'm wondering whether the major SQL engines out there (MS SQL, Oracle, MySQL) have the ability to understand that two words are related because they share the same root. We know it's easy to match "networking" when searching for "network", because the latter is a substring of the former. But do SQL engines have functions that can match "network" when searching for "networking"? Thanks a lot.

Answer 1: This functionality is called a stemmer: an algorithm that can deduce a stem from any form of the word.
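Stemming itself is easy to demonstrate outside SQL; a short sketch assuming NLTK is installed, showing both forms reducing to the same stem, which is the matching idea the answer describes:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    # Both the query term and the stored term reduce to the same stem,
    # so comparing stems matches 'networking' against 'network'.
    print(stemmer.stem("networking"))  # 'network'
    print(stemmer.stem("network"))     # 'network'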

Wordpiece tokenization versus conventional lemmatization?

Posted 2021-01-02 06:28:10
Question: I'm looking at NLP preprocessing. At some point I want to implement a context-sensitive word embedding as a way of discerning word sense, and I was thinking about using the output from BERT to do so. I noticed that BERT uses WordPiece tokenization (for example, "playing" -> "play" + "##ing"). Right now, I have my text preprocessed using a standard tokenizer that splits on spaces and some punctuation, followed by a lemmatizer ("playing" -> "play"). I'm wondering what the benefit of WordPiece
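A quick way to see WordPiece output next to what a lemmatizer would do, a sketch assuming the transformers package and the bert-base-uncased vocabulary (the exact subwords depend on the vocabulary used):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    # A word outside the vocabulary is split into subword pieces,
    # whereas a lemmatizer would map it to a single dictionary form.
    print(tokenizer.tokenize("puppeteer"))  # e.g. ['puppet', '##eer']
    print(tokenizer.tokenize("playing"))    # common words may stay whole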

OpenNLP: Unable to locate the model file for Lemmatizer

Posted 2020-12-12 18:18:46
Question: Summary: Unable to find the model file used for the Lemmatizer (english-lemmatizer.bin). Details: OpenNLP Tools Models appears to be a comprehensive repository for the various models used by the different components of the Apache OpenNLP library. However, I am unable to find the model file en-lemmatizer.bin, which is used with the lemmatizer. The Apache OpenNLP Developer Manual provides the following code snippet for the lemmatization step:

    InputStream dictLemmatizer = null;
    try (dictLemmatizer