lemmatization

How to provide (or generate) tags for nltk lemmatizers

Posted 2021-02-19 06:14:24
Question: I have a set of documents, and I would like to transform them into a form that lets me compute tf-idf for the words in those documents (so that each document is represented by a vector of tf-idf numbers). I thought it would be enough to call WordNetLemmatizer.lemmatize(word) and then PorterStemmer, but 'have', 'has', 'had', etc. are not transformed to 'have' by the lemmatizer, and the same goes for other words. Then I read that I am supposed to provide a hint
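A minimal sketch of that POS-hint approach, assuming NLTK with the punkt, wordnet, and averaged_perceptron_tagger data already downloaded; the penn_to_wordnet helper is my own illustration, not part of NLTK:

    # Sketch: pass a WordNet POS hint to the lemmatizer so that
    # verb forms like 'has'/'had' reduce to 'have'.
    from nltk import pos_tag, word_tokenize
    from nltk.corpus import wordnet
    from nltk.stem import WordNetLemmatizer

    def penn_to_wordnet(tag):
        # Map Penn Treebank tags (from pos_tag) to WordNet POS constants.
        if tag.startswith('J'):
            return wordnet.ADJ
        if tag.startswith('V'):
            return wordnet.VERB
        if tag.startswith('R'):
            return wordnet.ADV
        return wordnet.NOUN  # noun is the lemmatizer's default anyway

    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize("She has had several meetings")
    lemmas = [lemmatizer.lemmatize(tok, penn_to_wordnet(tag))
              for tok, tag in pos_tag(tokens)]
    print(lemmas)  # expected: ['She', 'have', 'have', 'several', 'meeting']

Without the POS argument, lemmatize() assumes every word is a noun, which is why 'has' and 'had' pass through unchanged.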

Spacy lemmatizer issue/consistency

Posted 2021-02-11 18:21:09
Question: I'm currently using spaCy for NLP purposes (mainly lemmatization and tokenization), with the en-core-web-sm model (2.1.0). The following code is run to retrieve a list of words "cleansed" from a query:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp(query)
    list_words = []
    for token in doc:
        if token.text != ' ':
            list_words.append(token.lemma_)

However, I face a major issue when running this code. For example, when the query is "processing of tea leaves", the result stored in list
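A small diagnostic sketch, assuming en_core_web_sm is installed: printing each token's POS tag next to its lemma shows how the tagger's part-of-speech choice drives the lemma spaCy returns (exact output varies by model version):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    # Show surface form, POS tag, and lemma side by side;
    # 'leaves' lemmatizes differently as a noun ('leaf') than as a verb ('leave').
    for token in nlp("processing of tea leaves"):
        print(token.text, token.pos_, token.lemma_)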

Strange lemmatization result in r, textstem package

Posted 2021-01-29 09:30:29
Question: I would like to get the lemma "dive" from all possible forms of the word using the textstem package in R. But when I use textstem, the base form comes out strange:

    library(textstem)
    words <- c("dived", "diving", "dive")
    lemmatize_strings(words, dictionary = lexicon::hash_lemmas)
    [1] "dive" "dive" "diva"

Here I do not want "diva" as the result for the word "dive"; instead I need "dive" to lemmatize to "dive", so it can be counted as the same word as its other forms
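The mechanism behind the surprise is that hash_lemmas is a form-to-lemma lookup table, and if the table maps the surface form "dive" to "diva", the lemmatizer faithfully returns "diva". A toy Python sketch of that lookup style, with a hand-built table standing in for lexicon::hash_lemmas (hypothetical data, not the real lexicon):

    # Toy form->lemma table; the 'dive' -> 'diva' entry mimics the
    # collision reported above.
    lemma_table = {"dived": "dive", "diving": "dive", "dive": "diva"}

    def lemmatize(word, table):
        return table.get(word, word)  # fall back to the word itself

    words = ["dived", "diving", "dive"]
    print([lemmatize(w, lemma_table) for w in words])  # ['dive', 'dive', 'diva']

    # Overriding the offending entry restores the expected behaviour.
    lemma_table["dive"] = "dive"
    print([lemmatize(w, lemma_table) for w in words])  # ['dive', 'dive', 'dive']

The same idea applies in R: supplying a corrected dictionary to lemmatize_strings overrides the offending mapping.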

SQL word root matching

Posted 2021-01-27 07:41:50
Question: I'm wondering whether the major SQL engines out there (MS SQL, Oracle, MySQL) have the ability to understand that two words are related because they share the same root. We know it's easy to match "networking" when searching for "network", because the latter is a substring of the former. But do SQL engines have functions that can match "network" when searching for "networking"? Thanks a lot.

Answer 1: This functionality is called a stemmer: an algorithm that can deduce a stem from any form of the word.
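Stemming itself is easy to demonstrate outside SQL; a short sketch assuming NLTK is installed, showing both forms reducing to the same stem, which is the matching idea the answer describes:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    # Both the query term and the stored term reduce to the same stem,
    # so comparing stems matches 'networking' against 'network'.
    print(stemmer.stem("networking"))  # 'network'
    print(stemmer.stem("network"))     # 'network'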

Wordpiece tokenization versus conventional lemmatization?

Posted 2021-01-02 06:28:10
Question: I'm looking at NLP preprocessing. At some point I want to implement a context-sensitive word embedding as a way of discerning word sense, and I was thinking about using the output from BERT to do so. I noticed that BERT uses WordPiece tokenization (for example, "playing" -> "play" + "##ing"). Right now, I have my text preprocessed using a standard tokenizer that splits on spaces and some punctuation, followed by a lemmatizer ("playing" -> "play"). I'm wondering what the benefit of WordPiece
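A quick way to see WordPiece output next to what a lemmatizer would do, a sketch assuming the transformers package and the bert-base-uncased vocabulary (the exact subwords depend on the vocabulary used):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    # A word outside the vocabulary is split into subword pieces,
    # whereas a lemmatizer would map it to a single dictionary form.
    print(tokenizer.tokenize("puppeteer"))  # e.g. ['puppet', '##eer']
    print(tokenizer.tokenize("playing"))    # common words may stay whole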

OpenNLP: Unable to locate the model file for Lemmatizer

Posted 2020-12-12 18:18:46
Question: Summary: Unable to find the model file used for the Lemmatizer (english-lemmatizer.bin). Details: OpenNLP Tools Models appears to be a comprehensive repository for the various models used by the different components of the Apache OpenNLP library. However, I am unable to find the model file en-lemmatizer.bin, which is used with the lemmatizer. The Apache OpenNLP Developer Manual provides the following code snippet for the lemmatization step:

    InputStream dictLemmatizer = null;
    try (dictLemmatizer