nltk

Semantic Similarity across multiple languages

Submitted by 感情迁移 on 2020-01-05 05:36:06
Question: I am using word embeddings for finding similarity between two sentences. Using word2vec, I also get a similarity measure if one sentence is in English and the other one in Dutch (though not a very good one). So I started wondering if it's possible to compute the similarity between two sentences in two different languages (without an explicit translation), especially if the languages have some similarities (English/Dutch)? Answer 1: Let's assume that your sentence-similarity scheme uses only word-vectors
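The word-vector approach the answer alludes to can be sketched without any trained model: average the word vectors of each sentence and compare the averages by cosine similarity. The embedding table below is purely hypothetical toy data placed in one shared space; in practice English and Dutch vectors live in separately trained spaces and must first be aligned (e.g. with a learned linear mapping).

```python
from math import sqrt

# Hypothetical toy vectors in one shared space (NOT real word2vec output).
EMB = {
    "cat":  [0.90, 0.10, 0.00],
    "kat":  [0.85, 0.15, 0.05],  # Dutch "cat"
    "sits": [0.10, 0.80, 0.10],
    "zit":  [0.12, 0.75, 0.10],  # Dutch "sits"
}

def sentence_vector(tokens):
    """Average the vectors of all tokens found in the lexicon."""
    vecs = [EMB[t] for t in tokens if t in EMB]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

sim = cosine(sentence_vector(["cat", "sits"]), sentence_vector(["kat", "zit"]))
print(round(sim, 3))  # close to 1.0 for these toy near-synonym vectors
```

The same averaging-plus-cosine scheme works regardless of language, which is why a rough cross-lingual score emerges even without translation; the quality depends entirely on how well the two vector spaces are aligned.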

Lemmatizing words after POS tagging produces unexpected results

Submitted by 老子叫甜甜 on 2020-01-05 03:54:05
Question: I am using python3.5 with the nltk pos_tag function and the WordNetLemmatizer. My goal is to flatten words in our database to classify text. I am trying to test using the lemmatizer, and I encounter strange behavior when using the POS tagger on identical tokens. In the example below, I have a list of three strings; when running them through the POS tagger, every other element is returned as a noun (NN) and the rest are returned as verbs (VBG). This affects the lemmatization. The output looks like
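Why alternating NN/VBG tags change the result: WordNetLemmatizer looks words up under the POS it is given, so a token tagged as a noun and the same token tagged as a verb can yield different lemmas. The toy lemma table below is hypothetical (plain Python, no NLTK) and just illustrates that dependence; WordNetLemmatizer does the real lookup.

```python
# Hypothetical lemma table: (word, wordnet_pos) -> lemma.
# NLTK's WordNet POS constants are the single letters 'n', 'v', 'a', 'r'.
TOY_LEMMAS = {
    ("dancing", "v"): "dance",    # tagged VBG -> verb lemma
    ("dancing", "n"): "dancing",  # tagged NN  -> noun lemma (unchanged)
}

def toy_lemmatize(word, pos="n"):
    """Mimic the lookup shape of WordNetLemmatizer.lemmatize(word, pos)."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("dancing", "v"))  # -> dance
print(toy_lemmatize("dancing", "n"))  # -> dancing
```

So when the tagger alternates between NN and VBG on identical tokens, the lemmatizer alternates with it, which is the "unexpected result" the question describes.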

Mapping Wordnet Senses to Verbnet

Submitted by 血红的双手。 on 2020-01-04 15:32:27
Question: http://digital.library.unt.edu/ark:/67531/metadc30973/m2/1/high_res_d/Mihalcea-2005-Putting_Pieces_Together-Combining_FrameNet.pdf In the link above, on the sixth page, the paper mentions that a mapping was made: "The process of mapping VerbNet to WordNet is thus semi-automatic. We first manually link all semantic constraints defined in VerbNet (there are 36 such constraints) to one or more nodes in the WordNet semantic hierarchy." I am trying to use this mapping in NLTK Python with Verbnet
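The paper's mapping is, structurally, a table from each VerbNet selectional restriction to one or more WordNet nodes. NLTK ships the VerbNet corpus itself (`nltk.corpus.verbnet`, after `nltk.download('verbnet')`), but the restriction-to-synset table from the paper is not bundled with it, so you would build or obtain it separately. A minimal sketch of that table's shape, with entirely hypothetical synset names:

```python
# Hypothetical mapping: VerbNet selectional restriction -> WordNet synset names.
# Both the restriction labels and the synset names here are illustrative only,
# NOT the paper's actual 36-entry mapping.
RESTRICTION_TO_SYNSETS = {
    "+animate":  ["animate_thing.n.01"],
    "+concrete": ["physical_entity.n.01"],
}

def wordnet_nodes(restriction):
    """Look up the WordNet nodes linked to a VerbNet restriction."""
    return RESTRICTION_TO_SYNSETS.get(restriction, [])

print(wordnet_nodes("+animate"))
```

Once such a table exists, checking whether a verb argument satisfies a restriction reduces to a hypernym walk in WordNet from the argument's synset up to one of the mapped nodes.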

lemmatize plural nouns using nltk and wordnet

Submitted by 喜你入骨 on 2020-01-04 02:04:17
Question: I want to lemmatize using: from nltk import word_tokenize, sent_tokenize, pos_tag from nltk.stem.wordnet import WordNetLemmatizer from nltk.corpus import wordnet lmtzr = WordNetLemmatizer() POS = pos_tag(text) def get_wordnet_pos(treebank_tag): #maps pos tag so lemmatizer understands from nltk.corpus import wordnet if treebank_tag.startswith('J'): return wordnet.ADJ elif treebank_tag.startswith('V'): return wordnet.VERB elif treebank_tag.startswith('N'): return wordnet.NOUN elif treebank_tag
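The excerpt's helper is cut off mid-definition. A completed, self-contained version is sketched below, using the literal single-letter values that NLTK's `wordnet.ADJ`/`wordnet.VERB`/`wordnet.NOUN`/`wordnet.ADV` constants hold, so it runs without NLTK installed. The final fallback to noun matters: `lemmatize()` needs a valid POS for plurals such as "geese".

```python
def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS letter."""
    if treebank_tag.startswith("J"):
        return "a"   # wordnet.ADJ
    elif treebank_tag.startswith("V"):
        return "v"   # wordnet.VERB
    elif treebank_tag.startswith("N"):
        return "n"   # wordnet.NOUN
    elif treebank_tag.startswith("R"):
        return "r"   # wordnet.ADV
    # Fall back to noun so lemmatize() always receives a valid POS.
    return "n"

print(get_wordnet_pos("NNS"))  # -> n
print(get_wordnet_pos("VBD"))  # -> v
```

With this in place, `lmtzr.lemmatize(word, get_wordnet_pos(tag))` handles plural nouns (NNS) correctly, since they map to `'n'` rather than an unrecognized value.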

Extracting words from txt file using python

Submitted by こ雲淡風輕ζ on 2020-01-04 01:56:10
Question: I want to extract all the words that are between single quotation marks from a text file. The text file looks like this: u'MMA': 10, =u'acrylic'= : 19, == u'acting lessons': 2, =u'aerobic': 141, =u'alto': 2= 4, =u&#= 39;art therapy': 4, =u'ballet': 939, =u'ballroom'= ;: 234, = =u'banjo': 38, And ideally, my output would look like this: MMA, acrylic, acting lessons, ... From browsing posts, it seems like I should use some combination of NLTK / regex for Python to accomplish this. I've tried the
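NLTK is not really needed here; the standard `re` module suffices. One plausible approach, sketched on a shortened copy of the sample: strip the quoted-printable-style `=` padding first, then capture everything between `u'...'` pairs (the real file would also need `&#39;` HTML-entity residue replaced with `'` before matching).

```python
import re

# Shortened copy of the sample line from the question.
raw = "u'MMA': 10, =u'acrylic'= : 19, == u'acting lessons': 2, =u'aerobic': 141"

# Remove the "=" soft-wrap padding, then decode any HTML-escaped quotes.
cleaned = raw.replace("=", "").replace("&#39;", "'")

# Capture the text between each u'...' pair.
words = re.findall(r"u'([^']+)'", cleaned)
print(words)  # -> ['MMA', 'acrylic', 'acting lessons', 'aerobic']
```

The `[^']+` character class keeps the match from spilling past a closing quote, so multi-word entries like "acting lessons" come out intact.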

How can I get the stanford NLTK python module?

Submitted by ≯℡__Kan透↙ on 2020-01-03 18:57:46
Question: I have the python (2.7.5) and python-nltk packages installed on Ubuntu 13.10. Running apt-cache policy python-nltk returns: python-nltk: Installed: 2.0~b9-0ubuntu4 And according to the Stanford site, 2.0+ should have the stanford module. Yet when I try to import it, I get an error: >>> import nltk.tag.stanford Traceback (most recent call last): File "<stdin>", line 1, in <module> ImportError: No module named stanford How can I get the stanford module? (Preferably through the usual
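The likely culprit is the version string itself: `2.0~b9` is a pre-release beta that sorts before the 2.0.x releases, so it can predate the stanford tagger module even though it reads as "2.0". A small probe, sketched below, confirms whether the installed NLTK has the module; if not, the usual fix is a newer NLTK from pip (`pip install --upgrade nltk`) rather than the apt package.

```python
# Probe for the stanford tagger submodule in whatever NLTK is installed.
try:
    import nltk.tag.stanford  # noqa: F401
    stanford_available = True
except ImportError:
    stanford_available = False

print("stanford module available:", stanford_available)
```

If the probe prints `False`, upgrading NLTK outside apt (e.g. with pip into a virtualenv) is the standard route, since Ubuntu 13.10's repository pins the older beta.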

NLTK words lemmatizing

Submitted by 强颜欢笑 on 2020-01-03 17:23:32
Question: I am trying to do lemmatization on words with NLTK. What I can find now is that I can use the stem package to get some results, like transforming "cars" to "car" and "women" to "woman"; however, I cannot do lemmatization on some words with affixes, like "acknowledgement". When using WordNetLemmatizer() on "acknowledgement", it returns "acknowledgement", and using PorterStemmer(), it returns "acknowledg" rather than "acknowledge". Can anyone tell me how to eliminate the affixes of words? Say, when
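The reason both tools "fail" here: "acknowledgement" → "acknowledge" is a derivational relation (noun to verb), while lemmatizers only undo inflection ("cars" → "car") and stemmers just chop suffixes. WordNet does expose derivational links via `Lemma.derivationally_related_forms()`. The toy table below (hypothetical entries, no NLTK needed) just sketches the shape of such a lookup:

```python
# Hypothetical derivational lookup: noun -> base verb. In NLTK one would walk
# wordnet Lemma.derivationally_related_forms() instead of a hand-made table.
DERIVATIONS = {
    "acknowledgement": "acknowledge",
    "government": "govern",
}

def strip_derivational_suffix(word):
    """Return the derivationally related base form if one is known."""
    return DERIVATIONS.get(word, word)

print(strip_derivational_suffix("acknowledgement"))  # -> acknowledge
```

Unlike a stemmer, a lookup of this kind always returns a real word, which avoids truncated outputs like "acknowledg".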

Get gender from noun using NLTK with German corpora

Submitted by £可爱£侵袭症+ on 2020-01-03 17:11:10
Question: I'm experimenting with NLTK. My question is whether the library can detect the gender of a noun in German. I want this information in order to determine whether a text is written gender-neutrally. See here for more information: https://en.wikipedia.org/wiki/Gender_neutrality_in_languages_with_grammatical_gender The underlying code categorizes my sentence, but I can't see any information about the gender of "Mitarbeiter". My code so far: sentence = """Der Mitarbeiter geht.""" tokens = nltk
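NLTK's default taggers annotate part of speech, not German grammatical gender, so this information has to come from a lexicon lookup. The tiny dictionary below is purely illustrative (a real project would use a full German morphology resource); it sketches how a gender check for a token like "Mitarbeiter" could work:

```python
# Hypothetical mini-lexicon: German noun -> grammatical gender.
GENDER = {
    "Mitarbeiter": "masculine",
    "Mitarbeiterin": "feminine",
    "Team": "neuter",
}

def noun_gender(noun):
    """Look up a noun's grammatical gender; 'unknown' if not in the lexicon."""
    return GENDER.get(noun, "unknown")

print(noun_gender("Mitarbeiter"))  # -> masculine
```

A gender-neutrality check over a whole text would then tag the text, run every noun token through such a lookup, and flag sentences where only one gendered form (e.g. "Mitarbeiter" but never "Mitarbeiterin") appears.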
