nlp

Why are stop words not being excluded from the word cloud when using Python's wordcloud library?

倾然丶 夕夏残阳落幕 submitted on 2020-06-28 04:03:54
Question: I want to exclude 'The', 'They' and 'My' from being displayed in my word cloud. I'm using the Python library 'wordcloud' as below, and updating the STOPWORDS list with these 3 additional stopwords, but the word cloud is still including them. What do I need to change so that these 3 words are excluded? The libraries I imported are:
import numpy as np
import pandas as pd
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
I've tried adding elements to the STOPWORDS set at

Patterns with ENT_TYPE from manually labelled Span not working

半世苍凉 submitted on 2020-06-28 04:03:51
Question: As an alternative to accomplishing this: Patterns with multi-term entries in the IN attribute, I wrote the following code to match phrases, label them, and then use them in EntityRuler patterns:
# %%
import spacy
from spacy.matcher import PhraseMatcher
from spacy.pipeline import EntityRuler
from spacy.tokens import Span

class PhraseRuler(object):
    name = 'phrase_ruler'
    def __init__(self, nlp, terms, label):
        patterns = [nlp(term) for term in terms]
        self.matcher = PhraseMatcher(nlp.vocab)
        self
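
A sketch of how such a component might be completed (an assumption, not the asker's full code, which is truncated above): for later `ENT_TYPE` patterns to match, the component's `__call__` has to write the matched spans into `doc.ents`; labelling spans without setting `doc.ents` leaves nothing for `ENT_TYPE` to see. Note the `matcher.add` signature differs between spaCy v2 (`matcher.add(label, None, *patterns)`) and v3 (`matcher.add(label, patterns)`):

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

class PhraseRuler(object):
    name = "phrase_ruler"

    def __init__(self, nlp, terms, label):
        self.label = label
        self.matcher = PhraseMatcher(nlp.vocab)
        # spaCy v3-style call; use matcher.add(label, None, *patterns) on v2
        self.matcher.add(label, [nlp(term) for term in terms])

    def __call__(self, doc):
        spans = [Span(doc, start, end, label=self.label)
                 for _, start, end in self.matcher(doc)]
        # Expose matches as entities so downstream ENT_TYPE patterns see them
        doc.ents = spans
        return doc
```

The component also has to sit in the pipeline *before* any `EntityRuler` whose patterns reference `ENT_TYPE`.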

Add attention layer to Seq2Seq model

混江龙づ霸主 submitted on 2020-06-27 17:33:09
Question: I have built a Seq2Seq encoder-decoder model. I want to add an attention layer to it. I tried adding an attention layer through this, but it didn't help. Here is my initial code without attention:
# Encoder
encoder_inputs = Input(shape=(None,))
enc_emb = Embedding(num_encoder_tokens, latent_dim, mask_zero=True)(encoder_inputs)
encoder_lstm = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(enc_emb)
# We discard `encoder_outputs` and only keep the states.
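
One common way to wire attention into this architecture (a sketch using tf.keras's built-in Luong-style `Attention` layer; `num_decoder_tokens`, the decoder side, and the token counts are assumptions, since the question's code is truncated): the key change to the code above is `return_sequences=True` on the encoder LSTM, because attention needs every encoder timestep, not just the final states:

```python
from tensorflow.keras.layers import (Input, Embedding, LSTM, Dense,
                                     Attention, Concatenate)
from tensorflow.keras.models import Model

num_encoder_tokens, num_decoder_tokens, latent_dim = 5000, 5000, 256  # assumed

# Encoder -- note return_sequences=True so all timesteps are kept
encoder_inputs = Input(shape=(None,))
enc_emb = Embedding(num_encoder_tokens, latent_dim, mask_zero=True)(encoder_inputs)
encoder_outputs, state_h, state_c = LSTM(
    latent_dim, return_sequences=True, return_state=True)(enc_emb)

# Decoder, initialised from the encoder's final states
decoder_inputs = Input(shape=(None,))
dec_emb = Embedding(num_decoder_tokens, latent_dim, mask_zero=True)(decoder_inputs)
decoder_outputs, _, _ = LSTM(
    latent_dim, return_sequences=True, return_state=True)(
        dec_emb, initial_state=[state_h, state_c])

# Luong-style attention: decoder states query the encoder outputs
context = Attention()([decoder_outputs, encoder_outputs])
decoder_concat = Concatenate()([decoder_outputs, context])
outputs = Dense(num_decoder_tokens, activation="softmax")(decoder_concat)

model = Model([encoder_inputs, decoder_inputs], outputs)
```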

Merge nearly similar rows with the help of spaCy

≡放荡痞女 submitted on 2020-06-27 17:02:04
Question: I want to merge some rows if they are nearly similar. Similarity can be checked by using spaCy.
df:
string
yellow color
yellow color looks like
yellow color bright
red color okay
red color blood
output:
string
yellow color looks like bright
red color okay blood
solution: the brute-force approach is: for every item in string, check its similarity with the other n-1 items; if it is greater than some threshold value, then merge. Is there any other approach? As I am not in contact with many people, I don't know how they do it

Finding the POS of the root of a noun_chunk with spaCy

人盡茶涼 submitted on 2020-06-27 06:06:29
Question: When using spaCy you can easily loop across the noun_chunks of a text as follows:
S = 'This is an example sentence that should include several parts and also make clear that studying Natural language Processing is not difficult'
nlp = spacy.load('en_core_web_sm')
doc = nlp(S)
[chunk.text for chunk in doc.noun_chunks]
# = ['an example sentence', 'several parts', 'Natural language Processing']
You can also get the "root" of the noun chunk:
[chunk.root.text for chunk in doc.noun_chunks]
# = [

Extract entities from a multiple-subject passive sentence with spaCy

坚强是说给别人听的谎言 submitted on 2020-06-27 04:33:20
Question: Using Python spaCy, I am trying to extract entities from a multiple-subject passive-voice sentence.
Sentence = "John and Jenny were accused of crimes by David"
My intention is to extract both "John" and "Jenny" from the sentence as nsubjpass and .ent_. However, I am only able to extract "John" as nsubjpass. How can I extract both of them? Notice that while John is found as an entity in .ents, Jenny is considered conj instead of nsubjpass. How can I improve it? Code:
each_sentence3 = "John and Jenny

Glove Word Embeddings supported languages

梦想与她 submitted on 2020-06-26 13:44:26
Question: I started experimenting with word embeddings, and I found some results which I don't know how to interpret. I first used an English corpus for both training and testing; afterwards, I used the English corpus for training and a small French corpus for testing (all corpora have been annotated for the same binary classification task). In both cases, I used the GloVe embeddings pre-trained on tweets. As the results in the case where I also used the French corpus improved (by almost 5%,

Keras embedding layer masking. Why does input_dim need to be |vocabulary| + 2?

不羁的心 submitted on 2020-06-25 02:38:07
Question: In the Keras docs for Embedding (https://keras.io/layers/embeddings/), the explanation given for mask_zero is:
mask_zero: Whether or not the input value 0 is a special "padding" value that should be masked out. This is useful when using recurrent layers which may take variable-length input. If this is True, then all subsequent layers in the model need to support masking or an exception will be raised.
If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input
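
The arithmetic behind the title can be sketched as follows (an assumption about the setup: the "+2" only holds when one extra index is reserved for out-of-vocabulary words on top of the padding index). With `mask_zero=True`, index 0 is reserved for padding, real words occupy 1..|V|, and an optional OOV slot takes |V|+1, so `input_dim` must be |V|+2 to cover indices 0..|V|+1:

```python
import numpy as np
from tensorflow.keras.layers import Embedding

vocab = ["the", "cat", "sat"]                         # |V| = 3
word_index = {w: i + 1 for i, w in enumerate(vocab)}  # 0 kept for padding
oov_index = len(vocab) + 1                            # hypothetical OOV slot

# input_dim = highest index + 1 = |V| + 2
emb = Embedding(input_dim=len(vocab) + 2, output_dim=8, mask_zero=True)

batch = np.array([[word_index["the"], word_index["cat"], oov_index, 0]])
out = emb(batch)   # shape (1, 4, 8); the trailing 0 timestep is masked
```

Without an OOV slot, `input_dim = |V| + 1` is already enough: one extra over the vocabulary size, for the padding index alone.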

Extracting Key-Phrases from text based on the Topic with Python

冷暖自知 submitted on 2020-06-24 14:57:09
Question: I have a large dataset with 3 columns; the columns are text, phrase and topic. I want to find a way to extract key-phrases (the phrase column) based on the topic. A key-phrase can be part of the text value or the whole text value.
import pandas as pd
text = ["great game with a lot of amazing goals from both teams",
        "goalkeepers from both teams made misteke",
        "he won all four grand slam championchips",
        "the best player from three-point line",
        "Novak Djokovic is the best player of all time",
        "amazing
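
A minimal baseline for this setup (a sketch only; the `topic_phrases` dictionary is hypothetical and would in practice be built from the labelled phrase column of the training data): keep, per row, only the candidate phrases of that row's topic that literally occur in its text:

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["great game with a lot of amazing goals from both teams",
             "Novak Djokovic is the best player of all time"],
    "topic": ["football", "tennis"],
})

# Hypothetical per-topic candidate lists
topic_phrases = {
    "football": ["amazing goals", "both teams"],
    "tennis": ["best player", "grand slam"],
}

def extract(row):
    # keep the candidates of this row's topic that occur in its text
    return [p for p in topic_phrases.get(row["topic"], []) if p in row["text"]]

df["key_phrases"] = df.apply(extract, axis=1)
```

Substring matching is crude (no stemming, no paraphrase handling), but it gives a baseline to measure fancier phrase-extraction methods against.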