nlp

Why are stop words not being excluded from the word cloud when using Python's wordcloud library?

倾然丶 夕夏残阳落幕 submitted on 2020-06-28 04:03:54
Question: I want to exclude 'The', 'They' and 'My' from being displayed in my word cloud. I'm using the Python library 'wordcloud' as below, and updating the STOPWORDS list with these 3 additional stopwords, but the word cloud is still including them. What do I need to change so that these 3 words are excluded? The libraries I imported are:
import numpy as np
import pandas as pd
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
I've tried adding elements to the STOPWORDS set at

Patterns with ENT_TYPE from manually labelled Span not working

半世苍凉 submitted on 2020-06-28 04:03:51
Question: As an alternative to accomplishing this: Patterns with multi-term entries in the IN attribute, I wrote the following code to match phrases, label them, and then use them in EntityRuler patterns:
# %%
import spacy
from spacy.matcher import PhraseMatcher
from spacy.pipeline import EntityRuler
from spacy.tokens import Span

class PhraseRuler(object):
    name = 'phrase_ruler'
    def __init__(self, nlp, terms, label):
        patterns = [nlp(term) for term in terms]
        self.matcher = PhraseMatcher(nlp.vocab)
        self
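
A sketch of how such a component might be completed (an assumption, not the asker's full code, which is truncated above): for later `ENT_TYPE` patterns to match, the component's `__call__` has to write the matched spans into `doc.ents`; labelling spans without setting `doc.ents` leaves nothing for `ENT_TYPE` to see. Note the `matcher.add` signature differs between spaCy v2 (`matcher.add(label, None, *patterns)`) and v3 (`matcher.add(label, patterns)`):

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

class PhraseRuler(object):
    name = "phrase_ruler"

    def __init__(self, nlp, terms, label):
        self.label = label
        self.matcher = PhraseMatcher(nlp.vocab)
        # spaCy v3-style call; use matcher.add(label, None, *patterns) on v2
        self.matcher.add(label, [nlp(term) for term in terms])

    def __call__(self, doc):
        spans = [Span(doc, start, end, label=self.label)
                 for _, start, end in self.matcher(doc)]
        # Expose matches as entities so downstream ENT_TYPE patterns see them
        doc.ents = spans
        return doc
```

The component also has to sit in the pipeline *before* any `EntityRuler` whose patterns reference `ENT_TYPE`.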

Add attention layer to Seq2Seq model

混江龙づ霸主 submitted on 2020-06-27 17:33:09
Question: I have built a Seq2Seq encoder-decoder model. I want to add an attention layer to it. I tried adding an attention layer through this, but it didn't help. Here is my initial code without attention:
# Encoder
encoder_inputs = Input(shape=(None,))
enc_emb = Embedding(num_encoder_tokens, latent_dim, mask_zero=True)(encoder_inputs)
encoder_lstm = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(enc_emb)
# We discard `encoder_outputs` and only keep the states.
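
One common way to wire attention into this architecture (a sketch using tf.keras's built-in Luong-style `Attention` layer; `num_decoder_tokens`, the decoder side, and the token counts are assumptions, since the question's code is truncated): the key change to the code above is `return_sequences=True` on the encoder LSTM, because attention needs every encoder timestep, not just the final states:

```python
from tensorflow.keras.layers import (Input, Embedding, LSTM, Dense,
                                     Attention, Concatenate)
from tensorflow.keras.models import Model

num_encoder_tokens, num_decoder_tokens, latent_dim = 5000, 5000, 256  # assumed

# Encoder -- note return_sequences=True so all timesteps are kept
encoder_inputs = Input(shape=(None,))
enc_emb = Embedding(num_encoder_tokens, latent_dim, mask_zero=True)(encoder_inputs)
encoder_outputs, state_h, state_c = LSTM(
    latent_dim, return_sequences=True, return_state=True)(enc_emb)

# Decoder, initialised from the encoder's final states
decoder_inputs = Input(shape=(None,))
dec_emb = Embedding(num_decoder_tokens, latent_dim, mask_zero=True)(decoder_inputs)
decoder_outputs, _, _ = LSTM(
    latent_dim, return_sequences=True, return_state=True)(
        dec_emb, initial_state=[state_h, state_c])

# Luong-style attention: decoder states query the encoder outputs
context = Attention()([decoder_outputs, encoder_outputs])
decoder_concat = Concatenate()([decoder_outputs, context])
outputs = Dense(num_decoder_tokens, activation="softmax")(decoder_concat)

model = Model([encoder_inputs, decoder_inputs], outputs)
```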

Merge nearly similar rows with the help of spaCy

≡放荡痞女 submitted on 2020-06-27 17:02:04
Question: I want to merge some rows if they are nearly similar. Similarity can be checked by using spaCy.
df:
string
yellow color
yellow color looks like
yellow color bright
red color okay
red color blood
output:
string
yellow color looks like bright
red color okay blood
solution: the brute-force approach is: for every item in string, check its similarity with the other n-1 items; if it is greater than some threshold value, then merge. Is there any other approach? As I am not in contact with many people, I don't know how they do it

Finding the POS of the root of a noun_chunk with spaCy

人盡茶涼 submitted on 2020-06-27 06:06:29
Question: When using spaCy you can easily loop across the noun_chunks of a text as follows:
S = 'This is an example sentence that should include several parts and also make clear that studying Natural language Processing is not difficult'
nlp = spacy.load('en_core_web_sm')
doc = nlp(S)
[chunk.text for chunk in doc.noun_chunks]
# = ['an example sentence', 'several parts', 'Natural language Processing']
You can also get the "root" of the noun chunk:
[chunk.root.text for chunk in doc.noun_chunks]
# = [

Extract entities from a multiple-subject passive sentence with spaCy

坚强是说给别人听的谎言 submitted on 2020-06-27 04:33:20
Question: Using Python spaCy, I am trying to extract entities from a multiple-subject passive-voice sentence.
Sentence = "John and Jenny were accused of crimes by David"
My intention is to extract both "John" and "Jenny" from the sentence as nsubjpass and .ent_. However, I am only able to extract "John" as nsubjpass. How can I extract both of them? Notice that while John is found as an entity in .ents, Jenny is considered conj instead of nsubjpass. How can I improve it? Code:
each_sentence3 = "John and Jenny

Glove Word Embeddings supported languages

梦想与她 submitted on 2020-06-26 13:44:26
Question: I started experimenting with word embeddings, and I found some results which I don't know how to interpret. I first used an English corpus for both training and testing; afterwards, I used the English corpus for training and a small French corpus for testing (all corpora have been annotated for the same binary classification task). In both cases, I used the GloVe embeddings pre-trained on tweets. As the results in the case where I also used the French corpus improved (by almost 5%,

Keras embedding layer masking. Why does input_dim need to be |vocabulary| + 2?

不羁的心 submitted on 2020-06-25 02:38:07
Question: In the Keras docs for Embedding (https://keras.io/layers/embeddings/), the explanation given for mask_zero is:
mask_zero: Whether or not the input value 0 is a special "padding" value that should be masked out. This is useful when using recurrent layers which may take variable-length input. If this is True, then all subsequent layers in the model need to support masking or an exception will be raised.
If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input
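
The arithmetic behind the title can be sketched as follows (an assumption about the setup: the "+2" only holds when one extra index is reserved for out-of-vocabulary words on top of the padding index). With `mask_zero=True`, index 0 is reserved for padding, real words occupy 1..|V|, and an optional OOV slot takes |V|+1, so `input_dim` must be |V|+2 to cover indices 0..|V|+1:

```python
import numpy as np
from tensorflow.keras.layers import Embedding

vocab = ["the", "cat", "sat"]                         # |V| = 3
word_index = {w: i + 1 for i, w in enumerate(vocab)}  # 0 kept for padding
oov_index = len(vocab) + 1                            # hypothetical OOV slot

# input_dim = highest index + 1 = |V| + 2
emb = Embedding(input_dim=len(vocab) + 2, output_dim=8, mask_zero=True)

batch = np.array([[word_index["the"], word_index["cat"], oov_index, 0]])
out = emb(batch)   # shape (1, 4, 8); the trailing 0 timestep is masked
```

Without an OOV slot, `input_dim = |V| + 1` is already enough: one extra over the vocabulary size, for the padding index alone.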

Extracting Key-Phrases from text based on the Topic with Python

冷暖自知 submitted on 2020-06-24 14:57:09
Question: I have a large dataset with 3 columns; the columns are text, phrase and topic. I want to find a way to extract key-phrases (the phrase column) based on the topic. A key-phrase can be part of the text value or the whole text value.
import pandas as pd
text = ["great game with a lot of amazing goals from both teams",
        "goalkeepers from both teams made misteke",
        "he won all four grand slam championchips",
        "the best player from three-point line",
        "Novak Djokovic is the best player of all time",
        "amazing
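
A minimal baseline for this setup (a sketch only; the `topic_phrases` dictionary is hypothetical and would in practice be built from the labelled phrase column of the training data): keep, per row, only the candidate phrases of that row's topic that literally occur in its text:

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["great game with a lot of amazing goals from both teams",
             "Novak Djokovic is the best player of all time"],
    "topic": ["football", "tennis"],
})

# Hypothetical per-topic candidate lists
topic_phrases = {
    "football": ["amazing goals", "both teams"],
    "tennis": ["best player", "grand slam"],
}

def extract(row):
    # keep the candidates of this row's topic that occur in its text
    return [p for p in topic_phrases.get(row["topic"], []) if p in row["text"]]

df["key_phrases"] = df.apply(extract, axis=1)
```

Substring matching is crude (no stemming, no paraphrase handling), but it gives a baseline to measure fancier phrase-extraction methods against.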