nlp | 易学教程

How to train a model that will result in the similarity score between two news titles?

Question: I am trying to build a fake news classifier, and I am quite new to this field. I have a column "title_1_en" that holds the title of a fake news item and another column called "title_2_en". There are three target labels, "agreed", "disagreed", and "unrelated", depending on whether the title in "title_2_en" agrees with, disagrees with, or is unrelated to the title in the first column. I have tried calculating basic cosine similarity between the two titles after converting the words of the sentences into vectors. This has ...
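
The cosine-similarity baseline described in the question can be sketched roughly as follows. This is a minimal illustration using plain word counts and only the Python standard library; the asker's actual vectorization is not shown in the excerpt, and the two example titles are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(title_a, title_b):
    """Cosine similarity between the word-count vectors of two titles."""
    counts_a = Counter(tokenize(title_a))
    counts_b = Counter(tokenize(title_b))
    # Dot product over the words the two titles share.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty title has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical titles, not taken from the asker's data.
title_1_en = "Scientists confirm water found on Mars"
title_2_en = "Scientists deny water found on Mars"
score = cosine_similarity(title_1_en, title_2_en)
print(round(score, 3))
```

The two example titles contradict each other yet share five of their six words, so the score comes out high. This suggests that a raw similarity score may separate "unrelated" pairs reasonably well but is unlikely to distinguish "agreed" from "disagreed" on its own.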

Added layer must be an instance of class layer

Question: I am building a Bi-LSTM network and have included an attention layer in it, but it raises an error saying that the added layer must be an instance of class layer. Some of the libraries I have imported are:

    from keras.models import Model, Sequential
    from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding, Bidirectional, Conv1D, Flatten, GlobalMaxPooling1D, SpatialDropout1D
    from tensorflow import keras
    from tensorflow.keras import layers
    from tensorflow.keras import backend ...

How to treat numbers inside text strings when vectorizing words?

Question: If I have a text string to be vectorized, how should I handle the numbers inside it? Or, if I feed a neural network with numbers and words, how can I keep the numbers as numbers? I am planning to build a dictionary of all my words (as suggested here). In this case all strings will become arrays of numbers. How should I handle characters that are digits? How can I output a vector that does not mix up the word index with the numeric character? Does converting numbers to strings weaken the information i...
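
One possible way to avoid mixing word indices with numeric values is sketched below. It is only an illustration, not an answer taken from the thread: every number is mapped to a single placeholder id, and its value is carried in a separate, parallel feature list. The names `NUM_TOKEN`, `UNK_TOKEN`, `build_vocab`, and `encode` are hypothetical.

```python
import re

NUM_TOKEN = "<num>"  # hypothetical placeholder shared by every number
UNK_TOKEN = "<unk>"  # hypothetical placeholder for words not in the vocabulary
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|[a-z]+")


def build_vocab(texts):
    """Assign an integer id to every word; numbers all share NUM_TOKEN."""
    vocab = {UNK_TOKEN: 0, NUM_TOKEN: 1}
    for text in texts:
        for token in TOKEN_RE.findall(text.lower()):
            if NUMBER_RE.fullmatch(token):
                continue  # numbers never get their own vocabulary entry
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab):
    """Return (word_ids, values): ids index the vocabulary, values hold the numbers."""
    word_ids, values = [], []
    for token in TOKEN_RE.findall(text.lower()):
        if NUMBER_RE.fullmatch(token):
            word_ids.append(vocab[NUM_TOKEN])
            values.append(float(token))  # the number stays a number
        else:
            word_ids.append(vocab.get(token, vocab[UNK_TOKEN]))
            values.append(0.0)  # no numeric value for ordinary words
    return word_ids, values


vocab = build_vocab(["the price rose 3.5 percent"])
word_ids, values = encode("the price rose 12 percent", vocab)
print(word_ids)
print(values)
```

With this layout the model can embed `word_ids` as usual and receive `values` as an extra input channel, so a word index such as 5 is never confused with the quantity 5. Whether this helps depends on the task; scaling or bucketing the values may also be needed.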

How are the TokenEmbeddings in BERT created?

Question: In the paper describing BERT, there is this paragraph about WordPiece embeddings: We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary. The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them ...
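
The packing of a sentence pair described in the quoted paragraph can be sketched as below. This is a simplified illustration: it assumes the two sentences have already been split into WordPiece tokens (a plain whitespace split stands in for that step here), and `pack_pair` is a hypothetical helper name, not a function from any BERT library.

```python
def pack_pair(tokens_a, tokens_b):
    """Pack two token lists into one sequence: [CLS] A [SEP] B [SEP].

    Returns the packed tokens plus segment ids, where 0 marks the first
    sentence (including [CLS] and its [SEP]) and 1 marks the second.
    """
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


# Whitespace splitting is only a stand-in for real WordPiece tokenization.
tokens_a = "my dog is cute".split()
tokens_b = "he likes playing".split()
tokens, segment_ids = pack_pair(tokens_a, tokens_b)
print(tokens)
print(segment_ids)
```

In the model itself, each token is then looked up in a learned embedding table, and the token, segment, and position embeddings are summed to form the input representation.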

Removing named entities from a document using spacy

Question: I have tried to remove the words that spaCy considers named entities from a document, basically removing "Sweden" and "Nokia" from the example string. I could not find a way around the problem that entities are stored as spans, so comparing them with single tokens from a spaCy Doc raises an error. In a later step, this process is supposed to become a function applied to several text documents stored in a pandas DataFrame. I would appreciate any kind of help and ...
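
Because entities are spans rather than single tokens, one way around the comparison problem is to work with character offsets instead. The sketch below is library-free and only illustrates the idea: `remove_spans` is a hypothetical helper, the sentence is invented, and the offsets are computed by hand here, whereas in spaCy they would presumably come from each entity's `start_char` and `end_char`.

```python
def remove_spans(text, spans):
    """Delete each (start, end) character span from text and tidy whitespace."""
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        pieces.append(text[cursor:start])  # keep the text before this span
        cursor = max(cursor, end)          # skip the span, tolerating overlaps
    pieces.append(text[cursor:])           # keep whatever follows the last span
    return " ".join("".join(pieces).split())


# Hypothetical sentence; the offsets stand in for entity spans from an NER model.
text = "Nokia opened a new office in Sweden last year."
entities = ["Sweden", "Nokia"]
spans = [(text.index(e), text.index(e) + len(e)) for e in entities]
cleaned = remove_spans(text, spans)
print(cleaned)
```

A function of this shape takes a string and returns a string, so it could be applied to a pandas column of documents once the spans for each document are available.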

How to identify Abbreviations/Acronyms and expand them in spaCy?

Question: I have a large (~50k) term list, and a number of these key phrases/terms have corresponding acronyms/abbreviations. I need a fast way of finding either the abbreviation or its expanded form (e.g. MS -> Microsoft) and then replacing it with the expanded form plus the abbreviation (e.g. Microsoft -> Microsoft (MS) or MS -> Microsoft (MS)). I am very new to spaCy, so my naive approach was going to be to use spacy_lookup and use both the abbreviation and the expanded ...
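
The replacement rule described in the question (MS -> Microsoft (MS), Microsoft -> Microsoft (MS)) can be sketched without spaCy as a single regular-expression pass. This is a minimal illustration under stated assumptions, not the approach from the thread: `build_expander` is a hypothetical helper, matching is case-sensitive, and performance on a ~50k term list has not been checked.

```python
import re


def build_expander(abbreviations):
    """Build an expand(text) function from a {short form: long form} mapping."""
    canonical = {}
    for short, full in abbreviations.items():
        combined = f"{full} ({short})"
        canonical[short] = combined
        canonical[full] = combined
        canonical[combined] = combined  # already expanded: leave unchanged
    # Longest alternatives first, so "Microsoft (MS)" wins over "Microsoft".
    alternatives = sorted(canonical, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(a) for a in alternatives) + r")(?!\w)"
    )

    def expand(text):
        # One pass over the text, so inserted text is never matched again.
        return pattern.sub(lambda match: canonical[match.group(0)], text)

    return expand


expand = build_expander({"MS": "Microsoft"})
result = expand("MS released a tool. Microsoft (MS) is large. Microsoft hired.")
print(result)
```

Doing the substitution in one pass keeps already expanded text such as "Microsoft (MS)" from being expanded twice, and the lookarounds keep "MS" from matching inside longer words such as "MSN".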

Spacy lemmatization of a single word

Question: I am trying to get the lemmatized version of a single word. Is there a way to do this using spaCy (a fantastic Python NLP library)? Below is the code I have tried, but it does not work:

    from spacy.lemmatizer import Lemmatizer
    from spacy.lookups import Lookups
    lookups = Lookups()
    lemmatizer = Lemmatizer(lookups)
    word = "ducks"
    lemmas = lemmatizer.lookup(word)
    print(lemmas)

The result I was hoping for was that the word "ducks" (plural) would result in "duck" (singular). Unfortunately, "ducks" ...