nlp

How to train a model that will result in the similarity score between two news titles?

自作多情 submitted on 2020-07-22 21:40:04
Question: I am trying to build a fake news classifier and I am quite new to this field. I have a column "title_1_en", which holds the title of a fake news item, and another column called "title_2_en". There are three target labels, "agreed", "disagreed", and "unrelated", indicating whether the title in "title_2_en" agrees with, disagrees with, or is unrelated to the title in the first column. I have tried calculating basic cosine similarity between the two titles after converting the words of the sentences into vectors. This has
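For reference, the baseline mentioned in the question (cosine similarity between two vectorized titles) can be sketched as follows. This is a minimal illustration that assumes a plain bag-of-words vectorization; the question does not say which vectors were actually used, and the function names here are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(title: str) -> Counter:
    """Lowercase a title and count its word tokens (assumed vectorization)."""
    return Counter(re.findall(r"[a-z0-9']+", title.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty title has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def title_similarity(title_1: str, title_2: str) -> float:
    """Similarity score between two titles, in the range 0.0 to 1.0."""
    return cosine_similarity(bag_of_words(title_1), bag_of_words(title_2))
```

Note that a single similarity score can separate "unrelated" pairs from related ones, but it cannot by itself tell "agreed" from "disagreed": a title and its negation share almost all of their words. One option is to feed the score into a three-class classifier as one feature among several.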

Added layer must be an instance of class layer

放肆的年华 submitted on 2020-07-22 05:51:08
Question: I am building a Bi-LSTM network and have included an attention layer in it, but it raises an error saying that the added layer must be an instance of class layer. Some of the libraries I have imported are:
    from keras.models import Model, Sequential
    from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding, Bidirectional, Conv1D, Flatten, GlobalMaxPooling1D, SpatialDropout1D
    from tensorflow import keras
    from tensorflow.keras import layers
    from tensorflow.keras import backend

How to treat numbers inside text strings when vectorizing words?

别来无恙 submitted on 2020-07-18 11:34:37
Question: If I have a text string to be vectorized, how should I handle the numbers inside it? Or, if I feed a neural network with both numbers and words, how can I keep the numbers as numbers? I am planning to make a dictionary of all my words (as suggested here). In this case all strings will become arrays of numbers. How should I handle characters that are numbers? How can I output a vector that does not mix the word index with the number character? Does converting numbers to strings weaken the information i
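One possible way to keep word indices and numeric values apart is sketched below. It is an assumption, not something taken from the question: every number is mapped to a single shared placeholder index, and its numeric value is carried in a parallel list that can be fed to the network as a separate feature. The names `NUM_TOKEN`, `UNK_TOKEN`, `build_vocab`, and `encode` are hypothetical.

```python
import re

NUM_TOKEN = "<num>"  # hypothetical placeholder shared by all numbers
UNK_TOKEN = "<unk>"  # hypothetical placeholder for out-of-vocabulary words

# A token is either a number (integer or decimal) or a run of letters.
_TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|[a-z]+")


def tokenize(text):
    """Split lowercased text into word tokens and number tokens."""
    return _TOKEN_RE.findall(text.lower())


def is_number(token):
    """Number tokens produced by tokenize() always start with a digit."""
    return token[0].isdigit()


def build_vocab(texts):
    """Assign an index to every word; numbers never get their own index."""
    vocab = {UNK_TOKEN: 0, NUM_TOKEN: 1}
    for text in texts:
        for token in tokenize(text):
            if not is_number(token) and token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def encode(text, vocab):
    """Return (ids, values): word indices plus a parallel numeric channel."""
    ids, values = [], []
    for token in tokenize(text):
        if is_number(token):
            ids.append(vocab[NUM_TOKEN])
            values.append(float(token))  # the number survives as a number
        else:
            ids.append(vocab.get(token, vocab[UNK_TOKEN]))
            values.append(0.0)
    return ids, values
```

With this layout the index sequence never contains a raw number, so a vocabulary index such as 3 cannot be confused with the quantity 3. Whether the numeric channel should be scaled or bucketed depends on the data.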

How are the TokenEmbeddings in BERT created?

白昼怎懂夜的黑 submitted on 2020-07-08 22:35:49
Question: In the paper describing BERT, there is this paragraph about WordPiece embeddings: We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary. The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them
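The tokenization and packing steps described in the quoted paragraph can be sketched as below. This is a simplified illustration with a toy vocabulary, not BERT's actual tokenizer: it shows greedy longest-match-first WordPiece splitting and the packing of a sentence pair with `[CLS]`, `[SEP]`, and segment ids, which the BERT paper goes on to describe. The function names and the vocabulary are hypothetical. The token embeddings themselves are rows of a learned matrix indexed by the resulting vocabulary ids, which this sketch does not model.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def pack_pair(tokens_a, tokens_b):
    """Pack two token lists as [CLS] A [SEP] B [SEP] with segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


# Toy vocabulary for illustration only; the real one has about 30,000 entries.
TOY_VOCAB = {"the", "play", "##ing", "##ed", "un", "##play"}
```

For example, `wordpiece("playing", TOY_VOCAB)` yields `["play", "##ing"]`, and each resulting token is then looked up in the vocabulary to get the index of its embedding row.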

Removing named entities from a document using spacy

自闭症网瘾萝莉.ら submitted on 2020-07-08 20:36:15
Question: I have tried to remove words from a document that spaCy considers to be named entities, so basically removing "Sweden" and "Nokia" from the string example. I could not find a way to work around the problem that entities are stored as spans, so comparing them with single tokens from a spaCy doc raises an error. In a later step, this process is supposed to become a function applied to several text documents stored in a pandas data frame. I would appreciate any kind of help and
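One way to sidestep the span-versus-token comparison is to work with character offsets instead of tokens. The sketch below is library-free and assumes the entity offsets are already known; in spaCy they would come from the `start_char` and `end_char` attributes of each span in `doc.ents`. The sentence and the helper name `remove_spans` are hypothetical, since the original example string is not shown.

```python
def remove_spans(text, spans):
    """Delete the given (start, end) character spans from text.

    Spans may be unsorted or overlapping. Whitespace left behind by a
    removed span is collapsed to single spaces.
    """
    kept = []
    cursor = 0
    for start, end in sorted(spans):
        kept.append(text[cursor:start])  # empty if this span overlaps the last
        cursor = max(cursor, end)
    kept.append(text[cursor:])
    return " ".join("".join(kept).split())


# Hypothetical example sentence; the offsets stand in for spaCy entity spans.
text = "Nokia opened an office in Sweden."
entity_spans = [
    (text.index(name), text.index(name) + len(name))
    for name in ("Nokia", "Sweden")
]
cleaned = remove_spans(text, entity_spans)
```

Because the function takes and returns plain strings, it can be wrapped in a function that runs the NLP pipeline and then applied to a pandas column with `Series.apply`.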

How to identify Abbreviations/Acronyms and expand them in spaCy?

折月煮酒 submitted on 2020-07-08 17:10:39
Question: I have a large (~50k) term list, and a number of these key phrases / terms have corresponding acronyms / abbreviations. I need a fast way of finding either the abbreviation or its expanded form (e.g. MS -> Microsoft) and then replacing it with the full expanded form plus the abbreviation (e.g. Microsoft -> Microsoft (MS) or MS -> Microsoft (MS)). I am very new to spaCy, so my naive approach was going to be to use spacy_lookup and use both the abbreviation and the expanded
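The replacement rule described in the question can be sketched with a single compiled regular expression. This is a plain `re`-based illustration rather than a spaCy or spacy_lookup solution, and the helper name `build_expander` is hypothetical. It assumes case-sensitive matching and that every term starts and ends with a word character.

```python
import re


def build_expander(abbrev_to_full):
    """Return a function that rewrites both forms to 'Full (ABBR)'."""
    canonical = {}
    for abbr, full in abbrev_to_full.items():
        target = f"{full} ({abbr})"
        canonical[abbr] = target
        canonical[full] = target

    # Longest alternatives first, so a long term wins over a shorter
    # term that is a prefix of it.
    terms = sorted(canonical, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in terms)
    abbrs = "|".join(re.escape(abbr) for abbr in abbrev_to_full)

    # The optional tail swallows an already present " (ABBR)", so running
    # the expander on expanded text does not nest the parentheses.
    pattern = re.compile(rf"\b({alternation})\b(?:\s*\((?:{abbrs})\))?")

    def expand(text):
        return pattern.sub(lambda match: canonical[match.group(1)], text)

    return expand


expand = build_expander({"MS": "Microsoft"})
```

Here `expand("MS hired")` and `expand("Microsoft hired")` both give `"Microsoft (MS) hired"`. How well one large alternation performs with roughly 50k terms would need to be measured.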

Spacy lemmatization of a single word

99封情书 submitted on 2020-07-07 07:32:15
Question: I am trying to get the lemmatized version of a single word. Is there a way to do this using spaCy (a fantastic Python NLP library)? Below is the code I have tried, but it does not work:
    from spacy.lemmatizer import Lemmatizer
    from spacy.lookups import Lookups
    lookups = Lookups()
    lemmatizer = Lemmatizer(lookups)
    word = "ducks"
    lemmas = lemmatizer.lookup(word)
    print(lemmas)
The result I was hoping for was that the word "ducks" (plural) would result in "duck" (singular). Unfortunately, "ducks"