nlp

How to transform multiple features in a Pipeline using FeatureUnion?

Submitted by 南楼画角 on 2020-06-16 06:15:55
Question: I have a pandas data frame that contains information about messages sent by users. For my model, I'm interested in predicting the missing recipients of a message, i.e. given recipients A, B, C of a message, I want to predict who else should have been part of the recipients. I'm doing multi-label classification using OneVsRestClassifier and LinearSVC. For features, I want to use the recipients of the message, the subject, and the body. Since recipients is a list of users, I want to transform that column using …
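
A minimal sketch of one way to combine these features in a FeatureUnion, assuming the recipients, subject and body column names from the question; the toy data frame and the "missing" label column are made up for illustration. Each branch selects one column, the recipient lists are turned into indicator features by a CountVectorizer whose analyzer passes the lists straight through, and the result feeds OneVsRestClassifier(LinearSVC()):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy stand-in for the message data frame (column names are assumptions).
df = pd.DataFrame({
    "recipients": [["A", "B"], ["B", "C"], ["A", "C"]],
    "subject": ["budget review", "lunch plans", "budget follow-up"],
    "body": ["please see attached", "who is free today?", "numbers look off"],
    "missing": [["C"], ["A"], ["B"]],   # hypothetical multi-label target
})

def pick(col):
    # selects a single DataFrame column for the next step in the branch
    return FunctionTransformer(lambda frame: frame[col], validate=False)

features = FeatureUnion([
    ("recipients", Pipeline([
        ("pick", pick("recipients")),
        # each recipient list is already a list of tokens, so the analyzer
        # just returns it; binary=True gives one-hot style indicators
        ("onehot", CountVectorizer(analyzer=lambda recips: recips, binary=True)),
    ])),
    ("subject", Pipeline([("pick", pick("subject")), ("tfidf", TfidfVectorizer())])),
    ("body", Pipeline([("pick", pick("body")), ("tfidf", TfidfVectorizer())])),
])

model = Pipeline([("features", features), ("clf", OneVsRestClassifier(LinearSVC()))])

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(df["missing"])   # binary indicator matrix for multi-label y
model.fit(df, Y)
print(mlb.inverse_transform(model.predict(df)))

ColumnTransformer from sklearn.compose is an alternative that selects DataFrame columns directly, without the FunctionTransformer selectors.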

Inverse Document Frequency Formula

Submitted by 谁说胖子不能爱 on 2020-06-15 07:25:38
Question: I'm having trouble manually calculating tf-idf values: Python scikit-learn keeps giving different values than I'd expect. I keep reading that idf(term) = log(# of docs / # of docs containing the term). If so, won't you get a divide-by-zero error when no document contains the term? To avoid that, I read that you use log(# of docs / (# of docs containing the term + 1)). But then if the term is in every document, you get log(n / (n + 1)), which is negative, and that doesn't really make sense to me. What am …
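
For reference, scikit-learn's TfidfVectorizer does not use the plain log(N / df) formula: with the default smooth_idf=True it computes idf(t) = ln((1 + N) / (1 + df(t))) + 1, which avoids the division by zero and keeps a term that occurs in every document at weight 1 instead of going negative. A small sanity check against the fitted idf_ values (the toy documents are made up):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran"]

vec = TfidfVectorizer(smooth_idf=True)   # smooth_idf=True is the default
vec.fit(docs)

n = len(docs)
for term, col in vec.vocabulary_.items():
    df_t = sum(term in d.split() for d in docs)    # document frequency of the term
    manual = np.log((1 + n) / (1 + df_t)) + 1      # smoothed idf, natural log
    print(term, round(manual, 4), round(vec.idf_[col], 4))   # the two values match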

Computing TF-IDF on the whole dataset or only on training data?

Submitted by 人走茶凉 on 2020-06-13 18:45:45
Question: In chapter seven of the book "TensorFlow Machine Learning Cookbook", the author, while pre-processing the data, uses scikit-learn's fit_transform function to get the tf-idf features of the text for training. The author passes all of the text data to the function before splitting it into train and test sets. Is that the right thing to do, or should the data be split first, with fit_transform applied to the training set and transform to the test set? Answer 1: I have not read the book and I am not sure whether this is actually a mistake in the book …
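
For context, the leakage-free pattern is to fit the vectorizer on the training split only and reuse it on the test split; a minimal sketch with made-up data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

texts = ["spam spam spam", "hello old friend", "cheap pills here", "meeting at noon"]
labels = [1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=0)

tfidf = TfidfVectorizer()
X_train_vec = tfidf.fit_transform(X_train)   # learn vocabulary and idf on train only
X_test_vec = tfidf.transform(X_test)         # reuse the fitted vocabulary on test

Fitting on the full dataset lets vocabulary and document frequencies from the test set leak into the features, which can make the evaluation look better than it would be on genuinely unseen text.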

Difference between max length of word ngrams and size of context window

Submitted by 若如初见. on 2020-06-13 08:47:45
Question: In the description of the fastText Python library (https://github.com/facebookresearch/fastText/tree/master/python), the arguments for training a supervised model include, among others: ws: size of the context window; wordNgrams: max length of word ngram. If I understand correctly, both of them are responsible for taking the surrounding words of a word into account, but what is the clear difference between them? Answer 1: First, we use the train_unsupervised API to …
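
As a rough illustration of where each argument applies (the file names are placeholders, and labeled.txt is assumed to be in fastText's __label__ format): ws sets the context window of the unsupervised skipgram/cbow objective, while wordNgrams adds word n-gram features to the supervised classifier:

import fasttext

# ws: how many surrounding words the skipgram/cbow objective looks at
emb = fasttext.train_unsupervised("corpus.txt", model="skipgram", ws=5)

# wordNgrams: up to which length word n-grams (here unigrams and bigrams)
# are added as features when training the classifier
clf = fasttext.train_supervised("labeled.txt", wordNgrams=2)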

eli5: show_weights() with two labels

Submitted by 这一生的挚爱 on 2020-06-13 06:00:31
Question: I'm trying eli5 in order to understand the contribution of terms to the prediction of certain classes. You can run this script:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups

#categories = ['alt.atheism', 'soc.religion.christian']
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics']
np.random.seed(1)
train …
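
A minimal sketch of how show_weights is usually called, fitting the vectorizer and classifier separately and handing the vectorizer over via vec= (max_iter and top are arbitrary choices here):

import eli5
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics']
np.random.seed(1)
train = fetch_20newsgroups(subset='train', categories=categories)

vec = CountVectorizer()
X = vec.fit_transform(train.data)
clf = LogisticRegression(max_iter=1000).fit(X, train.target)

# Renders an HTML table in a notebook: one weight column per class.
eli5.show_weights(clf, vec=vec, target_names=train.target_names, top=15)

With exactly two categories, scikit-learn's LogisticRegression stores a single coefficient vector, so eli5 shows a single weight column: positive weights push toward the second class and negative weights toward the first.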

BERT sentence embedding by summing last 4 layers

Submitted by 末鹿安然 on 2020-06-11 07:55:27
Question: I used Chris McCormick's tutorial on BERT with pytorch-pretrained-bert to get a sentence embedding as follows:

tokenized_text = tokenizer.tokenize(marked_text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
segments_ids = [1] * len(tokenized_text)
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()
with torch.no_grad():
    encoded_layers, _ = model(tokens_tensor, …
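
Staying with the same pytorch-pretrained-bert API, one way to finish the computation is to stack the last four encoder layers, sum them per token, and then mean-pool over tokens; mean-pooling is just one common pooling choice, and the example sentence is made up:

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
marked_text = "[CLS] the quick brown fox jumps over the lazy dog [SEP]"

tokenized_text = tokenizer.tokenize(marked_text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
segments_ids = [1] * len(tokenized_text)
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

model = BertModel.from_pretrained('bert-base-uncased')
model.eval()
with torch.no_grad():
    # encoded_layers: list of 12 tensors, each of shape [batch, seq_len, 768]
    encoded_layers, _ = model(tokens_tensor, segments_tensors)

# Sum the last four encoder layers token-wise, then average over tokens
# to obtain a single fixed-size sentence vector.
token_vecs = torch.stack(encoded_layers[-4:]).sum(dim=0)   # [batch, seq_len, 768]
sentence_embedding = token_vecs.mean(dim=1).squeeze(0)     # [768]
print(sentence_embedding.shape)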

Function call stack: keras_scratch_graph Error

Submitted by 筅森魡賤 on 2020-06-10 10:44:38
Question: I am reimplementing a text-to-speech project. I am running into a "Function call stack: keras_scratch_graph" error in the decoder part. The network architecture is from the Deep Voice 3 paper. I am using Keras from TF 2.0 on Google Colab. Below is the code for the decoder Keras model.

y1 = tf.ones(shape=(16, 203, 320))

def Decoder(name="decoder"):
    # Decoder Prenet
    din = tf.concat((tf.zeros_like(y1[:, :1, -hp.mel:]), y1[:, :-1, -hp.mel:]), 1)
    keys = K.Input(shape=(180, 256), batch_size=16, name="keys") …
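
For reference, the prenet's teacher-forcing shift from the snippet can be exercised on its own under eager TF 2.x; hp.mel is not defined in the excerpt, so 80 mel bins is assumed below:

import tensorflow as tf

n_mel = 80                              # assumed value for hp.mel
y1 = tf.ones(shape=(16, 203, 320))      # batch of 16, 203 decoder steps, 320-dim frames

# Prepend a zero frame and drop the last one, so that decoder step t only
# ever sees mel frames from steps earlier than t (teacher forcing).
din = tf.concat(
    (tf.zeros_like(y1[:, :1, -n_mel:]), y1[:, :-1, -n_mel:]),
    axis=1,
)
print(din.shape)   # (16, 203, 80)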

How to get probability of prediction per entity from Spacy NER model?

Submitted by て烟熏妆下的殇ゞ on 2020-06-10 07:14:11
Question: I used this official example code to train an NER model from scratch using my own training samples. When I predict on new text with this model, I want to get the probability of the prediction for each entity.

# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
for text, _ in TRAIN_DATA:
    doc = nlp2(text)
    print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
    print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

I am unable to find a method in …
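
spaCy's standard greedy doc.ents output does not carry probabilities; the workaround usually pointed to (for spaCy 2.x only, and the beam_parse / get_beam_parses calls below are assumptions based on that recipe, not part of spaCy 3's API) is to re-run the entity recognizer with beam search and sum the scores of the analyses in which each entity appears. The text and model path are placeholders:

from collections import defaultdict
import spacy

output_dir = "path/to/saved/model"      # the same directory the model was saved to
nlp2 = spacy.load(output_dir)
ner = nlp2.get_pipe("ner")

text = "John Smith flew to Paris last week."

# Run the pipeline without NER so the doc is not already committed to one
# analysis, then re-run the recognizer with beam search to keep alternatives.
with nlp2.disable_pipes("ner"):
    doc = nlp2(text)
beams = ner.beam_parse([doc], beam_width=16, beam_density=0.0001)

entity_scores = defaultdict(float)
for beam in beams:
    for score, ents in ner.moves.get_beam_parses(beam):
        for start, end, label in ents:
            entity_scores[(start, end, label)] += score

for (start, end, label), score in entity_scores.items():
    print(doc[start:end].text, label, round(score, 3))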