scikit-learn

ML model is failing to impute values

人盡茶涼 posted on 2020-12-15 06:08:41
Question: I've tried creating an ML model to make some predictions, but I keep running into a stumbling block: the code seems to ignore the imputation instructions I give it, resulting in the following error:

    ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Here's my code:

    import pandas as pd
    import numpy as np
    from sklearn.ensemble import AdaBoostRegressor
    from category_encoders import CatBoostEncoder
    from sklearn.compose import make_column_transformer
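The question's code is cut off here, so the actual cause is unknown. One common reason for this error is that the transformers passed to a column transformer run side by side on the original columns rather than one after another, so an encoder can still receive the NaNs that a separate imputer entry was meant to fill. The sketch below chains imputation and encoding per column group. The toy data and column names are hypothetical, and OneHotEncoder stands in for the CatBoostEncoder from the question so that only scikit-learn is needed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import AdaBoostRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with missing values in both column types.
X = pd.DataFrame({
    "size": [1.0, np.nan, 3.0, 4.0, 5.0, np.nan],
    "colour": ["red", "blue", np.nan, "red", "blue", "red"],
})
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Numeric columns: fill gaps with the median.
numeric = make_pipeline(SimpleImputer(strategy="median"))

# Categorical columns: impute first, then encode, in that order.
categorical = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)

# Each column group gets its own sequential sub-pipeline.
preprocess = make_column_transformer(
    (numeric, ["size"]),
    (categorical, ["colour"]),
)

model = make_pipeline(preprocess, AdaBoostRegressor(random_state=0))
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Missing values in the target y are not handled by any of these steps and would need to be dropped or filled before fitting.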

Dimension mismatch when I try to apply tf-idf to test set

巧了我就是萌 posted on 2020-12-15 04:24:48
Question: I am trying to apply a new pre-processing algorithm to my dataset, following this answer: Encoding text in ML classifier. What I have tried so far is the following:

    def test_tfidf(data, ngrams=1):
        df_temp = data.copy(deep=True)
        df_temp = basic_preprocessing(df_temp)
        tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, ngrams))
        tfidf_vectorizer.fit(df_temp['Text'])
        list_corpus = df_temp["Text"].tolist()
        list_labels = df_temp["Label"].tolist()
        X = tfidf_vectorizer.transform(list_corpus)
        return X,
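The snippet is truncated, but a function that fits a fresh TfidfVectorizer on whatever data it receives will learn a different vocabulary for the training and test sets, which is a typical source of a dimension mismatch. A minimal sketch of the usual remedy, assuming hypothetical example texts: fit the vectorizer once on the training text and reuse the same fitted object to transform the test text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical example texts; the question's data is not shown.
train_texts = [
    "the cat sat on the mat",
    "dogs chase cats",
    "birds sing in the morning",
]
test_texts = [
    "a cat chased a bird",
    "completely unseen vocabulary here",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 1))

# Learn the vocabulary and idf weights from the training text only.
X_train = vectorizer.fit_transform(train_texts)

# Reuse the fitted vectorizer: words unseen in training are ignored,
# so the test matrix has the same number of columns.
X_test = vectorizer.transform(test_texts)

print(X_train.shape, X_test.shape)
```

Returning the fitted vectorizer alongside X from the preprocessing function is one way to make it available for the test set later.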

Why is my confusion matrix returning only one number?

。_饼干妹妹 posted on 2020-12-13 06:31:57
Question: I'm doing a binary classification. Whenever my prediction equals the ground truth, sklearn.metrics.confusion_matrix returns a single value. Isn't that a problem?

    from sklearn.metrics import confusion_matrix
    print(confusion_matrix([True, True], [True, True]))  # [[2]]

I would expect something like:

    [[2 0]
     [0 0]]

Answer 1: Solution: to get the desired output, pass labels=[True, False]:

    from sklearn.metrics import confusion_matrix
    cm = confusion_matrix(y_true=
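The answer is cut off above. As a self-contained illustration of the labels argument it refers to, the following sketch passes both classes explicitly so that the matrix keeps its 2x2 shape even when only one class occurs in the data. The input lists are the ones from the question.

```python
from sklearn.metrics import confusion_matrix

y_true = [True, True]
y_pred = [True, True]

# Without `labels`, the matrix is built only from the classes that
# actually appear in y_true and y_pred. Listing both classes fixes
# the shape and the row/column order.
cm = confusion_matrix(y_true=y_true, y_pred=y_pred, labels=[True, False])
print(cm)
# [[2 0]
#  [0 0]]
```

The order of labels determines the order of rows and columns, so labels=[False, True] would place the count of 2 in the bottom-right cell instead.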

Why does shuffling training data affect my random forest classifier's accuracy?

百般思念 posted on 2020-12-13 06:08:46
Question: The same question has been asked before, but since the OP didn't post the code, not much helpful information was given. I'm having basically the same problem: for some reason, shuffling the data gives my random forest classifier a big accuracy gain (from 45% to 94%!). (In my case removing duplicates also affected the accuracy, but that may be a discussion for another day.) Based on my understanding of how the RF algorithm works, this really should not happen. My data are merged from several files
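The question is truncated, so the following is only one possible explanation, not a diagnosis. If data merged from several files ends up ordered by class or by source, an unshuffled train/test split can put classes in the test set that the model never saw during training. The sketch below uses hypothetical data stored class by class to show how the class mix of each split changes with shuffling.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 90 rows stored class by class, as if three files
# had been concatenated one after another.
y = np.repeat([0, 1, 2], 30)
X = np.arange(90).reshape(-1, 1)

# Ordered split: the last 30 rows become the test set.
_, _, y_train_ordered, y_test_ordered = train_test_split(
    X, y, test_size=30, shuffle=False
)

# Shuffled, stratified split: every class appears in both parts.
_, _, y_train_mixed, y_test_mixed = train_test_split(
    X, y, test_size=30, shuffle=True, stratify=y, random_state=0
)

ordered_train_classes = sorted(set(y_train_ordered.tolist()))
ordered_test_classes = sorted(set(y_test_ordered.tolist()))
mixed_test_classes = sorted(set(y_test_mixed.tolist()))

print("ordered train classes:", ordered_train_classes)
print("ordered test classes: ", ordered_test_classes)
print("shuffled test classes:", mixed_test_classes)
```

Comparing the class counts of the training and test parts before and after shuffling is a quick way to check whether this applies to a given dataset.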

How to view tf-idf score against each word

我是研究僧i posted on 2020-12-13 05:56:40
Question: I want to know the tf-idf score of each word in my document. However, the code only returns the values in a matrix, and I would like a representation that shows the tf-idf score against each word. The text has been processed and the code works; I just want to change the way the result is presented. Code:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    bow_transformer = CountVectorizer(analyzer=text_process).fit(df["comments"].head())

KNN for Text Classification using TF-IDF scores

对着背影说爱祢 posted on 2020-12-13 03:59:09
Question: I have a CSV file (corpus.csv) with graded abstracts (text) in the following format:

    Institute, Score, Abstract
    ----------------------------------------------------------------------
    UoM, 3.0, Hello, this is abstract one
    UoM, 3.2, Hello, this is abstract two and yet counting.
    UoE, 3.1, Hello, yet another abstract but this is a unique one.
    UoE, 2.2, Hello, please no more abstract.

I am trying to create a KNN classification program in Python that is able to get a user input
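The question is cut off before it states what should be predicted. As a minimal sketch, assuming the goal is to predict the Institute of a user-supplied abstract, a TfidfVectorizer and a KNeighborsClassifier can be chained in one pipeline. The four abstracts come from the question; using Institute as the target and n_neighbors=1 are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# The four abstracts listed in the question.
abstracts = [
    "Hello, this is abstract one",
    "Hello, this is abstract two and yet counting.",
    "Hello, yet another abstract but this is a unique one.",
    "Hello, please no more abstract.",
]
# Assumed target: the institute each abstract belongs to.
institutes = ["UoM", "UoM", "UoE", "UoE"]

# The pipeline converts text to tf-idf vectors, then classifies a
# query by its single nearest training abstract.
model = make_pipeline(
    TfidfVectorizer(),
    KNeighborsClassifier(n_neighbors=1),
)
model.fit(abstracts, institutes)

# Stand-in for the user input mentioned in the question.
user_input = "Hello, please no more abstract."
prediction = model.predict([user_input])
print(prediction[0])
```

With a realistic corpus, a larger n_neighbors and a held-out test set would give a more reliable picture than this four-row example.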
