scikit-learn | 易学教程

ML model is failing to impute values

Question: I've tried creating an ML model to make some predictions, but I keep running into a stumbling block. Namely, the code seems to be ignoring the imputation instructions I give it, resulting in the following error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Here's my code:

import pandas as pd
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from category_encoders import CatBoostEncoder
from sklearn.compose import make_column_transformer
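
The excerpt stops before the imputation step, so the actual cause cannot be read from it. As a rough sketch of one way to make sure missing values are filled before they reach `AdaBoostRegressor`, the example below puts a `SimpleImputer` inside a column transformer and chains it with the regressor. The toy data and the column names `age` and `income` are assumptions made for illustration, and the `CatBoostEncoder` step from the question is left out.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import AdaBoostRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; the column names are assumptions, not the asker's.
X = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0, np.nan, 52.0],
    "income": [30.0, 45.0, np.nan, 50.0, 38.0, 61.0],
})
y = np.array([1.0, 2.0, 3.0, 2.5, 1.5, 4.0])

# Every column that may contain NaN has to be listed in a transformer.
# Columns that fall through to remainder="passthrough" are NOT imputed.
preprocess = make_column_transformer(
    (SimpleImputer(strategy="median"), ["age", "income"]),
    remainder="passthrough",
)

# Imputation runs first, then the regressor sees a NaN-free matrix.
model = make_pipeline(preprocess, AdaBoostRegressor(random_state=0))
model.fit(X, y)
predictions = model.predict(X)

# Inspect what the regressor actually receives.
X_imputed = model[:-1].transform(X)
```

If a column with missing values is not named in any transformer, it passes through unchanged, which is one plausible way to end up with the `ValueError` quoted in the question.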

Dimension mismatch when I try to apply tf-idf to test set

Question: I am trying to apply a new pre-processing algorithm to my dataset, following this answer: Encoding text in ML classifier. What I have tried now is the following:

def test_tfidf(data, ngrams=1):
    df_temp = data.copy(deep=True)
    df_temp = basic_preprocessing(df_temp)
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, ngrams))
    tfidf_vectorizer.fit(df_temp['Text'])
    list_corpus = df_temp["Text"].tolist()
    list_labels = df_temp["Label"].tolist()
    X = tfidf_vectorizer.transform(list_corpus)
    return X,
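
The excerpt ends before the test-set step, so what follows is an assumption about a common cause of this error rather than a diagnosis. The function above fits a fresh `TfidfVectorizer` on whatever data it is given, so calling it once on the training set and once on the test set would produce two different vocabularies and therefore two different column counts. A minimal sketch with made-up sentences, fitting once on the training text and reusing the fitted vectorizer on the test text:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up example sentences, not the asker's data.
train_texts = ["the cat sat on the mat", "dogs chase cats", "the dog barked"]
test_texts = ["a cat barked at the dog", "unseen words are ignored"]

vectorizer = TfidfVectorizer(ngram_range=(1, 1))

# Learn the vocabulary and idf weights from the training text only.
X_train = vectorizer.fit_transform(train_texts)

# Reuse the fitted vectorizer: transform() maps the test text onto the
# training vocabulary, so both matrices have the same number of columns.
X_test = vectorizer.transform(test_texts)
```

Words that appear only in the test text are dropped by `transform`, which is what keeps the feature dimensions aligned.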

Why is my confusion matrix returning only one number?

Question: I'm doing a binary classification. Whenever my prediction equals the ground truth, sklearn.metrics.confusion_matrix returns a single value. Isn't that a problem?

from sklearn.metrics import confusion_matrix
print(confusion_matrix([True, True], [True, True]))  # [[2]]

I would expect something like:

[[2 0]
 [0 0]]

Answer 1: Solution: To get the desired output, pass labels=[True, False]:

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_true=
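
The answer's code is cut off in this excerpt. A minimal sketch of the suggestion, reusing the inputs from the question (the variable names are illustrative), might look like this:

```python
from sklearn.metrics import confusion_matrix

y_true = [True, True]
y_pred = [True, True]

# Without labels, only the classes present in the data are used,
# so a single observed class yields a 1x1 matrix.
cm_default = confusion_matrix(y_true, y_pred)
print(cm_default)  # [[2]]

# Listing every known class fixes the shape and the row/column order.
cm_labeled = confusion_matrix(y_true, y_pred, labels=[True, False])
print(cm_labeled)
# [[2 0]
#  [0 0]]
```

Rows and columns follow the order given in `labels`, so `[True, False]` puts the `True` class first, matching the layout the asker expected.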

Why does shuffling training data affect my random forest classifier's accuracy?

Question: The same question has been asked before, but since the OP didn't post the code, not much helpful information was given. I'm having basically the same problem: for some reason, shuffling the data gives my random forest classifier a big accuracy gain (from 45% to 94%!). (In my case removing duplicates also affected the accuracy, but that may be a discussion for another day.) Based on my understanding of how the RF algorithm works, this really should not happen. My data are merged from several files

how to view tf-idf score against each word

Question: I was trying to get the tf-idf score of each word in my document. However, it only returns the values in a matrix, whereas I would like to see the tf-idf scores listed against each word. I have processed the text and the code works; however, I want to change the way the result is presented. Code:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
bow_transformer = CountVectorizer(analyzer=text_process).fit(df["comments"].head())

KNN for Text Classification using TF-IDF scores

Question: I have a CSV file (corpus.csv) with graded abstracts (text) in the following format:

Institute, Score, Abstract
----------------------------------------------------------------------
UoM, 3.0, Hello, this is abstract one
UoM, 3.2, Hello, this is abstract two and yet counting.
UoE, 3.1, Hello, yet another abstract but this is a unique one.
UoE, 2.2, Hello, please no more abstract.

I am trying to create a KNN classification program in Python, which is able to get a user input
