scikit-learn

Is the predict_proba method of scikit learn's SGDClassifier thread safe?

☆樱花仙子☆ submitted on 2021-01-28 19:40:51
Question: I would like to expose a model built using sklearn.linear_model.SGDClassifier through a web API. Every web request would call the model's predict_proba method. However, for performance and consistency reasons, I will have just one instance of the model in the process; it would be created when the web application starts and would begin serving requests once training completes. This raises the question: is the predict_proba method of the model actually thread safe? Any help will be
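The scenario in this question can be smoke-tested with a short script. The following is only a sketch under stated assumptions: the toy data, the `modified_huber` loss, the thread count, and the helper name `score_batch` are all illustrative choices, not taken from the question. Matching outputs across threads are consistent with read-only prediction being safe, but they do not prove thread safety, and the check says nothing about calling `fit` or `partial_fit` while requests are being served.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Toy data (assumption); a real service would load its own training set.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# predict_proba needs a probabilistic loss; "modified_huber" is one such option.
model = SGDClassifier(loss="modified_huber", random_state=0).fit(X, y)

# Reference result computed on a single thread.
expected = model.predict_proba(X)


def score_batch(_):
    # Read-only use of the fitted model: nothing is refitted while serving.
    return model.predict_proba(X)


# Simulate 32 concurrent "requests" on 8 worker threads.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(score_batch, range(32)))

all_match = all(np.allclose(r, expected) for r in results)
print("all threads agree with the single-threaded result:", all_match)
```

If the model ever needs to be retrained while the service is running, a common design is to train a new instance off to the side and swap the reference once training finishes, rather than mutating the instance that is serving requests.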

Unable to do Stacking for a Multi-label classifier

 ̄綄美尐妖づ submitted on 2021-01-28 19:12:39
Question: I am working on a multi-label text classification problem (90 target labels in total). The data has around 100k records, with a long-tailed distribution and class imbalance. I am using the OAA (one-against-all) strategy. I am trying to create an ensemble using stacking. Text features: HashingVectorizer (2**20 features, char analyzer), then TSVD to reduce the dimensionality (n_components=200).
text_pipeline = Pipeline([
    ('hashing_vectorizer', HashingVectorizer(n_features=2**20, analyzer='char'))

Why do RandomForestClassifier on CPU (using SKLearn) and on GPU (using RAPIDS) get very different scores?

拟墨画扇 submitted on 2021-01-28 18:42:18
Question: I am using RandomForestClassifier on CPU with SKLearn and on GPU with RAPIDS. I am benchmarking these two libraries for speed-up and scoring using the Iris dataset (this is a first try; I am just starting with these two libraries and will switch to a better benchmarking dataset in the future). The problem is that when I measure the score on CPU I always get a value of 1.0, but when I measure the score on GPU I get a value that varies between 0.2 and 1.0, and I do not understand why this could be

How to get the feature names in a different pipeline in sklearn in python

落爺英雄遲暮 submitted on 2021-01-28 18:05:38
Question: I am using the following code (source) to concatenate multiple feature extraction methods.
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
iris = load_iris()
X, y = iris.data, iris.target
pca = PCA(n_components=2)
selection = SelectKBest(k=1)
# Build estimator from PCA and Univariate

sklearn use RandomizedSearchCV with custom metrics and catch Exceptions

為{幸葍}努か submitted on 2021-01-28 12:35:42
Question: I am using the RandomizedSearchCV function in sklearn with a Random Forest Classifier. To see different metrics, I am using custom scoring:
from sklearn.metrics import make_scorer, roc_auc_score, recall_score, matthews_corrcoef, balanced_accuracy_score, accuracy_score
acc = make_scorer(accuracy_score)
auc_score = make_scorer(roc_auc_score)
recall = make_scorer(recall_score)
mcc = make_scorer(matthews_corrcoef)
bal_acc = make_scorer(balanced_accuracy_score)
scoring = {"roc_auc_score": auc
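A minimal sketch of multi-metric scoring with RandomizedSearchCV follows. The toy data, the parameter grid, and the choice of `refit="mcc"` are assumptions made for illustration, and only a subset of the metrics from the excerpt is used. `error_score=np.nan` is the documented way to have a candidate whose fit raises recorded as NaN instead of aborting the whole search; whether that covers the exceptions the question has in mind cannot be told from the truncated excerpt.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             make_scorer, matthews_corrcoef, recall_score)
from sklearn.model_selection import RandomizedSearchCV

# Toy binary classification data (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# One scorer per metric; the dict keys become suffixes in cv_results_.
scoring = {
    "accuracy": make_scorer(accuracy_score),
    "recall": make_scorer(recall_score),
    "mcc": make_scorer(matthews_corrcoef),
    "balanced_accuracy": make_scorer(balanced_accuracy_score),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    # Illustrative search space (assumption).
    param_distributions={"n_estimators": [10, 20, 30], "max_depth": [2, 4, None]},
    n_iter=3,
    scoring=scoring,
    refit="mcc",          # with several metrics, refit must name one of them
    error_score=np.nan,   # a failing fit is scored as NaN instead of raising
    cv=3,
    random_state=0,
)
search.fit(X, y)

# Mean cross-validated score of each metric for every sampled candidate.
for name in scoring:
    print(name, search.cv_results_["mean_test_" + name])
```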

How does sklearn.cluster.KMeans handle an init ndarray parameter with missing centroids (available centroids less than n_clusters)?

蹲街弑〆低调 submitted on 2021-01-28 12:12:49
Question: In Python sklearn KMeans (see documentation), I was wondering what happens internally when passing an ndarray of shape (n, n_features) to the init parameter when n < n_clusters.
Does it drop the given centroids and just start a kmeans++ initialization, which is the default choice for the init parameter? (PDF paper kmeans++) (How does Kmeans++ work)
Does it consider the given centroids and fill the remaining centroids accordingly using kmeans++?
Does it consider the given centroids and fill
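The behaviour asked about here is easy to probe directly. The sketch below uses random toy data and a hypothetical variable name, `partial_init`, for an init array with fewer rows than n_clusters. In the scikit-learn releases I am aware of, the shape of an array passed to init is validated against n_clusters and a mismatch raises a ValueError rather than being filled in by kmeans++, but it is worth running this against the installed version instead of relying on that.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data (assumption).
rng = np.random.RandomState(0)
X = rng.rand(60, 2)

# Only 2 initial centroids are supplied although 3 clusters are requested.
partial_init = X[:2]

error_message = None
try:
    KMeans(n_clusters=3, init=partial_init, n_init=1).fit(X)
except ValueError as exc:
    # Record the validation error instead of letting it propagate.
    error_message = str(exc)

print("fit raised ValueError:", error_message is not None)
print(error_message)
```

If a partial set of centroids should be honoured, one option is to complete the array by hand (for example with randomly chosen data points) before passing it as init, so that it has exactly n_clusters rows.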

ModuleNotFoundError: No module named 'xgboost.sklearn'

こ雲淡風輕ζ submitted on 2021-01-28 11:54:12
Question: I'm trying to import xgboost into jupyter-notebook but get the following error:
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-9-a585b270d0df> in <module>
      1 import pandas as pd
      2 import numpy as np
----> 3 import xgboost
~/.local/lib/python3.6/site-packages/xgboost/__init__.py in <module>
     14 from . import tracker  # noqa
     15 from .tracker import RabitTracker  # noqa
---> 16 from . import dask
     17 try

Python package SHAP import

筅森魡賤 submitted on 2021-01-28 11:52:39
Question: I installed the Python package shap for plotting.
conda install -c conda-forge shap
After installing, I imported shap in a Jupyter notebook but got an error.
import shap
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-132-efbb001a1501> in <module>
----> 1 import shap
~\AppData\Local\Continuum\anaconda3\lib\site-packages\shap\__init__.py in <module>
      3 __version__ = '0.29.3'
      4
----> 5 from .explainers.kernel import

max_value and min_value for each column in scikit IterativeImputer

雨燕双飞 submitted on 2021-01-28 11:47:13
Question: I have a data set with 78 columns and 5707 rows. Almost every column has missing values, and I would like to impute them with IterativeImputer. If I understand it correctly, it makes a "smarter" imputation for each column based on the information from the other columns. However, when imputing, I do not want the imputed values to be less than the observed minimum or more than the observed maximum. I realize there are max_value and min_value parameters, but I do not want to impose a "global"
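One way to get per-column limits is sketched below, assuming a scikit-learn release in which min_value and max_value accept an array with one entry per feature (0.23 or later, as far as I know). The toy data and the names `col_min` and `col_max` are illustrative assumptions, not part of the question.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data with roughly 20% missing entries (assumption).
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 4))
X[rng.rand(100, 4) < 0.2] = np.nan

# Observed per-column minimum and maximum, ignoring missing entries.
col_min = np.nanmin(X, axis=0)
col_max = np.nanmax(X, axis=0)

# Pass one bound per feature instead of a single global scalar.
imputer = IterativeImputer(min_value=col_min, max_value=col_max, random_state=0)
X_imputed = imputer.fit_transform(X)

# Every value, imputed or observed, should now lie inside its column's range.
within_bounds = bool(np.all((X_imputed >= col_min) & (X_imputed <= col_max)))
print("all values within observed per-column range:", within_bounds)
```

On releases where only scalar bounds are accepted, a fallback is to clip the imputed array column by column after fit_transform, for example with np.clip(X_imputed, col_min, col_max).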

max_value and min_value for each column in scikit IterativeImputer

雨燕双飞 submitted on 2021-01-28 11:39:49
Question: I have a data set with 78 columns and 5707 rows. Almost every column has missing values, and I would like to impute them with IterativeImputer. If I understand it correctly, it makes a "smarter" imputation for each column based on the information from the other columns. However, when imputing, I do not want the imputed values to be less than the observed minimum or more than the observed maximum. I realize there are max_value and min_value parameters, but I do not want to impose a "global"