scikit-learn | 易学教程

Is the predict_proba method of scikit learn's SGDClassifier thread safe?

Question: I would like to expose a model built with sklearn.linear_model.SGDClassifier through a web API. Every web request would call the model's predict_proba method, but for performance and consistency reasons the process will hold just one instance of the model; it would be created when the web application starts and begin serving requests once training completes. This raises the question: is the model's predict_proba method actually thread safe? Any help will be
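The scenario described above can be probed empirically. The following is a minimal sketch, not a proof of thread safety: it assumes that prediction only reads the fitted coefficients (an assumption about the implementation, not a documented guarantee), and the synthetic data, batch count, and worker count are illustrative choices. It uses loss="modified_huber" because predict_proba is only available for probabilistic losses.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data standing in for the real training set (assumption).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# One shared model instance, trained once before any request is served.
model = SGDClassifier(loss="modified_huber", random_state=0).fit(X, y)

# Each batch plays the role of one incoming web request.
batches = np.array_split(X, 10)

# Reference results computed sequentially in the main thread.
expected = [model.predict_proba(batch) for batch in batches]

# The same calls issued concurrently against the shared instance.
with ThreadPoolExecutor(max_workers=8) as pool:
    concurrent_results = list(pool.map(model.predict_proba, batches))

# True if every concurrent result matches its sequential counterpart.
all_match = all(
    np.allclose(seq, par) for seq, par in zip(expected, concurrent_results)
)
print("concurrent results match sequential results:", all_match)
```

A matching result only shows that no interference was observed in this run. Calling fit or partial_fit on the same instance while requests are being served is a different situation and would need explicit locking.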

Unable to do Stacking for a Multi-label classifier

Question: I am working on a multi-label text classification problem (90 target labels in total). The data set has around 100k records, a long-tailed distribution, and class imbalance. I am using the OAA (one-against-all) strategy and am trying to create an ensemble using stacking. Text features: HashingVectorizer (number of features 2**20, char analyzer), followed by TSVD to reduce the dimensionality (n_components=200). text_pipeline = Pipeline([ ('hashing_vectorizer', HashingVectorizer(n_features=2**20, analyzer='char'))

Why do RandomForestClassifier on CPU (using SKLearn) and on GPU (using RAPIDS) get very different scores?

Question: I am using RandomForestClassifier on CPU with SKLearn and on GPU with RAPIDS. I am benchmarking the two libraries for speed-up and scoring using the Iris dataset (this is a first try; I will switch to a better benchmarking dataset later, as I am just starting with these two libraries). The problem is that the score measured on CPU is always 1.0, whereas the score measured on GPU varies between 0.2 and 1.0, and I do not understand why it could be

How to get the feature names in a different pipeline in sklearn in python

Question: I am using the following code (source) to concatenate multiple feature extraction methods. from sklearn.pipeline import Pipeline, FeatureUnion from sklearn.model_selection import GridSearchCV from sklearn.svm import SVC from sklearn.datasets import load_iris from sklearn.decomposition import PCA from sklearn.feature_selection import SelectKBest iris = load_iris() X, y = iris.data, iris.target pca = PCA(n_components=2) selection = SelectKBest(k=1) # Build estimator from PCA and Univariate

sklearn use RandomizedSearchCV with custom metrics and catch Exceptions

Question: I am using the RandomizedSearchCV function in sklearn with a Random Forest Classifier. To see different metrics, I am using custom scoring: from sklearn.metrics import make_scorer, roc_auc_score, recall_score, matthews_corrcoef, balanced_accuracy_score, accuracy_score acc = make_scorer(accuracy_score) auc_score = make_scorer(roc_auc_score) recall = make_scorer(recall_score) mcc = make_scorer(matthews_corrcoef) bal_acc = make_scorer(balanced_accuracy_score) scoring = {"roc_auc_score": auc

How does sklearn.cluster.KMeans handle an init ndarray parameter with missing centroids (available centroids less than n_clusters)?

Question: In Python's sklearn KMeans (see documentation), I was wondering what happens internally when an ndarray of shape (n, n_features) is passed to the init parameter and n < n_clusters. Does it drop the given centroids and just start a kmeans++ initialization, which is the default choice for the init parameter? (PDF paper kmeans++) (How does Kmeans++ work) Does it consider the given centroids and fill the remaining centroids accordingly using kmeans++? Does it consider the given centroids and fill
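One way to check the behaviour asked about above for a given scikit-learn installation is simply to try it. The sketch below passes two centroids while requesting three clusters and records what happens. The data and centroid values are made up for illustration. As far as I am aware, scikit-learn validates the shape of an array-valued init against n_clusters and raises a ValueError rather than filling in the missing centroids, but treat that as an assumption to verify on your own version.

```python
import numpy as np
from sklearn.cluster import KMeans

# Small synthetic data set (illustrative only).
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 2))

# Only 2 initial centroids supplied, although 3 clusters are requested.
partial_init = np.array([[0.0, 0.0], [1.0, 1.0]])

try:
    # n_init=1 because an explicit init array leaves nothing to re-randomize.
    KMeans(n_clusters=3, init=partial_init, n_init=1).fit(X)
    outcome = "fit succeeded"
except ValueError as exc:
    outcome = f"ValueError: {exc}"

print(outcome)
```

If the call is rejected, the missing centroids have to be supplied by the caller, for example by choosing extra starting points before constructing the estimator.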

ModuleNotFoundError: No module named 'xgboost.sklearn'

Question: I'm trying to import xgboost into jupyter-notebook but get the following error: --------------------------------------------------------------------------- ModuleNotFoundError Traceback (most recent call last) <ipython-input-9-a585b270d0df> in <module> 1 import pandas as pd 2 import numpy as np ----> 3 import xgboost ~/.local/lib/python3.6/site-packages/xgboost/__init__.py in <module> 14 from . import tracker # noqa 15 from .tracker import RabitTracker # noqa ---> 16 from . import dask 17 try

Python package SHAP import

Question: I installed the Python package shap for plotting. conda install -c conda-forge shap After installing, I imported shap in a Jupyter notebook but got an error. import shap --------------------------------------------------------------------------- ImportError Traceback (most recent call last) <ipython-input-132-efbb001a1501> in <module> ----> 1 import shap ~\AppData\Local\Continuum\anaconda3\lib\site-packages\shap\__init__.py in <module> 3 __version__ = '0.29.3' 4 ----> 5 from .explainers.kernel import

max_value and min_value for each column in scikit IterativeImputer

Question: I have a data set with 78 columns and 5707 rows. Almost every column has missing values, and I would like to impute them with IterativeImputer. If I understand it correctly, it makes a "smarter" imputation for each column based on the information from the other columns. However, when imputing, I do not want the imputed values to be less than the observed minimum or more than the observed maximum of that column. I realize there are max_value and min_value parameters, but I do not want to impose a "global"
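For the per-column bounds described above, a minimal sketch follows. It assumes scikit-learn 0.23 or later, where min_value and max_value accept an array-like with one entry per feature, so the observed column minima and maxima can be passed directly. The random data, the 20% missing rate, and the variable names are illustrative, not taken from the question.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data with different scales per column (illustrative only).
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4)) * np.array([1.0, 10.0, 100.0, 0.1])

# Knock out roughly 20% of the entries at random.
X_missing = X.copy()
X_missing[rng.rand(*X.shape) < 0.2] = np.nan

# Observed per-column bounds, ignoring the missing entries.
col_min = np.nanmin(X_missing, axis=0)
col_max = np.nanmax(X_missing, axis=0)

# One bound per feature instead of a single global scalar.
imputer = IterativeImputer(min_value=col_min, max_value=col_max, random_state=0)
X_imputed = imputer.fit_transform(X_missing)

# Check that every column stays within its own observed range.
within_bounds = bool(
    np.all(X_imputed >= col_min - 1e-9) and np.all(X_imputed <= col_max + 1e-9)
)
print("all values within per-column observed range:", within_bounds)
```

On older versions that only accept scalars, a possible fallback is to clip the imputed array column by column afterwards with np.clip, which bounds the final values but does not constrain the intermediate imputation rounds.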
