scikit-learn

Gaussian Process Regression: standard deviation meaning

不想你离开。 Submitted on 2021-01-28 05:58:20
Question: In the following code about Gaussian Process Regression (GPR):

    from sklearn.datasets import make_friedman2
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel

    X, y = make_friedman2(n_samples=500, noise=0, random_state=0)
    kernel = DotProduct() + WhiteKernel()
    gpr = GaussianProcessRegressor(kernel=kernel, random_state=0).fit(X, y)
    print(gpr.score(X, y))
    print(gpr.predict(X[:2, :], return_std=True))

What is the …
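For context: the second value returned by predict(..., return_std=True) is the standard deviation of the GP's posterior predictive distribution at each query point, i.e. the model's per-point uncertainty. A minimal sketch, assuming the gpr fitted in the question, that turns it into an approximate 95% credible band:

    import numpy as np
    from sklearn.datasets import make_friedman2
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel

    X, y = make_friedman2(n_samples=500, noise=0, random_state=0)
    gpr = GaussianProcessRegressor(kernel=DotProduct() + WhiteKernel(),
                                   random_state=0).fit(X, y)

    # mean is the posterior predictive mean; std is the per-point
    # standard deviation of the posterior predictive distribution.
    mean, std = gpr.predict(X[:2, :], return_std=True)

    # Under the Gaussian posterior, mean +/- 1.96 * std is an
    # approximate 95% credible interval for each prediction.
    lower, upper = mean - 1.96 * std, mean + 1.96 * std
    print(np.column_stack([lower, mean, upper]))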

How to get tf-idf matrix of a large size corpus, where features are pre-specified?

会有一股神秘感。 Submitted on 2021-01-28 04:49:45
Question: I have a corpus consisting of 3,500,000 text documents, and I want to construct a tf-idf matrix of size 3,500,000 × 5,000, where the 5,000 columns are pre-specified features (words). I am using scikit-learn in Python, specifically TfidfVectorizer. I have constructed a dictionary of size 5,000 (one entry per feature) and pass it as the vocabulary parameter when initializing the TfidfVectorizer. But when calling fit_transform, it is showing some memory-map and …
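A minimal sketch of the fixed-vocabulary setup (the vocabulary list and corpus here are tiny hypothetical stand-ins for the poster's 5,000 words and 3.5M documents):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical pre-specified vocabulary; in the question this
    # would be the list of 5,000 feature words.
    vocabulary = ["document", "example", "corpus"]

    vectorizer = TfidfVectorizer(vocabulary=vocabulary)

    # Stand-in corpus; in practice this can be any iterable of texts,
    # e.g. a generator that streams documents from disk.
    corpus = ["a tiny example document", "another document in the corpus"]

    # fit_transform returns a scipy.sparse CSR matrix of shape
    # (n_documents, len(vocabulary)); the dense 3.5M x 5k array is
    # never materialized, so memory scales with the nonzero entries.
    X = vectorizer.fit_transform(corpus)
    print(X.shape, X.nnz)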

How to split sparse matrix into train and test sets?

[亡魂溺海] Submitted on 2021-01-28 04:09:22
Question: I want to understand how to work with sparse matrices. I have this code to generate a multi-label classification data set as a sparse matrix:

    from sklearn.datasets import make_multilabel_classification

    X, y = make_multilabel_classification(sparse=True, n_labels=20,
                                          return_indicator='sparse',
                                          allow_unlabeled=False)

This code gives me X in the following format:

    <100x20 sparse matrix of type '<class 'numpy.float64'>'
        with 1797 stored elements in Compressed Sparse Row format>

and y:

    <100x5 …
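A minimal sketch, assuming the X and y from the question: sklearn.model_selection.train_test_split indexes scipy sparse matrices row-wise, so it can split them directly and both pieces stay sparse:

    from sklearn.datasets import make_multilabel_classification
    from sklearn.model_selection import train_test_split

    X, y = make_multilabel_classification(sparse=True, n_labels=20,
                                          return_indicator='sparse',
                                          allow_unlabeled=False)

    # train_test_split slices the rows of sparse inputs directly;
    # no conversion to dense arrays is needed.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    print(type(X_train), X_train.shape, X_test.shape)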

How do I restrict the number of processors used by the ridge regression model in sklearn?

坚强是说给别人听的谎言 Submitted on 2021-01-28 03:21:59
Question: I want to make a fair comparison between different machine learning models. However, I find that the ridge regression model automatically uses multiple processors, and there is no parameter (such as n_jobs) with which I can restrict the number of processors used. Is there any way to solve this problem? A minimal example:

    from sklearn.datasets import make_regression
    from sklearn.linear_model import RidgeCV

    features, target = make_regression(n_samples=10000, n_features=1000)
    r = RidgeCV()
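The parallelism here comes from the underlying BLAS/OpenMP libraries rather than joblib, which is why there is no n_jobs parameter to set. A sketch using the threadpoolctl package (an assumption: the question does not mention it, but it is the thread-limiting mechanism scikit-learn itself relies on):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import RidgeCV
    from threadpoolctl import threadpool_limits

    features, target = make_regression(n_samples=10000, n_features=1000)

    # Cap the BLAS/OpenMP thread pools for everything inside the block,
    # so the linear algebra behind RidgeCV runs on a single core.
    with threadpool_limits(limits=1):
        r = RidgeCV().fit(features, target)

    print(r.alpha_)

Setting an environment variable such as OMP_NUM_THREADS=1 before launching Python has a similar effect, but applies to the whole process rather than one block.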

Feature importance in logistic regression with bagging classifier

做~自己de王妃 Submitted on 2021-01-28 00:10:27
Question: I am working on a binary classification problem in which I use logistic regression inside a bagging classifier. A few lines of code follow:

    model = BaggingClassifier(LogisticRegression(), n_estimators=10,
                              bootstrap=True, random_state=1)
    model.fit(X, y, sample_weight=sample_weights)

I am interested in a feature-importance metric for this model. How can this be done when the base estimator of the bagging classifier is logistic regression? I am able to get the feature importance when a decision tree is used …
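A common workaround (an assumption, not an official BaggingClassifier API): each fitted LogisticRegression exposes coef_, so the absolute coefficients can be averaged over the ensemble. With the default max_features=1.0 every estimator sees all features, which keeps the columns aligned:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, n_features=10, random_state=1)

    model = BaggingClassifier(LogisticRegression(max_iter=1000),
                              n_estimators=10, bootstrap=True, random_state=1)
    model.fit(X, y)

    # Average the absolute logistic-regression coefficients across the
    # bagged estimators as a rough importance score. This treats
    # coefficient magnitude as importance, which assumes the features
    # are on comparable scales (standardize them first otherwise).
    importances = np.mean(
        [np.abs(est.coef_[0]) for est in model.estimators_], axis=0)
    print(importances)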

GridSearchCV output problems in Scikit-learn

不问归期 Submitted on 2021-01-27 20:57:03
Question: I'd like to perform a hyperparameter search that selects both preprocessing steps and models in sklearn, as follows:

    pipeline = Pipeline([("combiner", PolynomialFeatures()),
                         ("dimred", PCA()),
                         ("classifier", RandomForestClassifier())])

    parameters = [
        {"combiner": [None]},
        {"combiner": [PolynomialFeatures()], "combiner__degree": [2],
         "combiner__interaction_only": [False, True]},
        {"dimred": [None]},
        {"dimred": [PCA()], "dimred__n_components": [.95, .75]},
        {"classifier": [RandomForestClassifier(n…
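For reference, a minimal self-contained sketch of this pattern (the grid values are illustrative, not the poster's full, truncated grid):

    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import PolynomialFeatures

    X, y = make_classification(n_samples=200, n_features=8, random_state=0)

    pipeline = Pipeline([("combiner", PolynomialFeatures()),
                         ("dimred", PCA()),
                         ("classifier", RandomForestClassifier())])

    # A list of dicts makes GridSearchCV treat each dict as a separate
    # sub-grid; setting a step to [None] disables that step entirely.
    parameters = [
        {"combiner": [None]},
        {"combiner__degree": [2], "combiner__interaction_only": [False, True]},
        {"dimred": [None]},
        {"dimred__n_components": [0.95, 0.75]},
    ]

    search = GridSearchCV(pipeline, parameters, cv=3).fit(X, y)
    print(search.best_params_)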

Exception in thread QueueManagerThread - scikit-learn

荒凉一梦 Submitted on 2021-01-27 18:50:19
Question: When I set n_jobs=-1 I get an error, and likewise when I set n_jobs to a large value (e.g. n_jobs=100); with a smaller value (e.g. n_jobs=32) it works fine. I've tried reinstalling the scikit-learn and joblib packages, but to no avail. Also, n_jobs=-1 worked fine previously, then suddenly went wrong.

    from sklearn import datasets
    from sklearn.model_selection import cross_validate, StratifiedKFold
    from sklearn.linear_model import RidgeClassifier

    iris = datasets.load_iris()
    iris_X = iris.data
    iris_y = iris.target
    skf …
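A defensive sketch (an assumption about the cause: joblib's loky backend failing when asked for more worker processes than the machine can actually back), which caps n_jobs at an explicit value derived from the detected CPU count instead of using -1:

    import os

    from sklearn import datasets
    from sklearn.linear_model import RidgeClassifier
    from sklearn.model_selection import StratifiedKFold, cross_validate

    iris = datasets.load_iris()

    # Cap the worker count explicitly rather than relying on n_jobs=-1,
    # which spawns one loky process per detected core.
    n_jobs = min(8, os.cpu_count() or 1)

    scores = cross_validate(RidgeClassifier(), iris.data, iris.target,
                            cv=StratifiedKFold(n_splits=5), n_jobs=n_jobs)
    print(scores["test_score"])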

Scikit-learn how to check if model (e.g. TfidfVectorizer) has been already fit

▼魔方 西西 Submitted on 2021-01-27 13:40:25
Question: For feature extraction from text, how can I check whether a vectorizer (e.g. TfidfVectorizer or CountVectorizer) has already been fit on training data? In particular, I want the code to figure out automatically whether the vectorizer has already been fit:

    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer()

    def vectorize_data(texts):
        # if vectorizer has not been already fit
        vectorizer.fit_transform(texts)
        # else
        vectorizer.transform(texts)

Answer 1: You can use the check…
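One way to do this (an assumption about what the truncated answer refers to) is scikit-learn's check_is_fitted utility, which raises NotFittedError when an estimator has no fitted attributes yet:

    from sklearn.exceptions import NotFittedError
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.utils.validation import check_is_fitted

    vectorizer = TfidfVectorizer()

    def vectorize_data(texts):
        try:
            # Raises NotFittedError if the vectorizer has not been fit.
            check_is_fitted(vectorizer)
            return vectorizer.transform(texts)
        except NotFittedError:
            return vectorizer.fit_transform(texts)

    X1 = vectorize_data(["first pass fits the vocabulary"])      # fits
    X2 = vectorize_data(["vocabulary reused on second pass"])    # transforms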

How can I use logistic regression in sklearn for continuous but bounded dependent variable?

萝らか妹 Submitted on 2021-01-27 12:10:25
Question: How can I use logistic regression in sklearn for a continuous but bounded (0 <= y <= 1) dependent variable? If it's not possible in sklearn, with what library can I do it?

Answer 1: See the discussion here: https://scikit-learn-general.narkive.com/4dSCktaM/using-logistic-regression-on-a-continuous-target-variable There are two suggestions: either stop doing logistic regression on something that is not a binary target, or use statsmodels: https://www.statsmodels.org

Answer 2: It completely depends on your distribution of …
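A minimal sketch of the statsmodels route (an assumption: a fractional-response GLM with a Binomial family and its default logit link, which accepts continuous targets in [0, 1]; the data here is synthetic):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    # Synthetic fractional target squashed into (0, 1) by a logistic.
    linpred = X @ [0.5, -1.0, 2.0] + rng.normal(scale=0.3, size=200)
    y = 1.0 / (1.0 + np.exp(-linpred))

    # The Binomial family with a logit link fits continuous proportions
    # in [0, 1] (the "fractional logit" model).
    model = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial())
    result = model.fit()
    print(result.params)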