scikit-learn

Sklearn Gaussian Mixture lock parameters?

Submitted by 心不动则不痛 on 2021-02-10 15:47:27
Question: I'm trying to fit some Gaussians for which I already have a pretty good idea of the initial parameters (in this case I'm generating the distributions, so I should always be able to fit them). However, I can't seem to figure out how to force the mean to be e.g. 0 for both Gaussians. Is it possible? m.means_ = ... doesn't work. from sklearn import mixture import numpy as np import math import matplotlib.pyplot as plt from scipy import stats a = np.random.normal(0, 0.2, 500) b = np.random
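A common workaround, sketched here rather than taken from an official API: GaussianMixture has no built-in way to freeze parameters, but you can subclass it and re-pin the means after each M-step. `FixedMeansGMM` and its `fixed_means` argument are names invented for this sketch, it relies on the private `_m_step` hook, and note that covariances are estimated around the freely fitted means before the overwrite:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class FixedMeansGMM(GaussianMixture):
    """Sketch: a GaussianMixture whose component means are locked."""

    def __init__(self, n_components=2, fixed_means=None, random_state=None):
        # Also pass the locked means as the initialization.
        super().__init__(n_components=n_components,
                         means_init=fixed_means,
                         random_state=random_state)
        self.fixed_means = fixed_means

    def _m_step(self, X, log_resp):
        # Let the normal M-step run, then overwrite the estimated
        # means with the locked values (private API, may change).
        super()._m_step(X, log_resp)
        self.means_ = np.asarray(self.fixed_means, dtype=float)

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 0.2, 500),
                    rng.normal(0, 2.0, 500)]).reshape(-1, 1)

gmm = FixedMeansGMM(n_components=2,
                    fixed_means=[[0.0], [0.0]],
                    random_state=0).fit(X)
print(gmm.means_)  # both means stay pinned at 0
```

Only the weights and covariances are fitted; the means stay at the values you pass in.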

GridSearchCV on a working pipeline returns ValueError

Submitted by ぐ巨炮叔叔 on 2021-02-10 15:16:04
Question: I am using GridSearchCV to find the best parameters for my pipeline. The pipeline itself seems to work well, since I can run: pipeline.fit(X_train, y_train) preds = pipeline.predict(X_test) and get a decent result. But GridSearchCV obviously doesn't like something, and I cannot figure out what. My pipeline: feats = FeatureUnion([('age', age), ('education_num', education_num), ('is_education_favo', is_education_favo), ('is_marital_status_favo', is_marital_status_favo), ('hours_per_week', hours
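A frequent cause of this ValueError (assuming the usual "Invalid parameter" message) is that grid keys for a pipeline must be prefixed with the step name plus a double underscore. A minimal sketch with a simplified pipeline, standing in for the FeatureUnion setup above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

pipeline = Pipeline([("scale", StandardScaler()),
                     ("clf", LogisticRegression(max_iter=1000))])

# Keys must be "<step name>__<parameter>": a bare "C" raises
# ValueError: Invalid parameter C for estimator Pipeline(...).
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

`pipeline.get_params().keys()` lists every legal parameter name, which is a quick way to check spelling.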

How to handle missing values (NaN) in categorical data when using scikit-learn OneHotEncoder?

Submitted by 笑着哭i on 2021-02-10 15:10:07
Question: I have recently started learning Python to develop a predictive model for a research project using machine learning methods. I have a large dataset comprising both numerical and categorical data, with lots of missing values. I am currently trying to encode the categorical features using OneHotEncoder. From what I read about OneHotEncoder, my understanding was that for a missing value (NaN), OneHotEncoder would assign 0s to all of the feature's categories, as such: 0 Male 1 Female 2 NaN
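One widely used pattern: impute NaN to an explicit "missing" category before one-hot encoding, so missingness gets its own column rather than relying on OneHotEncoder's NaN handling (which varies by scikit-learn version). A minimal sketch:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"sex": ["Male", "Female", np.nan]})

# Replace NaN with the literal string "missing", then one-hot encode;
# "missing" becomes an ordinary category with its own indicator column.
encode = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

encoded = encode.fit_transform(X)
# OneHotEncoder may return a sparse matrix depending on version/settings.
dense = encoded.toarray() if hasattr(encoded, "toarray") else np.asarray(encoded)
print(dense)
print(encode.named_steps["onehot"].categories_)
```

Each row then has exactly one 1, and the NaN row activates the "missing" column instead of being all zeros.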

Getting probability of each new observation being an outlier when using scikit-learn OneClassSVM

Submitted by 大憨熊 on 2021-02-10 14:18:32
Question: I'm new to scikit-learn and to SVM methods in general. I've got my data set working well with scikit-learn's OneClassSVM for detecting outliers; I train the OneClassSVM using observations that are all 'inliers' and then use predict() to generate binary inlier/outlier predictions on my test set. However, to continue further with my analysis I'd like to get the probabilities associated with each new observation in my test set, e.g. the probability of being an outlier associated
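OneClassSVM exposes no predict_proba, but decision_function gives a continuous signed distance to the boundary that can be squashed into a rough pseudo-probability. A sketch, where the logistic scaling is an uncalibrated assumption rather than a proper Platt fit:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(200, 2))     # training set: inliers only
X_test = np.array([[0.0, 0.0],                # near the training cloud
                   [6.0, 6.0]])               # far outside it

clf = OneClassSVM(gamma="auto").fit(X_train)

# Signed distance to the learned boundary: negative = outlier side.
scores = clf.decision_function(X_test)

# Crude pseudo-probability of being an outlier: logistic squashing of
# the score. The unit scale here is an assumption, not a calibration.
pseudo_prob_outlier = 1.0 / (1.0 + np.exp(scores))
print(scores, pseudo_prob_outlier)
```

For something better calibrated, the standard trick is to fit a logistic regression on held-out scores versus known labels, which is essentially Platt scaling done by hand.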

Two Steps to Quickly Choose a Scikit-learn Machine Learning Model

Submitted by 蹲街弑〆低调 on 2021-02-10 14:15:36
Scikit-learn, Sklearn for short, is the most widely used open-source Python machine learning library. Built on NumPy and SciPy, it provides a large set of tools for data mining and for machine learning analysis and prediction, including data preprocessing, visualization, cross-validation, and many machine learning algorithms. Its models cover classification, regression, clustering, dimensionality reduction, and more.

Sklearn is a tool for solving real problems, but when facing a machine learning problem the hardest part is usually not a lack of tools; it is finding the right model for the specific project. Here are two cases. Case 1: your boss hands you the Taobao shop purchase records of 200,000 customers and asks you to predict each customer's lifetime value (LTV) over the coming year. Which Scikit-learn model do you use? Case 2: your boss hands you the app download, registration, usage, and uninstall records of 1,000 customers and asks you to predict how likely some of them are to churn within the next 3 months. Which Scikit-learn model do you use?

With so many machine learning models in Sklearn, how do you know which model suits which kind of data and which kind of problem?

Step 1: find the estimator-selection flowchart on the official Sklearn website. Start from your question and follow the chart downward, and you will arrive at an answer. (A Chinese version is provided here for easier reading.)

Step 2: walk your own case through the chart. Taking Case 1 as an example: starting from "START" in the chart above, we enter the "more than 50 samples" branch;
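Applying the flowchart logic to both cases can be sketched with synthetic stand-in data (the datasets below are hypothetical placeholders, not the real purchase or app records): Case 2 predicts a category (churn / no churn) with far fewer than 100k samples, so the chart points at LinearSVC first; Case 1 predicts a continuous quantity (LTV), which lands in the regression branch, where Ridge is one of the chart's suggestions.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import Ridge
from sklearn.svm import LinearSVC

# Case 2: churn yes/no -> classification, < 100k samples -> LinearSVC.
X_churn, y_churn = make_classification(n_samples=1000, random_state=0)
clf = LinearSVC(max_iter=5000).fit(X_churn, y_churn)

# Case 1: LTV is a continuous amount -> regression branch -> Ridge.
X_ltv, y_ltv = make_regression(n_samples=1000, random_state=0)
reg = Ridge().fit(X_ltv, y_ltv)

print(clf.score(X_churn, y_churn), reg.score(X_ltv, y_ltv))
```

The point is not these two specific estimators but the branching: "predicting a category vs. a quantity" is the first real fork in the chart.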

How do you estimate the performance of a classifier on test data?

Submitted by 余生长醉 on 2021-02-10 12:36:43
Question: I'm using scikit-learn to build a supervised classifier, and I am currently tuning it to give good accuracy on the labeled data. But how do I estimate how well it does on the test data (unlabeled)? Also, how do I find out whether I'm starting to overfit the classifier? Answer 1: You can't score your method on unlabeled data, because you need to know the right answers. In order to evaluate a method you should split your training set into (new) train and test sets (via sklearn.cross_validation.train_test_split, for example
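A minimal sketch of both points on synthetic data: hold out part of the labeled data to estimate generalization, and compare training accuracy against held-out and cross-validated accuracy; a large gap is the classic overfitting signal. (Note that sklearn.cross_validation mentioned in the answer is the old module name; these helpers now live in sklearn.model_selection.)

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

# An unpruned tree memorizes the training data, which makes the
# train-vs-held-out gap easy to see.
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = clf.score(X_tr, y_tr)              # accuracy on seen data
test_acc = clf.score(X_te, y_te)               # estimate on unseen data
cv_acc = cross_val_score(clf, X_tr, y_tr, cv=5).mean()

print(train_acc, test_acc, cv_acc)
```

If train_acc is near 1.0 while test_acc and cv_acc lag well behind, the model is overfitting; regularizing (e.g. limiting tree depth) should close the gap.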

Using custom estimator with cross_val_score fails

Submitted by |▌冷眼眸甩不掉的悲伤 on 2021-02-10 12:29:09
Question: I am trying to use cross_val_score with a customized estimator. It is important that this estimator receives a member variable which can be used later inside the fit function. But it seems that inside cross_val_score the member variables are destroyed (or a new instance of the estimator is being created). Here is the minimal code which reproduces the error: from sklearn.model_selection import cross_val_score from sklearn.base import BaseEstimator class MyEstimator(BaseEstimator): def __init__
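The usual culprit: cross_val_score clones the estimator for each fold, and clone() rebuilds it from get_params(), so __init__ must store every constructor argument under an attribute of the same name, unmodified. A minimal working sketch (MyEstimator's `member` and its fit logic are invented for illustration):

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.model_selection import cross_val_score

class MyEstimator(BaseEstimator):
    def __init__(self, member=None):
        # clone() re-creates the estimator from get_params(), so the
        # argument must be stored as-is under the same name; renaming
        # or transforming it here is what makes the member "disappear".
        self.member = member

    def fit(self, X, y):
        # Learned state goes in trailing-underscore attributes.
        self.mean_ = np.mean(y) + (self.member or 0.0)
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

    def score(self, X, y):
        return -np.mean((self.predict(X) - y) ** 2)  # negative MSE

X = np.arange(20).reshape(-1, 1)
y = np.arange(20, dtype=float)
scores = cross_val_score(MyEstimator(member=0.5), X, y, cv=4)
print(scores)  # member survives cloning in every fold
```

If __init__ did `self._member = member` or `self.member = member * 2`, clone() could no longer reconstruct an equivalent instance and the value would be lost or wrong inside cross-validation.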

'PolynomialFeatures' object has no attribute 'predict'

Submitted by 跟風遠走 on 2021-02-10 11:55:51
Question: I want to apply k-fold cross-validation to the following regression models: Linear Regression, Polynomial Regression, Support Vector Regression, Decision Tree Regression, Random Forest Regression. I am able to apply k-fold cross-validation to all of them except polynomial regression, which gives me the error 'PolynomialFeatures' object has no attribute 'predict'. How can I work around this issue? Also, am I doing the job correctly? My main motive is to see which model performs better, so is there
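PolynomialFeatures is only a transformer, so it has no predict method; the standard fix is to chain it with a regressor in a pipeline, which cross-validation can then treat as a single estimator. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_regression(n_samples=100, n_features=1, noise=5,
                       random_state=0)

# The pipeline expands features polynomially, then fits a linear model;
# as a whole it has fit() and predict(), so cross_val_score accepts it.
poly_reg = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())

scores = cross_val_score(poly_reg, X, y, cv=5, scoring="r2")
print(scores.mean())
```

The same cross_val_score call then works uniformly across all five model types, which makes the "which model performs better" comparison straightforward.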
