scikit-learn

Grid Search with Recursive Feature Elimination in scikit-learn pipeline returns an error

早过忘川 submitted on 2021-02-17 19:08:41
Question: I am trying to chain Grid Search and Recursive Feature Elimination in a Pipeline using scikit-learn. GridSearchCV and RFE with a "bare" classifier work fine:

from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
est = SVR(kernel="linear")
selector = feature_selection.RFE(est)
param_grid = dict(estimator__C=[0.1, 1, 10])
clf =
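For context, a minimal sketch of how this chaining is usually written on a modern scikit-learn, where GridSearchCV lives in sklearn.model_selection rather than the old sklearn.grid_search; the step name "rfe" and the 3-fold cv are illustrative choices, not from the question:

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

pipe = Pipeline([("rfe", RFE(SVR(kernel="linear")))])

# Pipeline parameters are addressed as <step>__<param>, so the C of
# the SVR wrapped by RFE is reached via rfe__estimator__C
param_grid = {"rfe__estimator__C": [0.1, 1, 10]}

clf = GridSearchCV(pipe, param_grid=param_grid, cv=3)
clf.fit(X, y)
print(clf.best_params_)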

sklearn LabelBinarizer returns vector when there are 2 classes

青春壹個敷衍的年華 submitted on 2021-02-17 12:45:16
Question: The following code: from sklearn.preprocessing import LabelBinarizer lb = LabelBinarizer() lb.fit_transform(['yes', 'no', 'no', 'yes']) returns: array([[1], [0], [0], [1]]) However, I would like there to be one column per class: array([[1, 0], [0, 1], [0, 1], [1, 0]]) (I need the data in this format so I can feed it to a neural network that uses the softmax function at the output layer.) When there are more than 2 classes, LabelBinarizer behaves as desired: from sklearn.preprocessing
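A common workaround, sketched minimally for the two-class case: stack the single binarized column with its complement so there is one column per class, in the order given by lb.classes_:

import numpy as np
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
y = lb.fit_transform(['yes', 'no', 'no', 'yes'])  # shape (4, 1) for two classes
if y.shape[1] == 1:
    # column order follows lb.classes_, here ['no', 'yes']
    y = np.hstack([1 - y, y])
print(y)
# [[0 1]
#  [1 0]
#  [1 0]
#  [0 1]]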

What is the difference between LinearSVC and SVC(kernel="linear")?

孤人 submitted on 2021-02-17 10:46:17
Question: I found sklearn.svm.LinearSVC and sklearn.svm.SVC(kernel='linear'), and they seem very similar to me, but I get very different results on Reuters:

sklearn.svm.LinearSVC: 81.05% in 28.87s train / 9.71s test
sklearn.svm.SVC: 33.55% in 6536.53s train / 2418.62s test

Both have a linear kernel. The tolerance of LinearSVC is higher than that of SVC:

LinearSVC(C=1.0, tol=0.0001, max_iter=1000, penalty='l2', loss='squared_hinge', dual=True, multi_class='ovr', fit_intercept=True, intercept
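The gap is usually attributed to real implementation differences rather than tolerance alone: LinearSVC (liblinear) minimizes squared hinge loss with one-vs-rest multiclass handling, while SVC(kernel='linear') (libsvm) minimizes hinge loss with one-vs-one. A minimal sketch contrasting the two (iris is a stand-in dataset, not the Reuters data from the question):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# same C, "same" linear kernel, but different losses and multiclass schemes
lin = LinearSVC(C=1.0, max_iter=10000).fit(X_tr, y_tr)
svc = SVC(kernel="linear", C=1.0).fit(X_tr, y_tr)
print("LinearSVC:  ", lin.score(X_te, y_te))
print("SVC(linear):", svc.score(X_te, y_te))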

Getting started with machine learning in Python

独自空忆成欢 submitted on 2021-02-17 08:58:09
Fun beginner machine learning projects (with tutorials and data). No theory can substitute for practice: textbooks and courses teach you the basic principles, but when you try to apply them you will find the concrete steps surprisingly hard. Projects therefore build your applied machine learning skills, and they also add weight to your résumé when job hunting. The goal of this project is to apply off-the-shelf models to different datasets. First, you pick a model for each problem by intuition, then test in practice whether the model is robust to missing data and which kinds of categorical features it can handle. Second, the project teaches you to design an initial model quickly: in real work we usually start with a simple model to establish a baseline and improve performance step by step, rather than getting everything right in one shot. Finally, the exercise helps you master the modeling workflow. The general steps for handling a machine learning problem are listed below (a runnable sketch of these steps follows this entry):

Import the data
Clean the data
Split the dataset into train/test or cross-validation sets
Preprocess
Transform
Engineer features

Because you use off-the-shelf models, you have more room to focus on these key steps, and the tutorials below let you practice regression, classification, and clustering algorithms. First, the data sources used in this project:

UCI Machine Learning Repository: 350+ searchable datasets covering almost every topic. http://archive.ics.uci.edu/ml/
Kaggle Datasets: 100+ datasets from the Kaggle community. https://www.kaggle.com/datasets
Data.gov: open datasets published by the US government. https://www
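As a minimal sketch of the steps above, assuming the built-in iris dataset as a stand-in for a CSV downloaded from UCI, Kaggle, or Data.gov (the model choice is illustrative):

# Import the data (iris stands in for a downloaded dataset)
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
# Cleaning would go here (drop or impute missing values, fix types)

# Split the dataset into train/test sets
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Preprocess/transform inside a pipeline so the test set is never
# touched during fitting; feature engineering would be added as steps
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))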

Basic usage of sklearn SVM

拟墨画扇 submitted on 2021-02-17 07:55:48
Basic usage of SVM. SVM performs well on classification problems. Well-known packages include libsvm (which supports multiple kernel functions) and liblinear; the Python machine learning library scikit-learn also provides SVM algorithms, with sklearn.svm.SVC and sklearn.svm.LinearSVC built on libsvm and liblinear respectively.

The recommended steps for using SVM are (a complete sketch follows below):

Convert the raw data into a format the SVM software or package can read;
Standardize the data (features with widely differing numeric scales can hurt classifier performance);
If you don't know which kernel function to use, consider RBF;
Find the optimal parameters (C, γ) via cross-validated grid search (cross-validation guards against overfitting; grid search scans a specified range for the optimal parameters);
Train the model with the optimal parameters;
Test.

The following scikit-learn code illustrates these steps:

import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, train_test_split

def load_data(filename):
    '''
    Assume this is the iris data; the CSV format is:
    0,5.1,3.5,1.4
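A minimal sketch of the standardize-then-grid-search recipe above, using the built-in iris data in place of the post's CSV; the parameter grid values are illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Standardize inside a pipeline so the scaler is fit on training folds only
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Cross-validated grid search over (C, gamma)
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}
clf = GridSearchCV(pipe, param_grid=param_grid, cv=5)
clf.fit(X_tr, y_tr)

# Train with the best parameters (refit is automatic), then test
print("best params:", clf.best_params_)
print("test score:", clf.score(X_te, y_te))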

SVM in sklearn

这一生的挚爱 submitted on 2021-02-17 07:26:10
The SVM classes in scikit-learn fall into two groups. One group does classification and contains three classes: SVC, NuSVC, and LinearSVC. The other does regression and also contains three classes: SVR, NuSVR, and LinearSVR. All of them live in the sklearn.svm module.

Of the three classification classes, SVC and NuSVC are nearly the same, differing only in how they measure the loss. LinearSVC, as the name suggests, is a linear classifier: it supports only a linear kernel, with no low-to-high-dimensional kernel mappings, so it cannot be used on linearly inseparable data. Likewise, among the three regression classes, SVR and NuSVR differ only in how they measure the loss, and LinearSVR is linear regression that supports only a linear kernel.

When using these classes, if experience tells us the data can be fit linearly, use LinearSVC for classification or LinearSVR for regression: they need no slow tuning over kernel choices and kernel parameters, and they run fast. If we have no prior knowledge of how the data is distributed
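A minimal sketch of that rule of thumb; the moons dataset and the parameter values are illustrative, not from the post:

# If the data is (roughly) linearly separable, LinearSVC is the fast
# choice; otherwise fall back to SVC with a nonlinear kernel
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC, LinearSVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

linear = LinearSVC(max_iter=10000)            # linear kernel only
kernel = SVC(kernel="rbf", C=1.0, gamma=1.0)  # kernel SVM for nonlinear data

print("LinearSVC:", cross_val_score(linear, X, y, cv=5).mean())
print("SVC(rbf): ", cross_val_score(kernel, X, y, cv=5).mean())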

Print predict ValueError: Expected 2D array, got 1D array instead

ぃ、小莉子 submitted on 2021-02-17 07:19:29
Question: The error appears in my last two lines of code.

ValueError: Expected 2D array, got 1D array instead: array=[0 1]. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit
%matplotlib inline
df = pd.read_csv('.......csv')
df.drop(['Company'], 1, inplace=True)
x = pd.DataFrame(df.drop(['R&D Expense'],1))
y = pd.DataFrame(df['R&D
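The message itself names the fix: scikit-learn estimators expect a 2D array of shape (n_samples, n_features), so a single sample must be reshaped before predict. A minimal sketch; the fitted model and the feature values are placeholders, not from the question:

import numpy as np
from sklearn.linear_model import LinearRegression

# A stand-in model with two features, like the question's x
X = np.array([[0, 1], [1, 3], [2, 5], [3, 7]])
y = np.array([1.0, 2.0, 3.0, 4.0])
model = LinearRegression().fit(X, y)

sample = np.array([0, 1])          # 1D, shape (2,): predict raises ValueError
sample_2d = sample.reshape(1, -1)  # 2D, shape (1, 2): one sample, two features
print(model.predict(sample_2d))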

Random_state's contribution to accuracy

江枫思渺然 submitted on 2021-02-17 06:30:51
Question: Okay, this is interesting. I executed the same code a couple of times, and each time I got a different accuracy_score. I figured out that I was not using any random_state value while doing the train_test split, so I set random_state=0 and got a consistent accuracy_score of 82%. But then I thought to try a different random_state number: with random_state=128 the accuracy_score became 84%. Now I need to understand why that is and how random_state affects the accuracy of the model.
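In short, random_state only fixes which rows land in the training and test sets; different splits produce different test sets and hence different scores. A minimal sketch of the usual remedy, averaging over many random splits instead of trusting one; the dataset and model here are stand-ins:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# 10 random 75/25 splits instead of one: report the mean and the spread
cv = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print("mean accuracy:", scores.mean(), "+/-", scores.std())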

Predicting a new value with a model trained on one-hot encoded data

送分小仙女□ submitted on 2021-02-17 04:44:05
Question: This might look like a trivial problem, but I am getting stuck at predicting results from a model. My problem is this: I have a dataset of shape 1000 x 19 (excluding the target feature), but after one-hot encoding it becomes 1000 x 141. Since I trained the model on data of shape 1000 x 141, I need data of shape 1 x 141 (at least) for prediction. I also know that in Python I can make a prediction using model.predict(data). But since I am getting data from an end user through a
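A minimal sketch of the usual fix: fit the encoder once on the training data, keep it, and reuse the same fitted encoder on each user-supplied row, so the row always comes out with the same number of columns the model was trained on. The column names, values, and the commented-out model call are hypothetical:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"city": ["a", "b", "c"], "plan": ["x", "y", "x"]})
enc = OneHotEncoder(handle_unknown="ignore")  # unseen categories -> all zeros
X_train = enc.fit_transform(train)
print(X_train.shape)  # (3, 5): 3 city categories + 2 plan categories

# One new row from the end user, transformed with the SAME fitted encoder
new_row = pd.DataFrame({"city": ["b"], "plan": ["x"]})
X_new = enc.transform(new_row)
print(X_new.shape)    # (1, 5): matches the training columns
# model.predict(X_new)  # hypothetical model trained on X_train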