scikit-learn

Grid Search with Recursive Feature Elimination in scikit-learn pipeline returns an error

早过忘川 submitted on 2021-02-17 19:08:41
Question: I am trying to chain Grid Search and Recursive Feature Elimination in a Pipeline using scikit-learn. GridSearchCV and RFE with a "bare" classifier work fine:

from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
est = SVR(kernel="linear")
selector = feature_selection.RFE(est)
param_grid = dict(estimator__C=[0.1, 1, 10])
clf =
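For context, a minimal sketch of how this chaining is usually written on a modern scikit-learn, where GridSearchCV lives in sklearn.model_selection rather than the old sklearn.grid_search; the step name "rfe" and the 3-fold cv are illustrative choices, not from the question:

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

pipe = Pipeline([("rfe", RFE(SVR(kernel="linear")))])

# Pipeline parameters are addressed as <step>__<param>, so the C of
# the SVR wrapped by RFE is reached via rfe__estimator__C
param_grid = {"rfe__estimator__C": [0.1, 1, 10]}

clf = GridSearchCV(pipe, param_grid=param_grid, cv=3)
clf.fit(X, y)
print(clf.best_params_)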

sklearn LabelBinarizer returns vector when there are 2 classes

青春壹個敷衍的年華 submitted on 2021-02-17 12:45:16
Question: The following code: from sklearn.preprocessing import LabelBinarizer lb = LabelBinarizer() lb.fit_transform(['yes', 'no', 'no', 'yes']) returns: array([[1], [0], [0], [1]]) However, I would like there to be one column per class: array([[1, 0], [0, 1], [0, 1], [1, 0]]) (I need the data in this format so I can feed it to a neural network that uses the softmax function at the output layer.) When there are more than 2 classes, LabelBinarizer behaves as desired: from sklearn.preprocessing
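A common workaround, sketched minimally for the two-class case: stack the single binarized column with its complement so there is one column per class, in the order given by lb.classes_:

import numpy as np
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
y = lb.fit_transform(['yes', 'no', 'no', 'yes'])  # shape (4, 1) for two classes
if y.shape[1] == 1:
    # column order follows lb.classes_, here ['no', 'yes']
    y = np.hstack([1 - y, y])
print(y)
# [[0 1]
#  [1 0]
#  [1 0]
#  [0 1]]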

What is the difference between LinearSVC and SVC(kernel="linear")?

孤人 submitted on 2021-02-17 10:46:17
Question: I found sklearn.svm.LinearSVC and sklearn.svm.SVC(kernel='linear'), and they seem very similar to me, but I get very different results on Reuters:

sklearn.svm.LinearSVC: 81.05% in 28.87s train / 9.71s test
sklearn.svm.SVC: 33.55% in 6536.53s train / 2418.62s test

Both have a linear kernel. The tolerance of LinearSVC is higher than that of SVC:

LinearSVC(C=1.0, tol=0.0001, max_iter=1000, penalty='l2', loss='squared_hinge', dual=True, multi_class='ovr', fit_intercept=True, intercept
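The gap is usually attributed to real implementation differences rather than tolerance alone: LinearSVC (liblinear) minimizes squared hinge loss with one-vs-rest multiclass handling, while SVC(kernel='linear') (libsvm) minimizes hinge loss with one-vs-one. A minimal sketch contrasting the two (iris is a stand-in dataset, not the Reuters data from the question):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# same C, "same" linear kernel, but different losses and multiclass schemes
lin = LinearSVC(C=1.0, max_iter=10000).fit(X_tr, y_tr)
svc = SVC(kernel="linear", C=1.0).fit(X_tr, y_tr)
print("LinearSVC:  ", lin.score(X_te, y_te))
print("SVC(linear):", svc.score(X_te, y_te))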

Getting started with machine learning in Python

独自空忆成欢 submitted on 2021-02-17 08:58:09
Fun beginner machine learning projects (with tutorials and data). No theory can substitute for practice: textbooks and courses teach you the basic principles, but when you try to apply them you will find the concrete steps surprisingly hard. Projects therefore build your applied machine learning skills, and they also add weight to your résumé when job hunting. The goal of this project is to apply off-the-shelf models to different datasets. First, you pick a model for each problem by intuition, then test in practice whether the model is robust to missing data and which kinds of categorical features it can handle. Second, the project teaches you to design an initial model quickly: in real work we usually start with a simple model to establish a baseline and improve performance step by step, rather than getting everything right in one shot. Finally, the exercise helps you master the modeling workflow. The general steps for handling a machine learning problem are listed below (a runnable sketch of these steps follows this entry):

Import the data
Clean the data
Split the dataset into train/test or cross-validation sets
Preprocess
Transform
Engineer features

Because you use off-the-shelf models, you have more room to focus on these key steps, and the tutorials below let you practice regression, classification, and clustering algorithms. First, the data sources used in this project:

UCI Machine Learning Repository: 350+ searchable datasets covering almost every topic. http://archive.ics.uci.edu/ml/
Kaggle Datasets: 100+ datasets from the Kaggle community. https://www.kaggle.com/datasets
Data.gov: open datasets published by the US government. https://www
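As a minimal sketch of the steps above, assuming the built-in iris dataset as a stand-in for a CSV downloaded from UCI, Kaggle, or Data.gov (the model choice is illustrative):

# Import the data (iris stands in for a downloaded dataset)
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
# Cleaning would go here (drop or impute missing values, fix types)

# Split the dataset into train/test sets
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Preprocess/transform inside a pipeline so the test set is never
# touched during fitting; feature engineering would be added as steps
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))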

Basic usage of sklearn SVM

拟墨画扇 submitted on 2021-02-17 07:55:48
Basic usage of SVM. SVM performs well on classification problems. Well-known packages include libsvm (which supports multiple kernel functions) and liblinear; the Python machine learning library scikit-learn also provides SVM algorithms, with sklearn.svm.SVC and sklearn.svm.LinearSVC built on libsvm and liblinear respectively.

The recommended steps for using SVM are (a complete sketch follows below):

Convert the raw data into a format the SVM software or package can read;
Standardize the data (features with widely differing numeric scales can hurt classifier performance);
If you don't know which kernel function to use, consider RBF;
Find the optimal parameters (C, γ) via cross-validated grid search (cross-validation guards against overfitting; grid search scans a specified range for the optimal parameters);
Train the model with the optimal parameters;
Test.

The following scikit-learn code illustrates these steps:

import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, train_test_split

def load_data(filename):
    '''
    Assume this is the iris data; the CSV format is:
    0,5.1,3.5,1.4
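A minimal sketch of the standardize-then-grid-search recipe above, using the built-in iris data in place of the post's CSV; the parameter grid values are illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Standardize inside a pipeline so the scaler is fit on training folds only
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Cross-validated grid search over (C, gamma)
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}
clf = GridSearchCV(pipe, param_grid=param_grid, cv=5)
clf.fit(X_tr, y_tr)

# Train with the best parameters (refit is automatic), then test
print("best params:", clf.best_params_)
print("test score:", clf.score(X_te, y_te))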

SVM in sklearn

这一生的挚爱 submitted on 2021-02-17 07:26:10
The SVM classes in scikit-learn fall into two groups. One group does classification and contains three classes: SVC, NuSVC, and LinearSVC. The other does regression and also contains three classes: SVR, NuSVR, and LinearSVR. All of them live in the sklearn.svm module.

Of the three classification classes, SVC and NuSVC are nearly the same, differing only in how they measure the loss. LinearSVC, as the name suggests, is a linear classifier: it supports only a linear kernel, with no low-to-high-dimensional kernel mappings, so it cannot be used on linearly inseparable data. Likewise, among the three regression classes, SVR and NuSVR differ only in how they measure the loss, and LinearSVR is linear regression that supports only a linear kernel.

When using these classes, if experience tells us the data can be fit linearly, use LinearSVC for classification or LinearSVR for regression: they need no slow tuning over kernel choices and kernel parameters, and they run fast. If we have no prior knowledge of how the data is distributed
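A minimal sketch of that rule of thumb; the moons dataset and the parameter values are illustrative, not from the post:

# If the data is (roughly) linearly separable, LinearSVC is the fast
# choice; otherwise fall back to SVC with a nonlinear kernel
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC, LinearSVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

linear = LinearSVC(max_iter=10000)            # linear kernel only
kernel = SVC(kernel="rbf", C=1.0, gamma=1.0)  # kernel SVM for nonlinear data

print("LinearSVC:", cross_val_score(linear, X, y, cv=5).mean())
print("SVC(rbf): ", cross_val_score(kernel, X, y, cv=5).mean())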

Print predict ValueError: Expected 2D array, got 1D array instead

ぃ、小莉子 submitted on 2021-02-17 07:19:29
Question: The error appears in my last two lines of code.

ValueError: Expected 2D array, got 1D array instead: array=[0 1]. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit
%matplotlib inline
df = pd.read_csv('.......csv')
df.drop(['Company'], 1, inplace=True)
x = pd.DataFrame(df.drop(['R&D Expense'],1))
y = pd.DataFrame(df['R&D
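The message itself names the fix: scikit-learn estimators expect a 2D array of shape (n_samples, n_features), so a single sample must be reshaped before predict. A minimal sketch; the fitted model and the feature values are placeholders, not from the question:

import numpy as np
from sklearn.linear_model import LinearRegression

# A stand-in model with two features, like the question's x
X = np.array([[0, 1], [1, 3], [2, 5], [3, 7]])
y = np.array([1.0, 2.0, 3.0, 4.0])
model = LinearRegression().fit(X, y)

sample = np.array([0, 1])          # 1D, shape (2,): predict raises ValueError
sample_2d = sample.reshape(1, -1)  # 2D, shape (1, 2): one sample, two features
print(model.predict(sample_2d))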

Random_state's contribution to accuracy

江枫思渺然 submitted on 2021-02-17 06:30:51
Question: Okay, this is interesting. I executed the same code a couple of times, and each time I got a different accuracy_score. I figured out that I was not using any random_state value while doing the train_test split, so I set random_state=0 and got a consistent accuracy_score of 82%. But then I thought to try a different random_state number: with random_state=128 the accuracy_score became 84%. Now I need to understand why that is and how random_state affects the accuracy of the model.
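In short, random_state only fixes which rows land in the training and test sets; different splits produce different test sets and hence different scores. A minimal sketch of the usual remedy, averaging over many random splits instead of trusting one; the dataset and model here are stand-ins:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# 10 random 75/25 splits instead of one: report the mean and the spread
cv = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print("mean accuracy:", scores.mean(), "+/-", scores.std())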

Predicting a new value with a model trained on one-hot encoded data

送分小仙女□ submitted on 2021-02-17 04:44:05
Question: This might look like a trivial problem, but I am getting stuck at predicting results from a model. My problem is this: I have a dataset of shape 1000 x 19 (excluding the target feature), but after one-hot encoding it becomes 1000 x 141. Since I trained the model on data of shape 1000 x 141, I need data of shape 1 x 141 (at least) for prediction. I also know that in Python I can make a prediction using model.predict(data). But since I am getting data from an end user through a
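A minimal sketch of the usual fix: fit the encoder once on the training data, keep it, and reuse the same fitted encoder on each user-supplied row, so the row always comes out with the same number of columns the model was trained on. The column names, values, and the commented-out model call are hypothetical:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"city": ["a", "b", "c"], "plan": ["x", "y", "x"]})
enc = OneHotEncoder(handle_unknown="ignore")  # unseen categories -> all zeros
X_train = enc.fit_transform(train)
print(X_train.shape)  # (3, 5): 3 city categories + 2 plan categories

# One new row from the end user, transformed with the SAME fitted encoder
new_row = pd.DataFrame({"city": ["b"], "plan": ["x"]})
X_new = enc.transform(new_row)
print(X_new.shape)    # (1, 5): matches the training columns
# model.predict(X_new)  # hypothetical model trained on X_train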