scikit-learn

Notes on "Python Machine Learning" (Part 3)

♀尐吖头ヾ submitted on 2020-12-05 20:43:06
Implementing machine learning classification algorithms with scikit-learn

Choosing a classification algorithm

The "no free lunch" theorem: no single classifier performs well across all possible application scenarios. In practice, only by comparing the performance of several learning algorithms can we select the model best suited to a given problem, because models behave differently depending on the number of features and samples, the amount of noise in the dataset, and whether the classes are linearly separable. In short, a classifier's performance, computational cost, and predictive power all depend heavily on the data used to train the model. Training a machine learning algorithm involves five main steps:

1. Select features.
2. Choose a performance metric.
3. Choose a classifier and its optimization algorithm.
4. Evaluate the model's performance.
5. Tune the algorithm.

First steps with scikit-learn

Training a perceptron with scikit-learn:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import datasets
    from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import Perceptron
    from sklearn.metrics import accuracy_score

    iris = …
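For reference, a sketch of how this example presumably continues, following the book's standard Iris workflow; the chosen feature columns, split ratio, and perceptron hyperparameters here are illustrative rather than quoted from the book:

    import numpy as np
    from sklearn import datasets
    from sklearn.linear_model import Perceptron
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # Load Iris and keep petal length and petal width as the two features
    iris = datasets.load_iris()
    X = iris.data[:, [2, 3]]
    y = iris.target

    # Hold out 30% of the samples for testing
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    # Standardize the features, fitting the scaler on the training set only
    sc = StandardScaler()
    sc.fit(X_train)
    X_train_std = sc.transform(X_train)
    X_test_std = sc.transform(X_test)

    # Train the perceptron and evaluate it on the held-out test set
    ppn = Perceptron(max_iter=40, eta0=0.1, random_state=0)
    ppn.fit(X_train_std, y_train)
    y_pred = ppn.predict(X_test_std)
    print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))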

How to correctly perform cross validation in scikit-learn?

蓝咒 submitted on 2020-12-05 12:25:51
Question: I am trying to cross-validate a k-NN classifier and I am confused about which of the two methods below performs cross validation correctly.

    training_scores = defaultdict(list)
    validation_f1_scores = defaultdict(list)
    validation_precision_scores = defaultdict(list)
    validation_recall_scores = defaultdict(list)
    validation_scores = defaultdict(list)

    def model_1(seed, X, Y):
        np.random.seed(seed)
        scoring = ['accuracy', 'f1_macro', 'precision_macro', 'recall_macro']
        model = …
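For reference, a minimal sketch of the approach usually recommended for this kind of question: let cross_validate handle the splitting and refitting, and request all four metrics at once. The k-NN settings and dataset are placeholders:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_validate
    from sklearn.neighbors import KNeighborsClassifier

    X, Y = load_iris(return_X_y=True)

    # cross_validate performs the train/validation splitting internally,
    # refitting the classifier from scratch on each of the cv folds, so
    # no information leaks between folds.
    scoring = ['accuracy', 'f1_macro', 'precision_macro', 'recall_macro']
    model = KNeighborsClassifier(n_neighbors=5)
    scores = cross_validate(model, X, Y, cv=5, scoring=scoring)

    # Results come back as 'test_<metric>' arrays, one entry per fold
    for metric in scoring:
        print(metric, scores['test_' + metric].mean())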

Kernel ridge and simple Ridge with Polynomial features

☆樱花仙子☆ submitted on 2020-12-05 12:15:37
Question: What is the difference between Kernel Ridge (from sklearn.kernel_ridge) with a polynomial kernel and using PolynomialFeatures + Ridge (from sklearn.linear_model)?

Answer 1: The difference lies in how the features are computed. PolynomialFeatures explicitly computes polynomial combinations of the input features up to the desired degree, while KernelRidge(kernel='poly') only considers a polynomial kernel (a polynomial representation of feature dot products), which will be expressed in terms of the original …
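A minimal sketch contrasting the two approaches on toy data; the degree and regularization values are arbitrary choices for illustration:

    import numpy as np
    from sklearn.kernel_ridge import KernelRidge
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.RandomState(0)
    X = rng.uniform(-1, 1, size=(100, 1))
    y = np.sin(3 * X).ravel() + 0.1 * rng.randn(100)

    # Explicit route: materialize all monomials up to degree 3, then fit
    # an ordinary ridge regression in that expanded feature space.
    explicit = make_pipeline(PolynomialFeatures(degree=3), Ridge(alpha=1.0))
    explicit.fit(X, y)

    # Kernel route: never materializes the expanded features; works with
    # the polynomial kernel (gamma * <x, x'> + coef0) ** degree instead.
    kernel = KernelRidge(kernel='poly', degree=3, alpha=1.0, gamma=1.0, coef0=1.0)
    kernel.fit(X, y)

    # The fits are similar but generally not identical: the kernel's
    # implicit feature map weights the monomials differently than the
    # explicit expansion, so the regularization acts differently too.
    print(explicit.predict(X[:3]))
    print(kernel.predict(X[:3]))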

ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT

巧了我就是萌 submitted on 2020-12-03 07:26:12
Question: I have a dataset consisting of both numeric and categorical data, and I want to predict adverse outcomes for patients based on their medical characteristics. I defined a prediction pipeline for my dataset like so:

    X = dataset.drop(columns=['target'])
    y = dataset['target']

    # define categorical and numeric transformers
    numeric_transformer = Pipeline(steps=[
        ('knnImputer', KNNImputer(n_neighbors=2, weights="uniform")),
        ('scaler', StandardScaler())])
    categorical_transformer = Pipeline(steps=[(…
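The warning itself usually means the lbfgs solver ran out of iterations before converging. Below is a sketch of the two standard remedies, scaling the numeric features and raising max_iter, built around the excerpt's pipeline; the column names, toy data, and LogisticRegression as the final estimator are assumptions, not details from the question:

    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import KNNImputer, SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Hypothetical stand-in for the question's patient dataset
    dataset = pd.DataFrame({
        'age': [34, 51, np.nan, 47, 62, 29],
        'blood_pressure': [120, 140, 135, np.nan, 150, 118],
        'sex': ['M', 'F', 'F', 'M', 'F', 'M'],
        'smoker': ['yes', 'no', np.nan, 'yes', 'no', 'no'],
        'target': [0, 1, 1, 0, 1, 0]})
    X = dataset.drop(columns=['target'])
    y = dataset['target']

    numeric_features = ['age', 'blood_pressure']
    categorical_features = ['sex', 'smoker']

    numeric_transformer = Pipeline(steps=[
        ('knnImputer', KNNImputer(n_neighbors=2, weights='uniform')),
        # Scaling is the usual first fix: lbfgs converges poorly when
        # features sit on very different scales.
        ('scaler', StandardScaler())])

    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])

    preprocessor = ColumnTransformer(transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

    # If scaled data still triggers the warning, give the solver more
    # iterations than the default 100 via max_iter.
    clf = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', LogisticRegression(max_iter=1000))])
    clf.fit(X, y)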

Machine Learning | A simple hands-on machine learning exercise: Boston house price prediction

旧时模样 submitted on 2020-12-02 16:37:21
This article uses the Boston HousePrice dataset from Kaggle to demonstrate the typical process of building a machine learning model, covering the following stages: data acquisition, data cleaning, exploratory data analysis, feature engineering, model building, and model ensembling. The target variable (house price) was log-transformed so that it approximately follows a normal distribution. From twelve candidate models, the six with the best predictive performance (Lasso, Ridge, SVR, KernelRidge, ElasticNet, BayesianRidge) were selected and combined via both weighted averaging and stacking; stacking performed better. One novel twist was to append the stacked predictions to the original training set and retrain the stacking ensemble, which improved performance further; this retrained model served as the final predictor, and its predictions scored well when submitted to Kaggle. Owing to limited training time, the hyperparameter search space was small and leaves room for improvement. The stacking step is sketched below. Data acquisition: Kaggle provides a large number of machine learning datasets; this article uses the Boston HousePrice dataset, available at https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data. The download contains four files: train.csv, test.csv, data_description.txt, and sample_submission.csv. As the names suggest, train.csv is the training set used to fit the model, and test…
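The stacking step described above might look like the following sketch, using scikit-learn's StackingRegressor with the six model families the article names; the synthetic data, log transform, and hyperparameters are placeholder assumptions, not the article's tuned values:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import StackingRegressor
    from sklearn.kernel_ridge import KernelRidge
    from sklearn.linear_model import BayesianRidge, ElasticNet, Lasso, Ridge
    from sklearn.svm import SVR

    # Synthetic stand-in for the Kaggle training data
    X, y = make_regression(n_samples=200, n_features=20, noise=10, random_state=0)
    y = np.log1p(y - y.min())  # mimic the article's log transform of the target

    base_models = [
        ('lasso', Lasso(alpha=0.001)),
        ('ridge', Ridge(alpha=1.0)),
        ('svr', SVR(C=1.0)),
        ('kridge', KernelRidge(alpha=0.5, kernel='polynomial', degree=2)),
        ('enet', ElasticNet(alpha=0.001, l1_ratio=0.5)),
        ('bayes', BayesianRidge()),
    ]

    # Each base model's out-of-fold predictions become input features
    # for the final estimator, which learns how to combine them.
    stack = StackingRegressor(estimators=base_models, final_estimator=Ridge(), cv=5)
    stack.fit(X, y)
    print(stack.predict(X[:5]))

The article's extra twist, appending the stacked predictions to the training set and retraining, would sit on top of this basic step.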

scikit-learn

感情迁移 submitted on 2020-12-02 08:21:30
scikit-learn is a machine learning toolkit built on the Python language.

- Simple and efficient tools for data mining and data analysis
- Reusable in a variety of contexts
- Built on NumPy, SciPy, and matplotlib
- Open source and commercially usable under the BSD license

Machine learning problems:

Supervised learning: the data comes with additional attributes that we want to predict (the relevant attributes are known for the training data).

1. Classification: samples belong to two or more classes; we train on labeled data and predict the classes of unlabeled data. Put another way, the output is discrete, and we want to label each sample with the correct class.
2. Regression: used when the desired output consists of one or more continuous variables, for example predicting a person's weight as a function of their height.

A short sketch contrasting the two settings follows this entry.

Source: oschina. Link: https://my.oschina.net/u/3955849/blog/2997421
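A small sketch of the two problem types using scikit-learn's built-in datasets:

    from sklearn.datasets import load_diabetes, load_iris
    from sklearn.linear_model import LinearRegression, LogisticRegression

    # Classification: the target is a discrete class label
    X_c, y_c = load_iris(return_X_y=True)
    clf = LogisticRegression(max_iter=1000).fit(X_c, y_c)
    print(clf.predict(X_c[:2]))  # predicted class labels

    # Regression: the target is a continuous value
    X_r, y_r = load_diabetes(return_X_y=True)
    reg = LinearRegression().fit(X_r, y_r)
    print(reg.predict(X_r[:2]))  # predicted continuous values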

Sklearn pass fit() parameters to xgboost in pipeline

本小妞迷上赌 submitted on 2020-12-02 07:29:48
Question: Similar to "How to pass a parameter to only one part of a pipeline object in scikit learn?", I want to pass parameters to only one part of a pipeline. Usually, it should work fine like:

    estimator = XGBClassifier()
    pipeline = Pipeline([
        ('clf', estimator)
    ])

and be executed like:

    pipeline.fit(X_train, y_train, clf__early_stopping_rounds=20)

but it fails with:

    /usr/local/lib/python3.5/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
        114 """
        115 Xt, yt, fit_params = self._pre…
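A sketch of the usual working pattern, assuming an xgboost release where early_stopping_rounds is still accepted by fit() (recent releases moved it to the XGBClassifier constructor); the dataset is synthetic:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    pipeline = Pipeline([('clf', XGBClassifier())])

    # The clf__ prefix routes each keyword argument to the fit() of the
    # pipeline step named 'clf'; early stopping also needs an eval_set
    # to monitor. Note that the eval_set is not transformed by earlier
    # pipeline steps, which is harmless here since 'clf' is the only step.
    pipeline.fit(
        X_train, y_train,
        clf__early_stopping_rounds=20,
        clf__eval_set=[(X_val, y_val)],
        clf__verbose=False)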

CountVectorizer: Vocabulary wasn't fitted

♀尐吖头ヾ submitted on 2020-12-02 05:57:31
Question: I instantiated a sklearn.feature_extraction.text.CountVectorizer object by passing a vocabulary through the vocabulary argument, but I get the error message "sklearn.utils.validation.NotFittedError: CountVectorizer - Vocabulary wasn't fitted." Why?

Example:

    import sklearn.feature_extraction
    import numpy as np
    import pickle

    # Save the vocabulary
    ngram_size = 1
    dictionary_filepath = 'my_unigram_dictionary'
    vectorizer = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size, ngram…
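A common cause reported for this error is persisting the vocabulary constructor parameter (often None) instead of the learned vocabulary_ attribute, with its trailing underscore. A minimal sketch of the working round trip, with a toy corpus standing in for the real data:

    import pickle
    from sklearn.feature_extraction.text import CountVectorizer

    ngram_size = 1
    dictionary_filepath = 'my_unigram_dictionary'

    # Fit on some corpus so a vocabulary is actually learned
    corpus = ['the cat sat', 'the dog sat']
    vectorizer = CountVectorizer(ngram_range=(ngram_size, ngram_size))
    vectorizer.fit(corpus)

    # Persist the fitted vocabulary_ (a dict mapping term -> column index),
    # not the vocabulary constructor parameter, which may be None.
    with open(dictionary_filepath, 'wb') as f:
        pickle.dump(vectorizer.vocabulary_, f)

    # Later: load the dict and build a vectorizer with a fixed vocabulary;
    # transform() then works without refitting.
    with open(dictionary_filepath, 'rb') as f:
        vocabulary = pickle.load(f)
    new_vectorizer = CountVectorizer(ngram_range=(ngram_size, ngram_size),
                                     vocabulary=vocabulary)
    print(new_vectorizer.transform(['the cat and the dog']).toarray())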