scikit-learn

Notes on "Python Machine Learning" (Part 3)

♀尐吖头ヾ submitted on 2020-12-05 20:43:06
Implementing machine learning classification algorithms with scikit-learn

Choosing a classification algorithm

The "no free lunch" theorem: no single classifier performs well across all possible application scenarios. In practice, only by comparing the performance of several learning algorithms can we select the model best suited to a given problem, because models behave differently depending on the number of features and samples, the amount of noise in the dataset, and whether the classes are linearly separable. In short, a classifier's performance, computational cost, and predictive power all depend heavily on the data used to train the model. Training a machine learning algorithm involves five main steps:

1. Select features.
2. Choose a performance metric.
3. Choose a classifier and its optimization algorithm.
4. Evaluate the model's performance.
5. Tune the algorithm.

First steps with scikit-learn

Training a perceptron with scikit-learn:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import datasets
    from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import Perceptron
    from sklearn.metrics import accuracy_score

    iris = …
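For reference, a sketch of how this example presumably continues, following the book's standard Iris workflow; the chosen feature columns, split ratio, and perceptron hyperparameters here are illustrative rather than quoted from the book:

    import numpy as np
    from sklearn import datasets
    from sklearn.linear_model import Perceptron
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # Load Iris and keep petal length and petal width as the two features
    iris = datasets.load_iris()
    X = iris.data[:, [2, 3]]
    y = iris.target

    # Hold out 30% of the samples for testing
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    # Standardize the features, fitting the scaler on the training set only
    sc = StandardScaler()
    sc.fit(X_train)
    X_train_std = sc.transform(X_train)
    X_test_std = sc.transform(X_test)

    # Train the perceptron and evaluate it on the held-out test set
    ppn = Perceptron(max_iter=40, eta0=0.1, random_state=0)
    ppn.fit(X_train_std, y_train)
    y_pred = ppn.predict(X_test_std)
    print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))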

How to correctly perform cross validation in scikit-learn?

蓝咒 submitted on 2020-12-05 12:25:51
Question: I am trying to cross-validate a k-NN classifier and I am confused about which of the two methods below performs cross validation correctly.

    training_scores = defaultdict(list)
    validation_f1_scores = defaultdict(list)
    validation_precision_scores = defaultdict(list)
    validation_recall_scores = defaultdict(list)
    validation_scores = defaultdict(list)

    def model_1(seed, X, Y):
        np.random.seed(seed)
        scoring = ['accuracy', 'f1_macro', 'precision_macro', 'recall_macro']
        model = …
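For reference, a minimal sketch of the approach usually recommended for this kind of question: let cross_validate handle the splitting and refitting, and request all four metrics at once. The k-NN settings and dataset are placeholders:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_validate
    from sklearn.neighbors import KNeighborsClassifier

    X, Y = load_iris(return_X_y=True)

    # cross_validate performs the train/validation splitting internally,
    # refitting the classifier from scratch on each of the cv folds, so
    # no information leaks between folds.
    scoring = ['accuracy', 'f1_macro', 'precision_macro', 'recall_macro']
    model = KNeighborsClassifier(n_neighbors=5)
    scores = cross_validate(model, X, Y, cv=5, scoring=scoring)

    # Results come back as 'test_<metric>' arrays, one entry per fold
    for metric in scoring:
        print(metric, scores['test_' + metric].mean())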

Kernel ridge and simple Ridge with Polynomial features

☆樱花仙子☆ submitted on 2020-12-05 12:15:37
Question: What is the difference between Kernel Ridge (from sklearn.kernel_ridge) with a polynomial kernel and using PolynomialFeatures + Ridge (from sklearn.linear_model)?

Answer 1: The difference lies in how the features are computed. PolynomialFeatures explicitly computes polynomial combinations of the input features up to the desired degree, while KernelRidge(kernel='poly') only considers a polynomial kernel (a polynomial representation of feature dot products), which will be expressed in terms of the original …
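A minimal sketch contrasting the two approaches on toy data; the degree and regularization values are arbitrary choices for illustration:

    import numpy as np
    from sklearn.kernel_ridge import KernelRidge
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.RandomState(0)
    X = rng.uniform(-1, 1, size=(100, 1))
    y = np.sin(3 * X).ravel() + 0.1 * rng.randn(100)

    # Explicit route: materialize all monomials up to degree 3, then fit
    # an ordinary ridge regression in that expanded feature space.
    explicit = make_pipeline(PolynomialFeatures(degree=3), Ridge(alpha=1.0))
    explicit.fit(X, y)

    # Kernel route: never materializes the expanded features; works with
    # the polynomial kernel (gamma * <x, x'> + coef0) ** degree instead.
    kernel = KernelRidge(kernel='poly', degree=3, alpha=1.0, gamma=1.0, coef0=1.0)
    kernel.fit(X, y)

    # The fits are similar but generally not identical: the kernel's
    # implicit feature map weights the monomials differently than the
    # explicit expansion, so the regularization acts differently too.
    print(explicit.predict(X[:3]))
    print(kernel.predict(X[:3]))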

ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT

巧了我就是萌 submitted on 2020-12-03 07:26:12
Question: I have a dataset consisting of both numeric and categorical data, and I want to predict adverse outcomes for patients based on their medical characteristics. I defined a prediction pipeline for my dataset like so:

    X = dataset.drop(columns=['target'])
    y = dataset['target']

    # define categorical and numeric transformers
    numeric_transformer = Pipeline(steps=[
        ('knnImputer', KNNImputer(n_neighbors=2, weights="uniform")),
        ('scaler', StandardScaler())])
    categorical_transformer = Pipeline(steps=[(…
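The warning itself usually means the lbfgs solver ran out of iterations before converging. Below is a sketch of the two standard remedies, scaling the numeric features and raising max_iter, built around the excerpt's pipeline; the column names, toy data, and LogisticRegression as the final estimator are assumptions, not details from the question:

    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import KNNImputer, SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Hypothetical stand-in for the question's patient dataset
    dataset = pd.DataFrame({
        'age': [34, 51, np.nan, 47, 62, 29],
        'blood_pressure': [120, 140, 135, np.nan, 150, 118],
        'sex': ['M', 'F', 'F', 'M', 'F', 'M'],
        'smoker': ['yes', 'no', np.nan, 'yes', 'no', 'no'],
        'target': [0, 1, 1, 0, 1, 0]})
    X = dataset.drop(columns=['target'])
    y = dataset['target']

    numeric_features = ['age', 'blood_pressure']
    categorical_features = ['sex', 'smoker']

    numeric_transformer = Pipeline(steps=[
        ('knnImputer', KNNImputer(n_neighbors=2, weights='uniform')),
        # Scaling is the usual first fix: lbfgs converges poorly when
        # features sit on very different scales.
        ('scaler', StandardScaler())])

    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])

    preprocessor = ColumnTransformer(transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

    # If scaled data still triggers the warning, give the solver more
    # iterations than the default 100 via max_iter.
    clf = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', LogisticRegression(max_iter=1000))])
    clf.fit(X, y)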

Machine Learning | A simple hands-on machine learning exercise: Boston house price prediction

旧时模样 submitted on 2020-12-02 16:37:21
This article uses the Boston HousePrice dataset from Kaggle to demonstrate the typical process of building a machine learning model, covering the following stages: data acquisition, data cleaning, exploratory data analysis, feature engineering, model building, and model ensembling. The target variable (house price) was log-transformed so that it approximately follows a normal distribution. From twelve candidate models, the six with the best predictive performance (Lasso, Ridge, SVR, KernelRidge, ElasticNet, BayesianRidge) were selected and combined via both weighted averaging and stacking; stacking performed better. One novel twist was to append the stacked predictions to the original training set and retrain the stacking ensemble, which improved performance further; this retrained model served as the final predictor, and its predictions scored well when submitted to Kaggle. Owing to limited training time, the hyperparameter search space was small and leaves room for improvement. The stacking step is sketched below. Data acquisition: Kaggle provides a large number of machine learning datasets; this article uses the Boston HousePrice dataset, available at https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data. The download contains four files: train.csv, test.csv, data_description.txt, and sample_submission.csv. As the names suggest, train.csv is the training set used to fit the model, and test…
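The stacking step described above might look like the following sketch, using scikit-learn's StackingRegressor with the six model families the article names; the synthetic data, log transform, and hyperparameters are placeholder assumptions, not the article's tuned values:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import StackingRegressor
    from sklearn.kernel_ridge import KernelRidge
    from sklearn.linear_model import BayesianRidge, ElasticNet, Lasso, Ridge
    from sklearn.svm import SVR

    # Synthetic stand-in for the Kaggle training data
    X, y = make_regression(n_samples=200, n_features=20, noise=10, random_state=0)
    y = np.log1p(y - y.min())  # mimic the article's log transform of the target

    base_models = [
        ('lasso', Lasso(alpha=0.001)),
        ('ridge', Ridge(alpha=1.0)),
        ('svr', SVR(C=1.0)),
        ('kridge', KernelRidge(alpha=0.5, kernel='polynomial', degree=2)),
        ('enet', ElasticNet(alpha=0.001, l1_ratio=0.5)),
        ('bayes', BayesianRidge()),
    ]

    # Each base model's out-of-fold predictions become input features
    # for the final estimator, which learns how to combine them.
    stack = StackingRegressor(estimators=base_models, final_estimator=Ridge(), cv=5)
    stack.fit(X, y)
    print(stack.predict(X[:5]))

The article's extra twist, appending the stacked predictions to the training set and retraining, would sit on top of this basic step.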

scikit-learn

感情迁移 submitted on 2020-12-02 08:21:30
scikit-learn is a machine learning toolkit built on the Python language.

- Simple and efficient tools for data mining and data analysis
- Reusable in a variety of contexts
- Built on NumPy, SciPy, and matplotlib
- Open source and commercially usable under the BSD license

Machine learning problems:

Supervised learning: the data comes with additional attributes that we want to predict (the relevant attributes are known for the training data).

1. Classification: samples belong to two or more classes; we train on labeled data and predict the classes of unlabeled data. Put another way, the output is discrete, and we want to label each sample with the correct class.
2. Regression: used when the desired output consists of one or more continuous variables, for example predicting a person's weight as a function of their height.

A short sketch contrasting the two settings follows this entry.

Source: oschina. Link: https://my.oschina.net/u/3955849/blog/2997421
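A small sketch of the two problem types using scikit-learn's built-in datasets:

    from sklearn.datasets import load_diabetes, load_iris
    from sklearn.linear_model import LinearRegression, LogisticRegression

    # Classification: the target is a discrete class label
    X_c, y_c = load_iris(return_X_y=True)
    clf = LogisticRegression(max_iter=1000).fit(X_c, y_c)
    print(clf.predict(X_c[:2]))  # predicted class labels

    # Regression: the target is a continuous value
    X_r, y_r = load_diabetes(return_X_y=True)
    reg = LinearRegression().fit(X_r, y_r)
    print(reg.predict(X_r[:2]))  # predicted continuous values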

Sklearn pass fit() parameters to xgboost in pipeline

本小妞迷上赌 submitted on 2020-12-02 07:29:48
Question: Similar to "How to pass a parameter to only one part of a pipeline object in scikit learn?", I want to pass parameters to only one part of a pipeline. Usually, it should work fine like:

    estimator = XGBClassifier()
    pipeline = Pipeline([
        ('clf', estimator)
    ])

and be executed like:

    pipeline.fit(X_train, y_train, clf__early_stopping_rounds=20)

but it fails with:

    /usr/local/lib/python3.5/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
        114 """
        115 Xt, yt, fit_params = self._pre…
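A sketch of the usual working pattern, assuming an xgboost release where early_stopping_rounds is still accepted by fit() (recent releases moved it to the XGBClassifier constructor); the dataset is synthetic:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    pipeline = Pipeline([('clf', XGBClassifier())])

    # The clf__ prefix routes each keyword argument to the fit() of the
    # pipeline step named 'clf'; early stopping also needs an eval_set
    # to monitor. Note that the eval_set is not transformed by earlier
    # pipeline steps, which is harmless here since 'clf' is the only step.
    pipeline.fit(
        X_train, y_train,
        clf__early_stopping_rounds=20,
        clf__eval_set=[(X_val, y_val)],
        clf__verbose=False)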

CountVectorizer: Vocabulary wasn't fitted

♀尐吖头ヾ submitted on 2020-12-02 05:57:31
Question: I instantiated a sklearn.feature_extraction.text.CountVectorizer object by passing a vocabulary through the vocabulary argument, but I get the error message "sklearn.utils.validation.NotFittedError: CountVectorizer - Vocabulary wasn't fitted." Why?

Example:

    import sklearn.feature_extraction
    import numpy as np
    import pickle

    # Save the vocabulary
    ngram_size = 1
    dictionary_filepath = 'my_unigram_dictionary'
    vectorizer = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size, ngram…
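A common cause reported for this error is persisting the vocabulary constructor parameter (often None) instead of the learned vocabulary_ attribute, with its trailing underscore. A minimal sketch of the working round trip, with a toy corpus standing in for the real data:

    import pickle
    from sklearn.feature_extraction.text import CountVectorizer

    ngram_size = 1
    dictionary_filepath = 'my_unigram_dictionary'

    # Fit on some corpus so a vocabulary is actually learned
    corpus = ['the cat sat', 'the dog sat']
    vectorizer = CountVectorizer(ngram_range=(ngram_size, ngram_size))
    vectorizer.fit(corpus)

    # Persist the fitted vocabulary_ (a dict mapping term -> column index),
    # not the vocabulary constructor parameter, which may be None.
    with open(dictionary_filepath, 'wb') as f:
        pickle.dump(vectorizer.vocabulary_, f)

    # Later: load the dict and build a vectorizer with a fixed vocabulary;
    # transform() then works without refitting.
    with open(dictionary_filepath, 'rb') as f:
        vocabulary = pickle.load(f)
    new_vectorizer = CountVectorizer(ngram_range=(ngram_size, ngram_size),
                                     vocabulary=vocabulary)
    print(new_vectorizer.transform(['the cat and the dog']).toarray())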