scikit-learn

Python Data Analysis: Implementing Linear Regression in Python

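断了今生、忘了曾经 submitted on 2021-01-23 13:20:15
Linear regression is one of the fundamental techniques of statistics and machine learning. In economics, computer science, the social sciences, and many other fields, whether the task is statistical analysis, machine learning, or scientific computing, there is a good chance that a linear model will be needed. It is worth learning first, before trying more complex methods. This article shows how to implement linear regression in Python step by step. The mathematical derivation, how linear regression actually works, and how parameter choices can improve a regression model will be covered later. Regression: Regression analysis is one of the most important areas of statistics and machine learning. Many regression methods are available, and linear regression is one of them. It is probably among the most important and most widely used regression techniques, and it is also one of the simplest. One of its main advantages is that its results are easy to interpret. The main types of regression are: simple linear regression, multiple linear regression, and polynomial regression. How to implement linear regression in Python: Packages used: NumPy: NumPy is the fundamental scientific package for Python; it supports many high-performance operations on one-dimensional and multidimensional arrays. scikit-learn: scikit-learn is a widely used Python machine learning library built on NumPy and several other packages. It provides methods for preprocessing data, reducing dimensionality, and implementing regression, classification, clustering, and more. statsmodels: If you want to implement linear regression and need functionality beyond the scope of scikit-learn, consider statsmodels, which can be used to estimate statistical models, run tests, and more. Simple linear regression with scikit-learn 1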
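
The excerpt above breaks off just as it reaches simple linear regression with scikit-learn. A minimal sketch of that step might look like the following; the data are made up for illustration (they lie exactly on y = 2x + 1) and are not taken from the original article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data (an assumption of this sketch): points on y = 2x + 1
x = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)  # scikit-learn expects a 2-D feature array
y = np.array([3, 5, 7, 9, 11])

# Fit an ordinary least squares model
model = LinearRegression().fit(x, y)

r_squared = model.score(x, y)   # coefficient of determination on the training data
intercept = model.intercept_    # estimated intercept
slope = model.coef_[0]          # estimated slope for the single feature

# Predict the response for a new input
y_new = model.predict(np.array([[6]]))
```

The `reshape(-1, 1)` call matters because scikit-learn estimators expect features as a two-dimensional array with one row per sample, even when there is only one feature.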

XGboost: cannot pass validation data for eval_set in pipeline

一曲冷凌霜 submitted on 2021-01-22 12:12:22
Question: I want to implement GridSearchCV for an XGBoost model in a pipeline. I have a preprocessor for the data, defined above the code, and some grid params: XGBmodel = XGBRegressor(random_state=0) pipe = Pipeline(steps=[ ('preprocess', preprocessor), ('XGBmodel', XGBmodel) ]) I want to pass these fit params: fit_params = {"XGBmodel__eval_set": [(X_valid, y_valid)], "XGBmodel__early_stopping_rounds": 10, "XGBmodel__verbose": False} I am trying to fit the model: searchCV = GridSearchCV(pipe, cv=5, param_grid=param

XGboost: cannot pass validation data for eval_set in pipeline

守給你的承諾、 submitted on 2021-01-22 12:10:53
Question: I want to implement GridSearchCV for an XGBoost model in a pipeline. I have a preprocessor for the data, defined above the code, and some grid params: XGBmodel = XGBRegressor(random_state=0) pipe = Pipeline(steps=[ ('preprocess', preprocessor), ('XGBmodel', XGBmodel) ]) I want to pass these fit params: fit_params = {"XGBmodel__eval_set": [(X_valid, y_valid)], "XGBmodel__early_stopping_rounds": 10, "XGBmodel__verbose": False} I am trying to fit the model: searchCV = GridSearchCV(pipe, cv=5, param_grid=param

Python - linear regression TypeError: invalid type promotion

北战南征 submitted on 2021-01-21 08:39:13
Question: I am trying to run a linear regression and I think I am having issues with a data type. I have tested the code line by line, and everything works until the last line, where I get the error TypeError: invalid type promotion. Based on my research, I think it is due to the date format. Here is my code: import pandas as pd import numpy as np import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression data=pd.read_excel('C:\\Users\\Proximo
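
The question above is cut off before the failing line, so the actual cause cannot be confirmed here. If the asker's guess is right and a datetime column is being passed to the model as a feature, one common workaround is to convert the dates to plain numbers first. The sketch below assumes that situation; the DataFrame and its column names `date` and `value` are hypothetical stand-ins for the Excel data.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data standing in for the spreadsheet in the question
df = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=5, freq="D"),
    "value": [10.0, 12.0, 14.0, 16.0, 18.0],
})

# Convert each timestamp to an integer day count so the feature is numeric
X = df["date"].map(lambda d: d.toordinal()).to_numpy().reshape(-1, 1)
y = df["value"].to_numpy()

# The model now receives ordinary integers instead of datetime64 values
model = LinearRegression().fit(X, y)
slope_per_day = model.coef_[0]
```

With this encoding the slope is expressed per day; another numeric encoding (for example, days elapsed since the first observation) would work equally well.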

Scikit-Learn Decision Tree: Probability of prediction being a or b?

半世苍凉 submitted on 2021-01-21 08:27:19
Question: I have a basic decision tree classifier with Scikit-Learn: #Used to determine men from women based on height and shoe size from sklearn import tree #height and shoe size X = [[65,9],[67,7],[70,11],[62,6],[60,7],[72,13],[66,10],[67,7.5]] Y=["male","female","male","female","female","male","male","female"] #creating a decision tree clf = tree.DecisionTreeClassifier() #fitting the data to the tree clf.fit(X, Y) #predicting the gender based on a prediction prediction = clf.predict([[68,9]]) #print
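
For the question in the title above, scikit-learn classifiers expose `predict_proba`, which returns one probability per class. The following is a minimal sketch that reuses the data from the question; the `random_state=0` argument is an addition made here for reproducibility and is not part of the asker's code.

```python
from sklearn import tree

# Height and shoe size, as in the question
X = [[65, 9], [67, 7], [70, 11], [62, 6], [60, 7], [72, 13], [66, 10], [67, 7.5]]
Y = ["male", "female", "male", "female", "female", "male", "male", "female"]

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(X, Y)

# Samples must be passed as a 2-D structure: one inner list per sample
sample = [[68, 9]]

prediction = clf.predict(sample)           # predicted label for each sample
probabilities = clf.predict_proba(sample)  # one row per sample, one column per class
classes = clf.classes_                     # column order of the probabilities
```

Each probability is the fraction of training samples of that class in the leaf the sample falls into, so a fully grown tree on a small dataset will often report only 0 or 1. Limiting the depth (for example with `max_depth`) tends to give less extreme values.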

How can I plot the probability density function for a fitted Gaussian mixture model under scikit-learn?

▼魔方 西西 submitted on 2021-01-21 06:25:52
Question: I'm struggling with a rather simple task. I have a vector of floats to which I would like to fit a Gaussian mixture model with two Gaussian kernels: from sklearn.mixture import GMM gmm = GMM(n_components=2) gmm.fit(values) # values is numpy vector of floats I would now like to plot the probability density function for the mixture model I've created, but I can't seem to find any documentation on how to do this. How should I best proceed? Edit: Here is the vector of data I'm fitting. And below
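
One way to obtain the density asked about above is sketched below. It uses `GaussianMixture`, which replaced the `GMM` class in later scikit-learn releases, and its `score_samples` method, which returns the log of the density at each point. The data are synthetic, since the asker's vector is not included in the excerpt.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the asker's vector of floats (an assumption of this sketch)
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])

# GaussianMixture expects a 2-D array of shape (n_samples, n_features)
gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(values.reshape(-1, 1))

# Evaluate the fitted mixture density on an evenly spaced grid
grid = np.linspace(values.min() - 1, values.max() + 1, 500)
log_density = gmm.score_samples(grid.reshape(-1, 1))  # log p(x) at each grid point
pdf = np.exp(log_density)
```

The curve can then be drawn with matplotlib, for example `plt.plot(grid, pdf)`, optionally over `plt.hist(values, density=True)` to compare the fit with the data.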

Using sklearn's RandomizedSearchCV with SMOTE oversampling only on training folds

我是研究僧i submitted on 2021-01-21 05:34:11
Question: I have a highly unbalanced dataset (99.5:0.5). I would like to perform hyperparameter tuning on a Random Forest model using sklearn's RandomizedSearchCV. I would like each of the training folds to be oversampled using SMOTE, and then each of the tests to be evaluated on the final fold, keeping the original distribution without any oversampling. Since these test folds are highly unbalanced, I would like the tests to be evaluated using the F1 Score. I have tried the following: from sklearn

How to compare ROC AUC scores of different binary classifiers and assess statistical significance in Python? (p-value, confidence interval)

三世轮回 submitted on 2021-01-20 16:50:57
Question: I would like to compare different binary classifiers in Python. For that, I want to calculate the ROC AUC scores, measure the 95% confidence interval (CI), and compute the p-value to assess statistical significance. Below is a minimal example in scikit-learn which trains three different models on a binary classification dataset, plots the ROC curves and calculates the AUC scores. Here are my specific questions: How to calculate the 95% confidence interval (CI) of the ROC AUC scores on the test set? (e.g
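
For the first question above, one common approach is a percentile bootstrap over the test set: resample the test predictions with replacement, recompute the AUC each time, and take the 2.5th and 97.5th percentiles. The sketch below is one possible way to do this, not the asker's code; the dataset, the logistic regression model, and the number of resamples are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption of this sketch)
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # predicted probability of the positive class
auc = roc_auc_score(y_test, scores)

# Percentile bootstrap: resample test cases with replacement and recompute the AUC
rng = np.random.default_rng(0)
n_bootstraps = 1000
boot_aucs = []
for _ in range(n_bootstraps):
    idx = rng.integers(0, len(y_test), len(y_test))
    if len(np.unique(y_test[idx])) < 2:
        continue  # the AUC is undefined when a resample contains only one class
    boot_aucs.append(roc_auc_score(y_test[idx], scores[idx]))

ci_lower, ci_upper = np.percentile(boot_aucs, [2.5, 97.5])
```

This covers only the confidence interval. Comparing two classifiers on the same test set would need a paired procedure, for example bootstrapping the difference in AUC using the same resampled indices for both models.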

How to compare ROC AUC scores of different binary classifiers and assess statistical significance in Python? (p-value, confidence interval)

岁酱吖の submitted on 2021-01-20 16:42:37
Question: I would like to compare different binary classifiers in Python. For that, I want to calculate the ROC AUC scores, measure the 95% confidence interval (CI), and compute the p-value to assess statistical significance. Below is a minimal example in scikit-learn which trains three different models on a binary classification dataset, plots the ROC curves and calculates the AUC scores. Here are my specific questions: How to calculate the 95% confidence interval (CI) of the ROC AUC scores on the test set? (e.g

How to compare ROC AUC scores of different binary classifiers and assess statistical significance in Python? (p-value, confidence interval)

我的未来我决定 submitted on 2021-01-20 16:42:33
Question: I would like to compare different binary classifiers in Python. For that, I want to calculate the ROC AUC scores, measure the 95% confidence interval (CI), and compute the p-value to assess statistical significance. Below is a minimal example in scikit-learn which trains three different models on a binary classification dataset, plots the ROC curves and calculates the AUC scores. Here are my specific questions: How to calculate the 95% confidence interval (CI) of the ROC AUC scores on the test set? (e.g