regression

regression model evaluation using scikit-learn

Submitted by 痞子三分冷 on 2019-12-07 14:11:41
Question: I am doing regression with sklearn and using randomized grid search to evaluate different parameters. Here is a toy example: from sklearn.datasets import make_regression from sklearn.metrics import mean_squared_error, make_scorer from scipy.stats import randint as sp_randint from sklearn.ensemble import ExtraTreesRegressor from sklearn.cross_validation import LeaveOneOut from sklearn.grid_search import GridSearchCV, RandomizedSearchCV X, y = make_regression(n_samples=10, n_features=10, n
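For readers skimming this entry, here is a minimal, self-contained sketch of randomized search for a regressor with an MSE scorer. It is written against the current sklearn.model_selection API rather than the older sklearn.grid_search and sklearn.cross_validation modules imported in the question, and the dataset size, parameter ranges, and n_iter are illustrative assumptions, not the asker's settings.

import numpy as np
from scipy.stats import randint as sp_randint
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.model_selection import LeaveOneOut, RandomizedSearchCV

# Small synthetic regression problem (placeholder data).
X, y = make_regression(n_samples=50, n_features=10, random_state=0)

# mean_squared_error is a loss, so greater_is_better=False makes the search
# maximize the negated MSE, i.e. minimize the MSE itself.
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

param_dist = {
    "n_estimators": sp_randint(10, 50),
    "max_depth": sp_randint(2, 10),
    "min_samples_split": sp_randint(2, 10),
}

search = RandomizedSearchCV(
    ExtraTreesRegressor(random_state=0),
    param_distributions=param_dist,
    n_iter=10,
    scoring=mse_scorer,
    cv=LeaveOneOut(),   # leave-one-out CV, as in the question's imports
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)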

A Beginner's Essential Path to Learning Basic Machine Learning Algorithms (Part 2)

Submitted by ぃ、小莉子 on 2019-12-07 14:03:54
In the previous article, A Beginner's Essential Path to Learning Basic Machine Learning Algorithms (Part 1), we briefly covered Linear Regression, Logistic Regression, Decision Trees, Support Vector Machines (SVM), and Naive Bayes. We now continue with the next five algorithms:

K-Nearest Neighbors (KNN)
k-NN is one of the simplest classification algorithms. The main idea is to compute the dissimilarity between the sample to be classified and each training sample, sort those differences from smallest to largest, take the K training samples with the smallest differences, and assign the new sample to the class that appears most often among those K neighbors. This is much like a voting mechanism. k-NN is instance-based learning, so to use it we must have training data that is close to the real data.
Pros: high accuracy, insensitive to outliers, no assumptions about the input data.
Cons: high time and space complexity; it does not reveal the underlying characteristics of the samples.
Data types: numeric and nominal.

K-Means
K-means is a classic distance-based clustering algorithm: distance is used as the measure of similarity, so the closer two objects are, the more similar they are considered to be. The algorithm assumes that a cluster is made up of objects that lie close together, so its final goal is to produce clusters that are compact and well separated. The choice of the K initial cluster centers has a large influence on the result, because in the first step the algorithm picks K objects at random as the initial centers, each initially representing one cluster. In each iteration, the algorithm then takes every remaining object in the dataset
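To make the two algorithms above concrete, here is a minimal scikit-learn sketch of both on the iris toy dataset; the choice of dataset and the hyperparameters (3 neighbors, 3 clusters) are illustrative assumptions rather than anything prescribed by the article.

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k-NN: each test sample takes the majority class among its 3 nearest training samples.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print("k-NN test accuracy:", knn.score(X_test, y_test))

# K-means: partition the unlabelled data into 3 distance-based clusters.
# The randomly chosen initial centers are refined iteratively by re-assigning
# each object to its nearest center and then recomputing the centers.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = km.fit_predict(X)
print("K-means cluster centers:\n", km.cluster_centers_)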

Logistic regression results different in scikit-learn (Python) and R?

Submitted by 淺唱寂寞╮ on 2019-12-07 13:04:40
Question: I was running logistic regression on the iris dataset in both R and Python, but the two give different results (coefficients, intercept and scores). #Python codes. In[23]: iris_df.head(5) Out[23]: Sepal.Length Sepal.Width Petal.Length Petal.Width Species 0 5.1 3.5 1.4 0.2 0 1 4.9 3.0 1.4 0.2 0 2 4.7 3.2 1.3 0.2 0 3 4.6 3.1 1.5 0.2 0 In[35]: iris_df.shape Out[35]: (100, 5) #looking at the levels of the Species dependent variable.. In[25]: iris_df['Species'].unique() Out[25]: array([0, 1], dtype
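One frequent cause of this kind of discrepancy (offered here as a hedged sketch, since the full question and answer are truncated above) is that scikit-learn's LogisticRegression applies L2 regularization by default (C=1.0), whereas R's glm(..., family = binomial) fits an unpenalized maximum-likelihood model. The example below illustrates the effect; the CSV path is hypothetical and the column names simply mirror the question's dataframe.

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical file holding the two-species iris subset from the question:
# columns Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species (0/1).
iris_df = pd.read_csv("iris_two_species.csv")
X = iris_df[["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]]
y = iris_df["Species"]

# Default sklearn fit: L2-penalized (C=1.0), so coefficients are shrunk and
# will generally not match R's unpenalized glm estimates.
penalized = LogisticRegression().fit(X, y)

# A very large C weakens the penalty, bringing the fit close to plain
# maximum likelihood and hence much closer to R's coefficients.
nearly_unpenalized = LogisticRegression(C=1e9, max_iter=10000).fit(X, y)
print(penalized.coef_, nearly_unpenalized.coef_)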

Quantile regression and p-values - getting more decimal places

Submitted by 故事扮演 on 2019-12-07 11:27:29
Question: Using R and the quantreg package, I am performing quantile regression analyses on my data. I can access the p-values using the se (standard error) estimator in the summary function, as below; however, I only get 5 decimal places and would like more. model <- rq(outcome ~ predictor) summary(model, se="ker") Call: rq(formula = outcome ~ predictor) tau: [1] 0.5 Coefficients: Value Std. Error t value Pr(>|t|) (Intercept) 78.68182 2.89984 27.13312 0.00000 predictor 0.22727 0.03885 5.84943 0

How can I add regression lines to a plot that has multiple data series colour-coded by a factor?

Submitted by 喜你入骨 on 2019-12-07 11:19:40
Question: I wish to add regression lines to a plot that has multiple data series colour-coded by a factor. Using a brewer.pal palette, I created a plot with the data points coloured by factor (plant$ID). Below is an example of the code: palette(brewer.pal(12,"Paired")) plot(x=plant$TL, y=plant$d15N, xlab="Total length (mm)", ylab="d15N", col=plant$ID, pch=16) legend(locator(1), legend=levels(factor(plant$ID)), text.col="black", pch=16, col=c(brewer.pal(12,"Paired")), cex=0.6) Is there an easy

Time series prediction using support vector regression

Submitted by 蹲街弑〆低调 on 2019-12-07 05:30:25
Question: I have been trying to implement a time-series prediction tool using support vector regression in Python. I use the SVR module from scikit-learn for non-linear support vector regression. But I have a serious problem with predicting future events: the regression line fits the original function very well (on known data), but as soon as I want to predict future steps, it returns the value from the last known step. My code looks like this: import numpy as np from matplotlib import pyplot as plt from
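A common way around this flat-line behaviour (offered here only as a sketch, since the asker's full code is truncated above) is to train the SVR on lagged windows of the series and then forecast recursively, feeding each prediction back in as part of the next input window. The synthetic sine-wave data, window size, and SVR hyperparameters below are illustrative assumptions.

import numpy as np
from sklearn.svm import SVR

# Synthetic series standing in for the known historical data.
series = np.sin(np.linspace(0, 20, 200))
window = 10  # number of past values used to predict the next one (assumed)

# Build (lag window -> next value) training pairs.
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]

model = SVR(kernel="rbf", C=10.0, gamma=0.5)
model.fit(X, y)

# Recursive multi-step forecast: each prediction is appended to the history
# and becomes part of the input window for the following step.
history = list(series)
forecast = []
for _ in range(30):
    x_next = np.array(history[-window:]).reshape(1, -1)
    y_next = model.predict(x_next)[0]
    forecast.append(y_next)
    history.append(y_next)
print(forecast[:5])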

Scikit-Learn Classification and Regression with Weights

Submitted by 谁都会走 on 2019-12-07 04:20:28
Question: How can I do classification or regression in sklearn if I want to weight each sample differently? Is there a way to do it with a custom loss function? If so, what does that loss function look like in general? Is there an easier way? Answer 1: To weight individual samples, feed a sample_weight array to the estimator's fit method. This should be a 1-d array of length n_samples (i.e. the same dimension as y in most tasks): estimator.fit(X, y, sample_weight=some_array) Not all models support this,
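As a small illustration of the sample_weight mechanism described in the answer, here is a self-contained sketch on synthetic data; the choice of model and the 5x upweighting of the positive class are purely illustrative assumptions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)

# One weight per sample, same length as y; positive samples count five times as much.
sample_weight = np.where(y == 1, 5.0, 1.0)

clf = RandomForestClassifier(random_state=0)
clf.fit(X, y, sample_weight=sample_weight)
print(clf.score(X, y))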

A Beginner's Essential Path to Learning Basic Machine Learning Algorithms (Part 1)

Submitted by 江枫思渺然 on 2019-12-06 23:16:39
Common machine learning algorithms
The following are the most commonly used machine learning algorithms; most data problems can be solved with them:
1. Linear Regression
2. Logistic Regression
3. Decision Tree
4. Support Vector Machine (SVM)
5. Naive Bayes
6. K-Nearest Neighbors (KNN)
7. K-Means
8. Random Forest
9. Dimensionality Reduction Algorithms
10. Gradient Boosting and AdaBoost

Linear Regression
Linear regression uses regression analysis from mathematical statistics to determine the quantitative relationship of interdependence between two or more variables; it is very widely used. It takes the form y = w'x + e, where the error e follows a normal distribution with mean 0. Least squares is one way to compute a linear regression. You can think of linear regression as the task of drawing a suitable straight line through a set of points. There are many ways to do this; "least squares" works as follows: you draw a line, measure the vertical distance between each data point and the line, and add these distances up; the fitted line is the one for which this total distance is as small as possible (see the code sketch after this excerpt).

Logistic Regression
Logistic regression is a powerful statistical method that models a binary outcome with one (or more) explanatory variables
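The following is a small sketch of the least-squares idea described in the linear regression paragraph above, fitting y = w*x + b on synthetic data both with NumPy's closed-form solver and with scikit-learn; the data and the true coefficients are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(0, 1, size=100)  # y = w*x + b plus N(0, 1) noise

# Closed-form least squares: choose w and b to minimize the sum of the
# squared vertical distances between the points and the line.
A = np.column_stack([x, np.ones_like(x)])
w, b = np.linalg.lstsq(A, y, rcond=None)[0]

# The same fit via scikit-learn.
model = LinearRegression().fit(x.reshape(-1, 1), y)
print(w, b, model.coef_[0], model.intercept_)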

plot.lm(): extracting numbers labelled in the diagnostic Q-Q plot

Submitted by 前提是你 on 2019-12-06 19:34:48
Question: For the simple example below, you can see that certain points are labelled with numbers in the resulting diagnostic plots. How can I extract the row numbers identified in these plots, especially in the Normal Q-Q plot? set.seed(2016) maya <- data.frame(rnorm(100)) names(maya)[1] <- "a" maya$b <- rnorm(100) mara <- lm(b~a, data=maya) plot(mara) I tried using str(mara) to see if I could find a list there, but I can't see any of the numbers from the Normal Q-Q plot. Thoughts? Answer 1: I have edited your

Any simple way to get regression prediction intervals in R?

Submitted by 泄露秘密 on 2019-12-06 16:25:56
Question: I am working on a big data set with over 300K elements and running regression analysis to estimate a parameter called Rate from the predictor variable Distance. I have the regression equation. Now I want to get the confidence and prediction intervals. I can easily get the confidence intervals for the coefficients with the command: > confint(W1500.LR1, level = 0.95) 2.5 % 97.5 % (Intercept) 666.2817393 668.0216072 Distance 0.3934499 0.3946572 which gives me the upper and lower