mean

Box plot showing mean as a line

元气小坏坏 submitted on 2019-12-22 05:04:44
Question: Is it possible to create a boxplot that shows both the mean and the median as a line with the standard boxplot function of R? My current solution displays the mean as a cross:

    set.seed(1234)
    values <- runif(10,0,1)
    boxplot(values)
    points(mean(values),col="red",pch=4,lwd = 4)

Answer 1: For the sake of completeness, you could also overplot:

    set.seed(753)
    df <- data.frame(y=rt(100, 4), x=gl(5, 20))
    bx.p <- boxplot(y~x, df)
    bx.p$stats[3, ] <- unclass(with(df, by(y, x, FUN = mean)))
    bxp(bx.p, add=T, boxfill=

Reading notes on "Python for Data Analysis": Chapter 9, Data Aggregation and Group Operations (Part 1)

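北慕城南 submitted on 2019-12-22 00:08:08
http://www.cnblogs.com/batteryhp/p/5046450.html

Grouping data and applying a function to each group is an important part of data analysis. Once the data is prepared, the usual next task is to compute group statistics or build pivot tables. The groupby function handles this efficiently and lets you slice, dice, and summarize the data. This is closely related to SQL, but many more functions are available. In this chapter you will learn how to:

- split a pandas object by one or more keys (which can be functions, arrays, or DataFrame column names)
- compute group summary statistics such as count, mean, and standard deviation, or apply a user-defined function
- apply a variety of functions to the columns of a DataFrame
- apply within-group transformations or other operations, such as normalization, linear regression, ranking, or subset selection
- compute pivot tables and cross-tabulations
- perform quantile analysis and other group analyses

Aggregation of time series data is also called resampling and is covered in Chapter 10.

1. GroupBy mechanics

Many data processing tasks follow a "split-apply-combine" process: the data is split into groups by one or more keys, a function is applied to each group, and the results are combined.

Group keys can take several forms:

- a list or array with the same length as the axis being grouped
- a value indicating a DataFrame column name
- a dict or Series giving the correspondence between the values on the axis being grouped and the group names
- a function applied to the axis index or to the individual labels in the index

Now for some examples.

A simple example

    # -*- encoding: utf-8 -*-
    # grouping example
    import numpy as np
    import pandas as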
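
The split-apply-combine process described above can be sketched in plain Python. This is only an illustration of the idea, not how pandas implements it; the helper name group_mean and the sample records (keys "key1" and "data1") are hypothetical and do not come from the notes.

```python
from collections import defaultdict
from statistics import mean


def group_mean(records, key, value):
    """Split records by `key`, apply mean to `value`, combine into a dict."""
    groups = defaultdict(list)
    for record in records:                      # split: bucket values by group key
        groups[record[key]].append(record[value])
    # apply + combine: one mean per group, collected into a single result
    return {name: mean(values) for name, values in groups.items()}


# Hypothetical sample data, made up for this sketch.
records = [
    {"key1": "a", "data1": 1.0},
    {"key1": "b", "data1": 2.0},
    {"key1": "a", "data1": 3.0},
    {"key1": "b", "data1": 6.0},
]
result = group_mean(records, "key1", "data1")
print(result)
```

With the same data in a DataFrame df, the pandas counterpart would be df.groupby("key1")["data1"].mean(), which performs the three steps in one call.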

Machine learning: regression problems (linear regression, ridge regression, stepwise regression)

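╄→尐↘猪︶ㄣ submitted on 2019-12-21 19:56:51
1. Linear regression

Linear regression multiplies each input by a constant and adds the results together to produce the output. Suppose the input data is stored in the matrix X and the regression coefficients in the vector w. For a single input vector x, the prediction is then y = x^T w. The core of fitting a linear regression model is therefore solving for w. How do we do that? We want the error between the predicted values and the true values to be as small as possible, so we can judge how good w is by the difference between the predicted and true values. That difference can be positive or negative, so to keep positive and negative errors from cancelling each other out we use the squared error (that is, the method of least squares). The squared error can also be called the loss function. Our task is now to minimize the loss function with w as the variable. We differentiate with respect to w and set the derivative to zero, which gives the formula needed to compute w.

Locally weighted linear regression

One problem with linear regression is that it may underfit, because it seeks the unbiased estimate with the minimum mean squared error. An underfit model obviously cannot predict well, so some methods allow a little bias to be introduced into the estimate in order to reduce the mean squared error of the predictions. One such method is locally weighted linear regression. In this algorithm we assign a weight to each point near the point to be predicted, and then run an ordinary regression on that subset based on the minimum mean squared error. The basic idea of locally weighted linear regression is that, when the cost function is designed, points near the point to be predicted get a higher weight, and the weight shrinks as the distance grows. This is where the "local" and "weighted" in the name come from. How the weights are obtained: the difference is that the cost function now contains an extra weight function W, and W must guarantee that the closer a point is to the point to be predicted, the larger its weight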
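
To make the two ideas above concrete, here is a minimal sketch for the one-feature case y ≈ b0 + b1*x, where the normal equations reduce to a 2x2 system that can be solved by hand. The function names are hypothetical, and the Gaussian weight exp(-(x - x0)^2 / (2 k^2)) with bandwidth k is an assumption: it is a common choice of weight function, but the excerpt does not say which one the author uses.

```python
import math


def weighted_line_fit(xs, ys, weights):
    """Minimize sum(w * (y - b0 - b1*x)^2) via the 2x2 normal equations."""
    sw = sum(weights)
    swx = sum(w * x for w, x in zip(weights, xs))
    swy = sum(w * y for w, y in zip(weights, ys))
    swxx = sum(w * x * x for w, x in zip(weights, xs))
    swxy = sum(w * x * y for w, x, y in zip(weights, xs, ys))
    det = sw * swxx - swx * swx
    if det == 0:
        raise ValueError("normal equations are singular")
    b1 = (sw * swxy - swx * swy) / det
    b0 = (swy - b1 * swx) / sw
    return b0, b1


def ols_predict(xs, ys, x0):
    """Ordinary least squares: every point gets the same weight."""
    b0, b1 = weighted_line_fit(xs, ys, [1.0] * len(xs))
    return b0 + b1 * x0


def lwlr_predict(xs, ys, x0, k=1.0):
    """Locally weighted fit: points near x0 get larger weights (assumed Gaussian kernel)."""
    weights = [math.exp(-((x - x0) ** 2) / (2.0 * k * k)) for x in xs]
    b0, b1 = weighted_line_fit(xs, ys, weights)
    return b0 + b1 * x0


# Hypothetical data: a curve that a single straight line underfits.
curve_x = [float(i) for i in range(11)]
curve_y = [x * x for x in curve_x]
print(ols_predict(curve_x, curve_y, 5.0), lwlr_predict(curve_x, curve_y, 5.0, k=1.0))
```

A smaller k makes the fit more local (and more sensitive to noise), while a very large k gives every point nearly the same weight, so the result approaches the ordinary least-squares fit.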

column vector with row means — with std::accumulate?

泪湿孤枕 submitted on 2019-12-21 17:22:21
Question: In an effort to be as lazy as possible I read in a matrix as

    vector< vector<double> > data ( rows, vector<double> ( columns ) );

and try to use as many STL goodies as I can. One thing I need to do next is to compute the row means. In C-style programming that would be

    vector<double> rowmeans( data.size() );
    for ( int i=0; i<data.size(); i++ )
        for ( int j=0; j<data[i].size(); j++ )
            rowmeans[i] += data[i][j]/data[i].size();

In In C++, how to compute the mean of a vector of integers using a

TensorFlow: computing the mean of a tensor

守給你的承諾、 submitted on 2019-12-21 14:09:39
https://www.tensorflow.org/versions/r0.12/api_docs/python/math_ops.html#reduce_mean

    tf.reduce_mean(input_tensor, axis=None, keep_dims=False, name=None, reduction_indices=None)

Computes the mean of the elements across the dimensions of a tensor, reducing along the given dimension axis. If keep_dims is set to false, the rank of the tensor is reduced by 1 for each reduced dimension. If axis is not given, the mean of all elements is returned.

Example:

    # 'x' is [[1., 1.]
    #         [2., 2.]]
    tf.reduce_mean(x) ==> 1.5
    tf.reduce_mean(x, 0) ==> [1.5, 1.5]
    tf.reduce_mean(x, 1) ==> [1., 2.]

Source: https://www.cnblogs.com/huangshiyu13/p/6534264.html
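
To see what the axis argument does in the example above, the same reductions can be reproduced on a nested list in plain Python. This is an illustrative sketch only: reduce_mean_2d is a hypothetical helper for 2-D input, not part of TensorFlow, and it always drops the reduced dimension (the keep_dims=False behaviour).

```python
def reduce_mean_2d(rows, axis=None):
    """Mean of a 2-D nested list over all elements, down columns (0), or across rows (1)."""
    n_rows, n_cols = len(rows), len(rows[0])
    if axis is None:
        # no axis: average every element, result is a scalar
        return sum(sum(row) for row in rows) / (n_rows * n_cols)
    if axis == 0:
        # axis 0 is reduced away: one mean per column
        return [sum(row[j] for row in rows) / n_rows for j in range(n_cols)]
    if axis == 1:
        # axis 1 is reduced away: one mean per row
        return [sum(row) / n_cols for row in rows]
    raise ValueError("axis must be None, 0, or 1 for 2-D input")


x = [[1.0, 1.0],
     [2.0, 2.0]]
print(reduce_mean_2d(x))      # 1.5
print(reduce_mean_2d(x, 0))   # [1.5, 1.5]
print(reduce_mean_2d(x, 1))   # [1.0, 2.0]
```

The axis you pass is the one that disappears from the result: reducing over axis 0 collapses the rows and leaves one value per column.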

How to create mean and s.d. columns in data.table

无人久伴 submitted on 2019-12-21 13:41:35
Question: The following code/outcome baffles me as to why data.table returns NA for the mean function and not the sd function.

    library(data.table)
    test <- data.frame('id'=c(1,2,3,4,5),
                       'A'=seq(2,9,length=5),
                       'B'=seq(3,9,length=5),
                       'C'=seq(4,9,length=5),
                       'D'=seq(5,9,length=5))
    test <- as.data.table(test)
    test[,`:=`(mean_test = mean(.SD), sd_test = sd(.SD)),by=id,.SDcols=c('A','B','C','D')]

    > test
       id    A   B    C D mean_test   sd_test
    1:  1 2.00 3.0 4.00 5        NA 1.2909944
    2:  2 3.75 4.5 5.25 6        NA 0.9682458
    3:  3 5.50

scikit-learn: 3.3. Model evaluation: quantifying the quality of predictions

拈花ヽ惹草 submitted on 2019-12-21 03:42:48
Reference: http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

There are three ways to evaluate the quality of a model's predictions:

- Estimator score method: estimators have a score method that serves as their default evaluation criterion. It is not covered in this section; see the documentation of each estimator for details.
- Scoring parameter: model-evaluation tools using cross-validation (such as cross_validation.cross_val_score and grid_search.GridSearchCV) rely on an internal scoring strategy. This is discussed in "The scoring parameter: defining model evaluation rules" (see subsection 1).
- Metric functions: the metrics module gives a more comprehensive assessment of prediction quality. This section discusses Classification metrics, Multilabel ranking metrics, Regression metrics and Clustering metrics (see subsections 2, 3, 4 and 5).

compute mean in python for a generator

十年热恋 submitted on 2019-12-21 03:39:28
Question: I'm doing some statistics work and have a (large) collection of random numbers whose mean I need to compute. I'd like to work with generators: because I only need the mean, I don't need to store the numbers. The problem is that numpy.mean breaks if you pass it a generator. I can write a simple function to do what I want, but I'm wondering if there's a proper, built-in way to do this? It would be nice if I could say "sum(values)/len(values)", but len doesn't work for generators, and sum
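
One possible shape for the "simple function" the question mentions is a running mean that consumes the generator one value at a time and never stores it. This is a minimal sketch, not an answer taken from the thread; running_mean is a hypothetical name.

```python
def running_mean(values):
    """Mean of any iterable, computed incrementally in constant memory."""
    count = 0
    mean = 0.0
    for value in values:
        count += 1
        # move the current mean toward the new value by 1/count of the gap
        mean += (value - mean) / count
    if count == 0:
        raise ValueError("mean of an empty iterable is undefined")
    return mean


# Works on a generator expression: no list is ever built.
print(running_mean(x * x for x in range(1, 5)))
```

The incremental update also avoids building one very large sum before dividing. On the standard-library side, statistics.mean (and statistics.fmean on Python 3.8+) accept arbitrary iterables, although this sketch makes no claim about how much memory they use internally.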

Finding the mean and standard deviation of a timedelta object in pandas df

馋奶兔 submitted on 2019-12-21 03:37:34
Question: I would like to calculate the mean and standard deviation of a timedelta by bank from a dataframe with the two columns shown below. When I run the code (also shown below) I get the following error:

    pandas.core.base.DataError: No numeric types to aggregate

My dataframe:

    bank                          diff
    Bank of Japan                 0 days 00:00:57.416000
    Reserve Bank of Australia     0 days 00:00:21.452000
    Reserve Bank of New Zealand   55 days 12:39:32.269000
    U.S. Federal Reserve          8 days 13:27:11.387000

My code:

    means = dropped.groupby('bank')
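
The error message says the aggregation found no numeric column, so one plausible workaround is to convert each timedelta to a number of seconds, aggregate, and convert back. The sketch below shows that idea with the standard library only; it is not the thread's accepted answer, and the bank names, values, and the helper timedelta_stats are hypothetical.

```python
from collections import defaultdict
from datetime import timedelta
from statistics import mean, stdev


def timedelta_stats(rows):
    """Per-group (mean, sample standard deviation) of timedeltas, via seconds."""
    seconds_by_bank = defaultdict(list)
    for bank, delta in rows:
        # timedelta -> float seconds, so ordinary numeric statistics apply
        seconds_by_bank[bank].append(delta.total_seconds())
    stats = {}
    for bank, secs in seconds_by_bank.items():
        avg = timedelta(seconds=mean(secs))
        # the sample standard deviation needs at least two observations
        sd = timedelta(seconds=stdev(secs)) if len(secs) > 1 else None
        stats[bank] = (avg, sd)
    return stats


# Hypothetical data, made up for this sketch.
rows = [
    ("Bank A", timedelta(seconds=10)),
    ("Bank A", timedelta(seconds=20)),
    ("Bank B", timedelta(minutes=1)),
    ("Bank B", timedelta(minutes=3)),
]
stats = timedelta_stats(rows)
for bank, (avg, sd) in stats.items():
    print(bank, avg, sd)
```

In pandas the analogous step would be to turn the column into seconds first, for example with Series.dt.total_seconds(), and then group and aggregate the numeric result.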

using mean with .SD and .SDcols in data.table

血红的双手。 submitted on 2019-12-21 02:52:27
Question: I am writing a very simple function to summarize columns of data.tables. I am passing one column at a time to the function, and then doing some diagnostics to figure out the options for summarization, and then doing the summarization. I am doing this in data.table to allow for some very large datasets. So, I am using .SDcols to pass in the column to summarize, and using functions on .SD in the j part of a data.table expression. Since I am passing in one column at a time, I am not using lapply