statistics

How to shade a region under a curve using ggplot2

柔情痞子 posted on 2019-12-17 09:38:09
Question: I've been trying to use ggplot2 to produce a plot similar to this R graphic: xv<-seq(0,4,0.01) yv<-dnorm(xv,2,0.5) plot(xv,yv,type="l") polygon(c(xv[xv<=1.5],1.5),c(yv[xv<=1.5],yv[xv==0]),col="grey") This is as far as I've gotten with ggplot2: x<-seq(0.0,0.1699,0.0001) ytop<-dnorm(0.12,0.08,0.02) MyDF<-data.frame(x=x,y=dnorm(x,0.08,0.02)) p<-qplot(x=MyDF$x,y=MyDF$y,geom="line") p+geom_segment(aes(x=0.12,y=0,xend=0.12,yend=ytop)) I would like to shade the tail region beyond x=0.12. How would I

git find fat commit

天涯浪子 posted on 2019-12-17 08:56:31
Question: Is it possible to get info about how much space is wasted by the changes in every commit, so I can find commits that added big files or a lot of files? This is all to try to reduce the git repo size (rebasing and maybe filtering commits). Answer 1: You could do this: git ls-tree -r -t -l --full-name HEAD | sort -n -k 4 This will show the largest files at the bottom (the fourth column is the file (blob) size). If you need to look at different branches you'll want to change HEAD to those branch names. Or, put

Constructing a co-occurrence matrix in python pandas

大兔子大兔子 posted on 2019-12-17 08:25:17
Question: I know how to do this in R. But is there any function in pandas that transforms a dataframe into an n x n co-occurrence matrix containing the counts of two aspects co-occurring? For example, a matrix df: import pandas as pd df = pd.DataFrame({'TFD' : ['AA', 'SL', 'BB', 'D0', 'Dk', 'FF'], 'Snack' : ['1', '0', '1', '1', '0', '0'], 'Trans' : ['1', '1', '1', '0', '0', '1'], 'Dop' : ['1', '0', '1', '0', '1', '1']}).set_index('TFD') print df >>> Dop Snack Trans TFD AA 1 1 1 SL 0 0 1 BB 1 1 1 D0 0 1 0
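One common way to get such a matrix, offered here as a minimal sketch and not as the answer from the original thread, is to multiply the transposed frame by itself. The sketch assumes the 0/1 flags are stored as integers, whereas the question stores them as strings, so that conversion is an assumption.

```python
import pandas as pd

# Same rows as in the question, but with integer 0/1 flags
# (assumption: the question's string flags have been converted to int).
df = pd.DataFrame(
    {
        "TFD": ["AA", "SL", "BB", "D0", "Dk", "FF"],
        "Snack": [1, 0, 1, 1, 0, 0],
        "Trans": [1, 1, 1, 0, 0, 1],
        "Dop": [1, 0, 1, 0, 1, 1],
    }
).set_index("TFD")

# (columns x rows) dot (rows x columns) gives a columns x columns matrix.
# Entry [a, b] counts the rows in which both a and b are 1.
cooc = df.T.dot(df)
print(cooc)
```

The diagonal holds each column's own total count; it can be zeroed out afterwards if only pairwise counts are wanted.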

How do I calculate r-squared using Python and Numpy?

不羁的心 posted on 2019-12-17 07:02:29
Question: I'm using Python and Numpy to calculate a best fit polynomial of arbitrary degree. I pass a list of x values, y values, and the degree of the polynomial I want to fit (linear, quadratic, etc.). This much works, but I also want to calculate r (coefficient of correlation) and r-squared (coefficient of determination). I am comparing my results with Excel's best-fit trendline capability, and the r-squared value it calculates. Using this, I know I am calculating r-squared correctly for linear best
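As a rough illustration of the quantity being asked about, the sketch below fits a polynomial with `np.polyfit` and computes r-squared from its usual definition, 1 - SS_res / SS_tot. The function name `polyfit_r_squared` is hypothetical, and the excerpt does not establish whether this definition matches Excel's trendline for every degree.

```python
import numpy as np

def polyfit_r_squared(x, y, degree):
    """Fit a polynomial of the given degree and return (coefficients, r_squared).

    Hypothetical helper: r_squared is computed as 1 - SS_res / SS_tot.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    coeffs = np.polyfit(x, y, degree)        # highest power first
    y_hat = np.polyval(coeffs, x)            # fitted values
    ss_res = np.sum((y - y_hat) ** 2)        # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)     # total sum of squares
    return coeffs, 1.0 - ss_res / ss_tot

# Made-up data for illustration: y = x^2 on five points.
xs = [0, 1, 2, 3, 4]
ys = [0, 1, 4, 9, 16]
_, r2_linear = polyfit_r_squared(xs, ys, 1)
_, r2_quadratic = polyfit_r_squared(xs, ys, 2)
print(r2_linear, r2_quadratic)
```

For a degree-1 fit this value coincides with the square of the Pearson correlation between x and y, which gives a quick sanity check.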

How to plot empirical cdf in matplotlib in Python?

|▌冷眼眸甩不掉的悲伤 posted on 2019-12-17 07:00:28
Question: How can I plot the empirical CDF of an array of numbers in matplotlib in Python? I'm looking for the cdf analog of pylab's "hist" function. One thing I can think of is: from scipy.stats import cumfreq a = array([...]) # my array of numbers num_bins = 20 b = cumfreq(a, num_bins) plt.plot(b) Is that correct though? Is there an easier/better way? Thanks. Answer 1: That looks to be (almost) exactly what you want. Two things: First, the results are a tuple of four items. The third is the size of the
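A binning-free alternative, sketched here under the assumption that an exact step function is acceptable (this is not the `cumfreq` approach discussed in the answer), is to sort the data and pair each value with its cumulative fraction. The helper name `ecdf` is hypothetical.

```python
import numpy as np

def ecdf(a):
    """Return sorted values x and cumulative fractions y for the empirical CDF.

    Hypothetical helper: y[i] is the fraction of samples <= x[i].
    """
    x = np.sort(np.asarray(a, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

# Made-up sample for illustration.
x, y = ecdf([3, 1, 2, 2])
print(x, y)
```

The two arrays can then be passed to matplotlib, for instance `plt.step(x, y, where="post")`, to draw the staircase.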

How do I calculate percentiles with python/numpy?

余生长醉 posted on 2019-12-17 04:41:48
Question: Is there a convenient way to calculate percentiles for a sequence or single-dimensional numpy array? I am looking for something similar to Excel's percentile function. I looked in NumPy's statistics reference, and couldn't find this. All I could find is the median (50th percentile), but not something more specific. Answer 1: You might be interested in the SciPy Stats package. It has the percentile function you're after and many other statistical goodies. percentile() is available in numpy too.
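A short sketch of the numpy route mentioned in the answer, using made-up data. `np.percentile` takes the percentile on a 0 to 100 scale and, by default, interpolates linearly between neighbouring sorted values.

```python
import numpy as np

# Made-up data for illustration: the integers 1..10.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

p50 = np.percentile(data, 50)             # same as the median
p25, p90 = np.percentile(data, [25, 90])  # several percentiles in one call
print(p25, p50, p90)
```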

Nth Combination

喜夏-厌秋 posted on 2019-12-17 04:00:53
Question: Is there a direct way of getting the Nth combination of an ordered set of all combinations of nCr? Example: I have four elements: [6, 4, 2, 1]. All the possible combinations by taking three at a time would be: [[6, 4, 2], [6, 4, 1], [6, 2, 1], [4, 2, 1]]. Is there an algorithm that would give me e.g. the 3rd answer, [6, 2, 1], in the ordered result set, without enumerating all the previous answers? Answer 1: Note you can generate the sequence by recursively generating all combinations with the
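One way to reach the Nth combination without enumerating its predecessors, given here as a sketch and not necessarily the approach the truncated answer goes on to describe, is to decide element by element whether it belongs to the result by counting how many combinations begin with it. The function name `nth_combination` and the 0-based index are assumptions.

```python
from math import comb

def nth_combination(items, r, n):
    """Return the n-th (0-based) r-combination of items, in the order
    produced by choosing positions left to right.

    Hypothetical helper; raises IndexError if n is out of range.
    """
    if not 0 <= n < comb(len(items), r):
        raise IndexError("combination index out of range")
    result = []
    pos = 0
    while r > 0:
        # Number of combinations that include items[pos] as the next element.
        with_current = comb(len(items) - pos - 1, r - 1)
        if n < with_current:
            result.append(items[pos])   # the target lies in this block
            r -= 1
        else:
            n -= with_current           # skip the whole block
        pos += 1
    return result

# The question's example: the 3rd combination (index 2) of [6, 4, 2, 1] taken 3 at a time.
third = nth_combination([6, 4, 2, 1], 3, 2)
print(third)
```

Each step costs one binomial coefficient, so the work grows with the number of elements instead of with the number of combinations.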

Calculating Pearson correlation and significance in Python

你说的曾经没有我的故事 posted on 2019-12-17 02:53:04
Question: I am looking for a function that takes as input two lists, and returns the Pearson correlation, and the significance of the correlation. Answer 1: You can have a look at scipy.stats: from pydoc import help from scipy.stats.stats import pearsonr help(pearsonr) >>> Help on function pearsonr in module scipy.stats.stats: pearsonr(x, y) Calculates a Pearson correlation coefficient and the p-value for testing non-correlation. The Pearson correlation coefficient measures the linear relationship between
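A minimal usage sketch of the function named in the answer, with made-up input lists. It imports `pearsonr` from the public `scipy.stats` namespace instead of `scipy.stats.stats`, which is an assumption about the installed SciPy version.

```python
from scipy.stats import pearsonr

# Made-up data for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

# pearsonr returns the correlation coefficient and the two-sided p-value
# for the null hypothesis that the two variables are uncorrelated.
r, p_value = pearsonr(x, y)
print(r, p_value)
```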

Multiple linear regression in Python

走远了吗. posted on 2019-12-17 02:39:07
Question: I can't seem to find any Python libraries that do multiple regression. Everything I find does only simple regression. I need to regress my dependent variable (y) against several independent variables (x1, x2, x3, etc.). For example, with this data: print 'y x1 x2 x3 x4 x5 x6 x7' for t in texts: print "{:>7.1f}{:>10.2f}{:>9.2f}{:>9.2f}{:>10.2f}{:>7.2f}{:>7.2f}{:>9.2f}".format(t.y,t.x1,t.x2,t.x3,t.x4,t.x5,t.x6,t.x7) (output for above:) y x1 x2 x3 x4 x5 x6 x7 -6.0 -4.95 -5.87 -0.76 14.73 4
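As one possible starting point, the sketch below fits an ordinary least squares model with several predictors using only `np.linalg.lstsq`. The helper name `fit_multiple_regression` and the small synthetic data set are assumptions made for illustration; they are not the question's data.

```python
import numpy as np

def fit_multiple_regression(X, y):
    """Return [intercept, b1, b2, ...] minimising ||y - (b0 + X @ b)||^2.

    Hypothetical helper built on ordinary least squares.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    design = np.column_stack([np.ones(len(X)), X])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs

# Synthetic data generated from y = 1 + 2*x1 - 3*x2 (no noise).
X_demo = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y_demo = [1, 3, -2, 0, 2]
coeffs = fit_multiple_regression(X_demo, y_demo)
print(coeffs)
```

This returns coefficients only; standard errors, p-values, and other diagnostics would need a dedicated statistics library.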

Browser statistics on JavaScript disabled [closed]

折月煮酒 posted on 2019-12-17 00:00:11
Question: Closed. This question is off-topic. It is not currently accepting answers. Want to improve this question? Update the question so it's on-topic for Stack Overflow. Closed 3 years ago. I am having a hard time collecting publicly available statistics on the percentage of web users that browse with JavaScript disabled. Yahoo has published data from 2010 and R. Reid published data from 2009 (picked from a site he had access to). The findings from Yahoo were rather interesting at that time: We