Force R to plot histogram as probability (relative frequency)

巧了我就是萌 提交于 2019-12-20 09:02:54

问题


I am having trouble plotting a histogram as a pdf (probability)

I want the sum of all the pieces to equal an area of one so it's easier to compare across datasets. For some reason, whenever I specify the breaks (the default of 4 or whatever is terrible), it no longer wants to plot bins as a probability and instead plots bins as a frequency count.

hist(data[,1], freq = FALSE, xlim = c(-1,1), breaks = 800)

What should I change this line to? I need a probability distribution and a large number of bins. (I have 6 million data points)

This is in the R help, but I don't know how to override it:

freq logical; if TRUE, the histogram graphic is a representation of frequencies, the counts component of the result; if FALSE, probability densities, component density, are plotted (so that the histogram has a total area of one). Defaults to TRUE if and only if breaks are equidistant (and probability is not specified).

Thanks

edit: details

hmm so my plot goes above 1 which is quite confusing if it's a probability. I see how it has to do with the bin width now. I more or less want to make every bin worth 1 point while still having a lot of bins. In other words, no bin height should be above 1.0 unless it is directly at 1.0 and all the other bins are 0.0. As it stands now, I have a bins that make a hump around 15.0

edit: height by %points in bin @Dwin : So how do I plot the probability? I realize taking the integral will still give me 1.0 due to the units on the x axis, but this isn't what I want. Say I have 100 points and 5 of them fall into the first bin, then that bin should be at .05 height. This is what I want. Am I doing it wrong and there is another way this is done?

I know how many points I have. Is there a way to divide each bin count in the frequency histogram by this number?


回答1:


To answer the request to plot probabilities rather than densities:

h <- hist(vec, breaks = 100, plot=FALSE)
h$counts=h$counts/sum(h$counts)
plot(h)



回答2:


Are you sure? This is working for me:

> vec <- rnorm(6000000)
> 
> h <- hist(vec, breaks = 800, freq = FALSE)
> sum(h$density)
[1] 100
> unique(zapsmall(diff(h$breaks)))
[1] 0.01

Multiply the last two results and you get a probability density sum of 1. Remember that the bin width is important here.

This is with

> sessionInfo()
R version 3.0.1 RC (2013-05-11 r62732)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.0.1



回答3:


The default number of breaks is around log2(N) where N is 6 million in your case, so should be 22. If you're only seeing 4 breaks, that could be because you have xlim in your call. This doesn't change the underlying histogram, it only affects which part of it is plotted. If you do

h <- hist(data[,1], freq=FALSE, breaks=800)
sum(h$density * diff(h$breaks))

you should get a result of 1.


The density of your data is related to its units of measurement; therefore you want to make sure that "no bin height should be above 1.0" is actually meaningful. For example, suppose we have a bunch of measurements in feet. We plot the histogram of the measurements as a density. We then convert all the measurements to inches (by multiplying by 12) and do another density-histogram. The height of the density will be 1/12th of the original even though the data is essentially the same. Similarly, you could make your bin heights all less than 1 by multiplying all your numbers by 15.

Does the value 1.0 have some kind of significance?




回答4:


I observed that, in histogram density = relative frequency / corresponding bin width

Example 1:

nums = c(10, 41, 10, 28, 22, 8, 31, 3, 9, 9)

h2 = hist(nums, plot=F)

rf2 = h2$counts / sum(h2$counts)

d2 = rf2 / diff(h2$breaks)

h2$density

[1] 0.06 0.00 0.02 0.01 0.01

d2

[1] 0.06 0.00 0.02 0.01 0.01

Example 2:

nums = c(10, 41, 10, 28, 22, 8, 31, 3, 9, 9)

h3 = hist(nums, plot=F, breaks=c(1,30,40,50))

rf3 = h3$counts / sum(h3$counts)

d3 = rf3 / diff(h3$breaks)

h3$density

[1] 0.02758621 0.01000000 0.01000000

d3

[1] 0.02758621 0.01000000 0.01000000




回答5:


R has a bug or something. If you have discrete data in a data.frame (with 1 column), and call hist(DF,freq=FALSE) on it, the relative densities will be wrong (summing to >1). This shouldn't happen as far as I can tell.

The solution is to call unlist() on the object first. This fixes the plot.

(I changed the text too, data from http://www.electionstudies.org/studypages/anes_timeseries_2012/anes_timeseries_2012.htm)

来源:https://stackoverflow.com/questions/17416453/force-r-to-plot-histogram-as-probability-relative-frequency

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!