Question
I am having trouble plotting a histogram as a PDF (probability).
I want the sum of all the pieces to equal an area of one so it's easier to compare across datasets. For some reason, whenever I specify the breaks (the default of 4 or whatever is terrible), it no longer wants to plot bins as a probability and instead plots bins as a frequency count.
hist(data[,1], freq = FALSE, xlim = c(-1,1), breaks = 800)
What should I change this line to? I need a probability distribution and a large number of bins. (I have 6 million data points)
This is in the R help, but I don't know how to override it:
freq logical; if TRUE, the histogram graphic is a representation of frequencies, the counts component of the result; if FALSE, probability densities, component density, are plotted (so that the histogram has a total area of one). Defaults to TRUE if and only if breaks are equidistant (and probability is not specified).
Thanks
edit: details
Hmm, so my plot goes above 1, which is quite confusing if it's a probability. I see how it has to do with the bin width now. I more or less want to make every bin worth 1 point while still having a lot of bins. In other words, no bin height should be above 1.0 unless it is directly at 1.0 and all the other bins are 0.0. As it stands now, I have bins that form a hump around 15.0.
edit: height by %points in bin @Dwin : So how do I plot the probability? I realize taking the integral will still give me 1.0 due to the units on the x axis, but this isn't what I want. Say I have 100 points and 5 of them fall into the first bin, then that bin should be at .05 height. This is what I want. Am I doing it wrong and there is another way this is done?
I know how many points I have. Is there a way to divide each bin count in the frequency histogram by this number?
Answer 1:
To answer the request to plot probabilities rather than densities:
h <- hist(vec, breaks = 100, plot=FALSE)
h$counts <- h$counts / sum(h$counts)
plot(h)
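For a runnable version of the above, here is a minimal sketch on simulated data (the `vec` below is made up for illustration, it is not the asker's data). It passes `freq = TRUE` explicitly so that `plot()` draws the rescaled `counts` rather than the `density` component, and relabels the y axis:

```r
set.seed(42)
vec <- rnorm(10000)                      # simulated stand-in for the real data

h <- hist(vec, breaks = 100, plot = FALSE)
h$counts <- h$counts / sum(h$counts)     # counts -> fraction of points per bin

# freq = TRUE makes plot() draw h$counts (now relative frequencies)
plot(h, freq = TRUE, ylab = "Relative frequency",
     main = "Relative frequency of vec")

sum(h$counts)   # should be 1, up to floating-point rounding
max(h$counts)   # no bar can exceed 1
```

With this scaling a bin holding 5 of 100 points has height 0.05, which is what the question asks for.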
Answer 2:
Are you sure? This is working for me:
> vec <- rnorm(6000000)
>
> h <- hist(vec, breaks = 800, freq = FALSE)
> sum(h$density)
[1] 100
> unique(zapsmall(diff(h$breaks)))
[1] 0.01
Multiply the last two results and you get a probability density sum of 1. Remember that the bin width is important here.
This is with
> sessionInfo()
R version 3.0.1 RC (2013-05-11 r62732)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.0.1
Answer 3:
The default number of breaks is around log2(N) + 1 (Sturges' rule), where N is 6 million in your case, so it should be about 24. If you're only seeing 4 breaks, that could be because you have xlim in your call. This doesn't change the underlying histogram; it only affects which part of it is plotted. If you do
h <- hist(data[,1], freq=FALSE, breaks=800)
sum(h$density * diff(h$breaks))
you should get a result of 1.
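Both points can be checked on simulated data. This is only a sketch: `x` is a made-up sample, and `pdf(NULL)` is used just so the plots are not written to a file.

```r
set.seed(1)
x <- rnorm(100000)                       # simulated data, not the asker's

pdf(NULL)                                # null device: nothing is written to disk
h_full <- hist(x, breaks = 800, freq = FALSE)
h_zoom <- hist(x, breaks = 800, freq = FALSE, xlim = c(-1, 1))
invisible(dev.off())

# xlim only changes the visible window, not the computed bins
identical(h_full$breaks, h_zoom$breaks)
identical(h_full$counts, h_zoom$counts)

# total area = sum of (bar height * bar width)
area <- sum(h_full$density * diff(h_full$breaks))
area                                     # 1, up to floating-point rounding
```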
The density of your data is related to its units of measurement; therefore you want to make sure that "no bin height should be above 1.0" is actually meaningful. For example, suppose we have a bunch of measurements in feet. We plot the histogram of the measurements as a density. We then convert all the measurements to inches (by multiplying by 12) and do another density-histogram. The height of the density will be 1/12th of the original even though the data is essentially the same. Similarly, you could make your bin heights all less than 1 by multiplying all your numbers by 15.
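The feet-versus-inches point can be illustrated with a small sketch. The measurements are invented, and the breaks are fixed by hand so that the bins in inches are exactly the bins in feet scaled by 12:

```r
set.seed(7)
feet   <- runif(1000, min = 4, max = 7)  # made-up measurements in feet
inches <- feet * 12                      # the same measurements in inches

breaks_ft <- seq(4, 7, by = 0.25)        # 12 bins, each 0.25 ft wide
h_ft <- hist(feet,   breaks = breaks_ft,      plot = FALSE)
h_in <- hist(inches, breaks = breaks_ft * 12, plot = FALSE)

# bar heights shrink by a factor of 12 ...
all.equal(h_in$density, h_ft$density / 12)

# ... but the total area stays 1 in both units
sum(h_ft$density * diff(h_ft$breaks))
sum(h_in$density * diff(h_in$breaks))
```

So whether a density bar is above or below 1.0 depends on the units of the x axis, not on the data itself.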
Does the value 1.0 have some kind of significance?
Answer 4:
I observed that, in a histogram, density = relative frequency / corresponding bin width.
Example 1:
nums = c(10, 41, 10, 28, 22, 8, 31, 3, 9, 9)
h2 = hist(nums, plot=FALSE)
rf2 = h2$counts / sum(h2$counts)
d2 = rf2 / diff(h2$breaks)
h2$density
[1] 0.06 0.00 0.02 0.01 0.01
d2
[1] 0.06 0.00 0.02 0.01 0.01
Example 2:
nums = c(10, 41, 10, 28, 22, 8, 31, 3, 9, 9)
h3 = hist(nums, plot=FALSE, breaks=c(1,30,40,50))
rf3 = h3$counts / sum(h3$counts)
d3 = rf3 / diff(h3$breaks)
h3$density
[1] 0.02758621 0.01000000 0.01000000
d3
[1] 0.02758621 0.01000000 0.01000000
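Read the other way round, the same relation recovers the per-bin probabilities the question asks for: relative frequency = density * bin width. A short sketch reusing the numbers from Example 2, where the bins have unequal widths:

```r
nums <- c(10, 41, 10, 28, 22, 8, 31, 3, 9, 9)
h <- hist(nums, breaks = c(1, 30, 40, 50), plot = FALSE)

rel_freq <- h$density * diff(h$breaks)   # density * bin width
rel_freq                                 # 0.8 0.1 0.1
sum(rel_freq)                            # 1
```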
Answer 5:
R has a bug or something. If you have discrete data in a data.frame (with 1 column), and call hist(DF,freq=FALSE) on it, the relative densities will be wrong (summing to >1). This shouldn't happen as far as I can tell.
The solution is to call unlist() on the object first. This fixes the plot.
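A small sketch of the unlist() step, using a hypothetical one-column data frame `DF` (the values are invented). It also prints both the raw sum of the densities and the area, since the area (density times bin width) is the quantity expected to equal 1; the raw sum can be larger when the bins are narrower than one unit:

```r
DF <- data.frame(v = c(1, 2, 2, 3, 3, 3))   # hypothetical one-column data frame

x <- unlist(DF)                  # flatten the data frame to a numeric vector
h <- hist(x, plot = FALSE)

sum(h$density)                   # can exceed 1 when bins are narrower than 1
sum(h$density * diff(h$breaks))  # the area, which should be 1
```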


Source: https://stackoverflow.com/questions/17416453/force-r-to-plot-histogram-as-probability-relative-frequency