SuperImpose Histogram fits in one plot ggplot

Submitted by 二次信任 on 2019-11-29 04:13:00

Question


I have about 5 very large vectors (~108 million entries each), so any plotting or other processing I do with them in R takes quite a long time.

I am trying to visualize their distributions as histograms, and was wondering what the best way would be to superimpose the histograms in R without it taking too long. I am thinking of first fitting a distribution to each histogram, and then plotting all the fitted lines together in one plot.

Do you have some suggestions on how to do that?

Let us say my vectors are:

x1, x2, x3, x4, x5.

I am trying to use the code from the question "Overlaying histograms with ggplot2 in R".

Example of the code I am using for 3 vectors (R fails to do the plot):

n = length(x1)
dat <- data.frame(xx = c(x1, x2, x3),yy = rep(letters[1:3],each = n))
ggplot(dat,aes(x=xx)) + 
    geom_histogram(data=subset(dat,yy == 'a'),fill = "red", alpha = 0.2) +
    geom_histogram(data=subset(dat,yy == 'b'),fill = "blue", alpha = 0.2) +
    geom_histogram(data=subset(dat,yy == 'c'),fill = "green", alpha = 0.2)

But it takes forever to produce the plot, and eventually it kicks me out of R. Any ideas on how to use ggplot2 efficiently for large vectors? It seems that I have to create a data frame of 5 * 108 MM entries and then plot it, which is highly inefficient in my case.

Thanks!


Answer 1:


Here's a little snippet of Rcpp that bins data very efficiently - on my computer it takes about a second to bin 100,000,000 observations:

library(Rcpp)
cppFunction('
  std::vector<int> bin3(NumericVector x, double width, double origin = 0) {
    int bin, nmissing = 0;
    std::vector<int> out;

    NumericVector::iterator x_it = x.begin();
    for(; x_it != x.end(); ++x_it) {
      double val = *x_it;
      if (ISNAN(val)) {
        ++nmissing;
      } else {
        bin = (val - origin) / width;
        if (bin < 0) continue;

        // Make sure there\'s enough space
        if (bin >= out.size()) {
          out.resize(bin + 1);
        }
        ++out[bin];
      }
    }

    // Put missing values in the last position
    out.push_back(nmissing);
    return out;
  }
')

x8 <- runif(1e8)
system.time(bin3(x8, 1/100))
#   user  system elapsed 
#  1.373   0.000   1.373 
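For readers less familiar with C++, the binning logic above can be sketched in plain R. This is only an illustration of the idea, not a replacement for bin3: the name bin_r is hypothetical, it is much slower on large vectors, and it is not guaranteed to match bin3 in every edge case (for example, an input with no finite values).

```r
# Pure-R sketch of the binning idea behind bin3 (bin_r is a hypothetical name).
bin_r <- function(x, width, origin = 0) {
  nmissing <- sum(is.na(x))        # is.na() is TRUE for both NA and NaN
  x <- x[!is.na(x)]
  # Zero-based bin index; as.integer() truncates toward zero, like the C++ int cast.
  # Because of that truncation, values less than one width below the origin
  # still land in bin 0 rather than being dropped.
  bin <- as.integer((x - origin) / width)
  bin <- bin[bin >= 0]             # negative indices are skipped
  # tabulate() counts the integers 1..max, so shift the zero-based index by one;
  # the missing-value count goes in the last position, as in bin3
  c(tabulate(bin + 1L), nmissing)
}

x <- c(0.005, 0.015, 0.017, 0.031, NA)
bin_r(x, width = 0.01)
# 1 2 0 1 1
```

The first four numbers are the counts for the bins starting at 0, 0.01, 0.02 and 0.03; the final number is the count of missing values.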

That said, hist is pretty fast here too:

system.time(hist(x8, breaks = 100, plot = FALSE))
#   user  system elapsed 
#  7.281   1.362   8.669 
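If Rcpp is not an option, the same pre-binning approach should work with hist alone. The following is a minimal sketch under some assumptions: the vectors are simulated stand-ins for x1, x2, ... from the question, the names xs, binned and series are made up for illustration, and densities are plotted instead of counts so that vectors of different lengths stay comparable.

```r
library(ggplot2)

# Simulated stand-ins for the asker's vectors (much smaller than the real data)
set.seed(1)
xs <- list(x1 = rnorm(1e5), x2 = rnorm(1e5, mean = 1), x3 = rnorm(1e5, mean = 2))

# Shared, equally spaced breaks covering all vectors, so every vector is
# binned on the same grid (pretty() returns roughly, not exactly, 100 bins)
rng <- range(unlist(lapply(xs, range)))
breaks <- pretty(rng, n = 100)

# Bin each vector once; only one row per bin and vector reaches ggplot2
binned <- do.call(rbind, lapply(names(xs), function(nm) {
  h <- hist(xs[[nm]], breaks = breaks, plot = FALSE)
  data.frame(series = nm, mid = h$mids, density = h$density)
}))

# Overlay the binned distributions as lines (a frequency polygon per vector)
p <- ggplot(binned, aes(mid, density, colour = series)) + geom_line()
print(p)
```

The point of the design is that ggplot2 never sees the raw data: the data frame has a few hundred rows regardless of how long the input vectors are.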

It's straightforward to use bin3 to make a histogram or frequency polygon:

# First we create some sample data, and bin each column

library(reshape2)
library(ggplot2)

df <- as.data.frame(replicate(5, runif(1e6)))
bins <- vapply(df, bin3, width = 1/100, FUN.VALUE = integer(100 + 1))

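# Next we match up the bins with the breaks (the left edge of each bin);
# the final NA lines up with the missing-value count that bin3 appends
binsdf <- data.frame(
  breaks = c(seq(0, by = 1/100, length.out = 100), NA),
  bins)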

# Then melt and plot
binsm <- subset(melt(binsdf, id = "breaks"), !is.na(breaks))
ggplot(binsm, aes(breaks, value, colour = variable)) + geom_line()

FYI, the reason I had bin3 on hand is that I'm working on how to make this speed the default in ggplot2 :)



Source: https://stackoverflow.com/questions/13661065/superimpose-histogram-fits-in-one-plot-ggplot
