SuperImpose Histogram fits in one plot ggplot

问题

I have ~ 5 very large vectors (~ 108 MM entries) so any plot/stuff I do with them in R takes quite long time.

I am trying to visualize their distribution (histogram), and was wondering what would be the best way to superimpose their histogram distributions in R without taking too long. I am thinking to first fit a distribution to the histogram, and then plot all the distribution line fits together in one plot.

Do you have some suggestions on how to do that?

Let us say my vectors are:

x1, x2, x3, x4, x5.

I am trying to use this code: Overlaying histograms with ggplot2 in R

Example of the code I am using for 3 vectors (R fails to do the plot):

n = length(x1)
dat <- data.frame(xx = c(x1, x2, x3),yy = rep(letters[1:3],each = n))
ggplot(dat,aes(x=xx)) + 
    geom_histogram(data=subset(dat,yy == 'a'),fill = "red", alpha = 0.2) +
    geom_histogram(data=subset(dat,yy == 'b'),fill = "blue", alpha = 0.2) +
    geom_histogram(data=subset(dat,yy == 'c'),fill = "green", alpha = 0.2)

but it takes forever to produce the plot, and eventually it kicks me out of R. Any ideas on how to use ggplot2 efficiently for large vectors? Seems to me that I had to create a dataframe, of 5*108MM entries and then plot, highly inefficient in my case.

Thanks!

回答1:

Here's a little snippet of Rcpp that bins data very efficiently - on my computer it takes about a second to bin 100,000,000 observations:

library(Rcpp)
cppFunction('
  std::vector<int> bin3(NumericVector x, double width, double origin = 0) {
    int bin, nmissing = 0;
    std::vector<int> out;

    NumericVector::iterator x_it = x.begin(), x_end;
    for(; x_it != x.end(); ++x_it) {
      double val = *x_it;
      if (ISNAN(val)) {
        ++nmissing;
      } else {
        bin = (val - origin) / width;
        if (bin < 0) continue;

        // Make sure there\'s enough space
        if (bin >= out.size()) {
          out.resize(bin + 1);
        }
        ++out[bin];
      }
    }

    // Put missing values in the last position
    out.push_back(nmissing);
    return out;
  }
')

x8 <- runif(1e8)
system.time(bin3(x8, 1/100))
#   user  system elapsed 
#  1.373   0.000   1.373

That said, hist is pretty fast here too:

system.time(hist(x8, breaks = 100, plot = F))
#   user  system elapsed 
#  7.281   1.362   8.669

It's straightforward to use bin3 to make a histogram or frequency polygon:

# First we create some sample data, and bin each column

library(reshape2)
library(ggplot2)

df <- as.data.frame(replicate(5, runif(1e6)))
bins <- vapply(df, bin3, 1/100, FUN.VALUE = integer(100 + 1))

# Next we match up the bins with the breaks
binsdf <- data.frame(
  breaks = c(seq(0, 1, length = 100), NA),
  bins)

# Then melt and plot
binsm <- subset(melt(binsdf, id = "breaks"), !is.na(breaks))
qplot(breaks, value, data = binsm, geom = "line", colour = variable)

FYI, the reason I had bin3 on hand is that I'm working on how to make this speed the default in ggplot2 :)

来源：https://stackoverflow.com/questions/13661065/superimpose-histogram-fits-in-one-plot-ggplot

标签

plot

histogram

ggplot2