问题
I have a set of data looks like this
CHROM POS GT DIFF
1 chr01 14653 CT 254
2 chr01 14907 AG 254
3 chr01 14930 AG 23
4 chr01 15190 GA 260
5 chr01 15211 TG 21
6 chr01 16378 TC 1167
Where POS range from 1xxxx to 1xxxxxxx. And CHROM is a categorical variable that contains values of "chr01" to "chr22" and "chrX".
I want to plot a scatterplot:
- y(DIFF) vs. X(POS)
- having panels separated by CHROM
- grouped by GT (different colors by GT)
I'm creating a ggplot with running average (though not time series data).
What I want is to get average for every 1,000,000 range of POS by GT.
For example,
for x in range(1 ~ 1,000,000) , DIFF average = _____
for x in range(1,000,001 ~ 2,000,000), DIFF average = _____
and I want to plot horizontal lines on the ggplot (coloured by GT).
#What I have so far before apply your function:

After apply your function:

I tried to apply your solution to what I already have, here are some problems:
- There are different panels, so the mean values are different for different panel, but when I apply your code, the horizontal mean lines are all identical to the first panel.
- I'm having different ranges for x-axis, so when apply your function, it automatically fills out the extra range with the previous horizontal mean line
Here is my code before:
ggplot(data1, aes(x=POS,y=DIFF,colour=GT)) +
geom_point() +
facet_grid(~ CHROM,scales="free_x",space="free_x") +
theme(strip.text.x = element_text(size=40),
strip.background = element_rect(color='lightblue',fill='lightblue'),
legend.position="top",
legend.title = element_text(size=40,colour="darkblue"),
legend.text = element_text(size=40),
legend.key.size = unit(2.5, "cm")) +
guides(fill = guide_legend(title.position="top",
title = "Legend:GT='REF'+'ALT'"),
shape = guide_legend(override.aes=list(size=10))) +
scale_y_log10(breaks=trans_breaks("log10", function(x) 10^x, n=10)) +
scale_x_continuous(breaks = pretty_breaks(n=3))
回答1:
This was tougher than I expected! This should at least get you started, though:
# It saves a lot of headaches to just make factors as you need them
options(stringsAsFactors = FALSE)
library(ggplot2)
library(plyr)
# Here's some made-up data - it always helps if you can post a subset of
# your real data, though. The dput() function is really useful for that.
dat <- data.frame(POS = seq(1, 1e7, by = 1e4))
# Add random GT value
dat$GT <- sample(x = c("CT", "AG", "GA", "TG", "TC"),
size = nrow(dat),
replace = TRUE)
# Group by millions - there are several ways to do this that I can
# never remember, but here's a simple way to split by millions
dat$POSgroup <- floor(dat$POS / 1e6)
# Add an arbitrary DIFF value
dat$DIFF <- rnorm(n = nrow(dat),
mean = 200 * dat$POSgroup,
sd = 300)
# Aggregate the data by GT and POS-group
# Ideally, you'd do this inside of the plot using stat_summary,
# but I couldn't get that to work. Using two datasets in a plot
# is okay, though.
datsum <- ddply(dat, .var = "POSgroup", .fun = function(x) {
# Calculate the mean DIFF value for each GT group in this POSgroup
meandiff <- ddply(x, .var = "GT", .fun = summarise, ymean = mean(DIFF))
# Add the center of the POSgroup range as the x position
meandiff$center <- (x$POSgroup[1] * 1e6) + 0.5e6
# Return the results
meandiff
})
# On the plot, these results will be grouped by both POS and GT - but
# ggplot will only accept one vector for grouping. So make a combination.
datsum$combogroup <- paste(datsum$GT, datsum$POSgroup)
# Plot it
ggplot() +
# First, a layer for the points themselves
# Large numbers of points can get pretty slow - you might try getting
# the plot to work with a subsample (~1000) and then add in the rest of
# your data
geom_point(data = dat,
aes(x = POS, y = DIFF, color = as.factor(GT))) +
# Then another layer for the means. There are a variety of geoms you could
# use here, but crossbar with ymin and ymax set to the group mean
# is a simple one
geom_crossbar(data = datsum, aes(x = center,
y = ymean,
ymin = ..y..,
ymax = ..y..,
color = as.factor(GT),
group = combogroup),
size = 1) +
# Some other niceties
scale_x_continuous(breaks = seq(0, 1e7, by = 1e6)) +
labs(x = "POS", y = "DIFF", color = "GT") +
theme_bw()
Which results in this:

There's probably a more straightforward way to do this, but I don't know it. Hope this helps.
来源:https://stackoverflow.com/questions/21532790/r-ggplot-separate-mean-by-range-of-x-value