Question
I have daily rainfall data from 10 locations across 10 years:
set.seed(123)
df <- data.frame(loc.id = rep(1:10, each = 10*365),
                 years = rep(rep(2001:2010, each = 365), times = 10),
                 day = rep(rep(1:365, times = 10), times = 10),
                 rain = runif(10*10*365, min = 0, max = 35))
I have a separate data frame with day indices that define the windows over which I want to sum the rainfall in df:
df.ref <- data.frame(loc.id = rep(1:10, each = 10),
years = rep(2001:2010,times = 10),
index1 = rep(250,times = 10*10),
index2 = sample(260:270, size = 10*10,replace = T),
index3 = sample(280:290, size = 10*10,replace = T),
index4 = sample(291:300, size= 10*10,replace = T))
df.ref
loc.id years index1 index2 index3 index4
1: 1 2001 250 264 280 296
2: 1 2002 250 269 284 298
3: 1 2003 250 268 289 293
4: 1 2004 250 266 281 295
5: 1 2005 250 260 289 293
What I want to do is, for each row in df.ref, use its index values to
sum the rainfall in df from index1 to index2, from index1 to index3, and from index1 to index4. For example:
Using df.ref, for loc.id == 1 and years == 2001, sum the rainfall in df from day 250 to 264, 250 to 280, and 250 to 296 (as shown in df.ref).
Similarly, for loc.id == 1 and years == 2002, sum the rainfall from day 250 to 269, 250 to 284, and 250 to 298.
I did this:
library(dplyr)
ptm <- proc.time()
dat <- df.ref %>% left_join(df)
index1.cal <- dat %>% group_by(loc.id,years) %>% filter(day >= index1 & day <= index2) %>% summarise(sum.rain1 = sum(rain))
index2.cal <- dat %>% group_by(loc.id,years) %>% filter(day >= index1 & day <= index3) %>% summarise(sum.rain2 = sum(rain))
index3.cal <- dat %>% group_by(loc.id,years) %>% filter(day >= index1 & day <= index4) %>% summarise(sum.rain3 = sum(rain))
all.index <- index1.cal %>% left_join(index2.cal) %>% left_join(index3.cal)
proc.time() - ptm
user system elapsed
2.36 0.64 3.06
I am looking to make my code faster since my actual df.ref is quite large. Could anyone advise me on how to make this quicker?
Answer 1:
A non-equi join from the data.table package can be both faster and more memory-efficient than dplyr::left_join (slide | video).
For each row in df.ref, find all the rain values in df whose day falls between index1 and index2, then calculate the sum of rain by loc.id and years.
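As a toy illustration of the non-equi join semantics before applying them to the question's data (made-up tables, not from the question): in `X[Y, on = ...]`, each row of Y is looked up in X, and inequality conditions in `on` select a range of X rows per Y row.

```r
library(data.table)
X <- data.table(id = c(1, 1, 1), day = c(5, 10, 15), val = c(2, 3, 4))
Y <- data.table(id = 1, lo = 6, hi = 16)

# For each row of Y, find rows of X with lo <= day <= hi,
# then sum val within each matched group (by = .EACHI)
X[Y, on = .(id, day >= lo, day <= hi), .(total = sum(val)), by = .EACHI]
# days 10 and 15 fall in [6, 16], so total = 3 + 4 = 7
```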
library(data.table)
setDT(df); setDT(df.ref)
df1 <- unique(df[df.ref,
                 .(rain),
                 on = .(loc.id, years, day >= index1, day <= index2),
                 by = .EACHI][
                 , sum_1 := sum(rain), by = .(loc.id, years)][
                 # remove all redundant columns
                 # (the two join-boundary columns are both named day)
                 , day := NULL][
                 , day := NULL][
                 , rain := NULL])
df2 <- unique(df[df.ref,
                 .(rain),
                 on = .(loc.id, years, day >= index1, day <= index3),
                 by = .EACHI][
                 , sum_2 := sum(rain), by = .(loc.id, years)][
                 , day := NULL][
                 , day := NULL][
                 , rain := NULL])
df3 <- unique(df[df.ref,
                 .(rain),
                 on = .(loc.id, years, day >= index1, day <= index4),
                 by = .EACHI][
                 , sum_3 := sum(rain), by = .(loc.id, years)][
                 , day := NULL][
                 , day := NULL][
                 , rain := NULL])
Merge all three data.tables together:
df1[df2, on = .(loc.id, years)][
df3, on = .(loc.id, years)]
loc.id years sum_1 sum_2 sum_3
1: 1 1950 104159.11 222345.4 271587.1
2: 1 1951 118689.90 257450.2 347624.3
3: 1 1952 99262.27 212923.7 280877.6
4: 1 1953 72435.50 192072.7 251593.6
5: 1 1954 104021.19 242525.3 326463.4
6: 1 1955 93436.32 232653.1 304921.4
7: 1 1956 89122.79 190424.4 255535.0
8: 1 1957 135658.11 262918.7 346361.4
9: 1 1958 80064.18 220454.8 292966.4
10: 1 1959 114231.19 273181.0 349489.2
11: 2 1950 94360.69 238296.8 301751.8
12: 2 1951 93845.50 195273.7 289686.0
13: 2 1952 107692.53 245019.4 308093.7
14: 2 1953 86650.14 257225.1 332674.1
15: 2 1954 104085.83 238859.4 286350.7
16: 2 1955 101602.16 223107.3 300958.4
17: 2 1956 73912.77 198087.2 276590.1
18: 2 1957 117780.86 228299.8 305348.5
19: 2 1958 98625.45 220902.6 291583.7
20: 2 1959 109851.38 266745.2 324246.8
[ reached getOption("max.print") -- omitted 81 rows ]
Compare processing time and memory used
> time_dplyr; time_datatable
user system elapsed
2.17 0.27 2.61
user system elapsed
0.45 0.00 0.69
rowname Class MB
1 dat data.frame 508
2 df3 data.table 26
3 df2 data.table 20
4 df1 data.table 9
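The answer doesn't show how the MB column above was computed; a hedged reconstruction using base R's object.size() (the exact helper used is an assumption) might look like:

```r
# Assumption: the per-object sizes were gathered with something like
# object.size() from base R; the answer's actual helper is not shown.
objs <- list(dat = dat, df1 = df1, df2 = df2, df3 = df3)
mb <- sapply(objs, function(x) round(as.numeric(object.size(x)) / 1024^2))
sort(mb, decreasing = TRUE)
```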
When testing with about 100 years of data, dplyr used more than 50 GB of memory while data.table consumed only 5 GB; dplyr also took about 4 times longer to finish. The structure of df and df.ref for that test:
'data.frame': 3650000 obs. of 4 variables:
$ loc.id: int 1 1 1 1 1 1 1 1 1 1 ...
$ years : int 1860 1860 1860 1860 1860 1860 1860 1860 1860 1860 ...
$ day : int 1 2 3 4 5 6 7 8 9 10 ...
$ rain : num 10.1 27.6 14.3 30.9 32.9 ...
'data.frame': 3650000 obs. of 6 variables:
$ loc.id: int 1 1 1 1 1 1 1 1 1 1 ...
$ years : int 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 ...
$ index1: num 250 250 250 250 250 250 250 250 250 250 ...
$ index2: int 270 265 262 267 266 265 262 268 260 268 ...
$ index3: int 290 287 286 289 281 285 286 285 284 283 ...
$ index4: int 298 297 296 295 298 294 296 298 298 300 ...
> time_dplyr; time_datatable
user system elapsed
95.010 33.704 128.722
user system elapsed
26.175 3.147 29.312
rowname Class MB
1 dat data.frame 50821
2 df3 data.table 2588
3 df2 data.table 2004
4 df1 data.table 888
5 df.ref data.table 97
6 df data.table 70
When I increased the number of years to 150, dplyr failed even on an HPC cluster node with 256 GB of RAM:
Error in left_join_impl(x, y, by_x, by_y, aux_x, aux_y, na_matches) :
negative length vectors are not allowed
Calls: %>% ... left_join -> left_join.tbl_df -> left_join_impl -> .Call
Execution halted
Answer 2:
Here's a starting point that will be much faster. Figuring out the rest should be trivial.
library(data.table)
setDT(df)
df[df.ref, on = .(loc.id, years, day >= index1, day <= index2), sum(rain), by = .EACHI]
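For completeness, a sketch of how the remaining steps might look under this approach (the answer leaves them to the reader; the column names sum_1/sum_2/sum_3 and the chained merge are assumptions, not part of the answer):

```r
library(data.table)
setDT(df); setDT(df.ref)

# One non-equi join per window, summing rain for each df.ref row (.EACHI),
# then keeping only the grouping keys and the sum
s1 <- df[df.ref, on = .(loc.id, years, day >= index1, day <= index2),
         .(sum_1 = sum(rain)), by = .EACHI][, .(loc.id, years, sum_1)]
s2 <- df[df.ref, on = .(loc.id, years, day >= index1, day <= index3),
         .(sum_2 = sum(rain)), by = .EACHI][, .(loc.id, years, sum_2)]
s3 <- df[df.ref, on = .(loc.id, years, day >= index1, day <= index4),
         .(sum_3 = sum(rain)), by = .EACHI][, .(loc.id, years, sum_3)]

# Join the three results on the grouping keys
result <- s1[s2, on = .(loc.id, years)][s3, on = .(loc.id, years)]
```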
Source: https://stackoverflow.com/questions/49544285/summing-a-dataframe-based-on-another-dataframe