问题
I would like to convert the following data.frame into a matrix where the the number of each bike station id that occurs per hour is counted.
> dim(test) [1] 80623 5 head(test, n = 10) bikeid end.station.id start.station.id diff.time hour 1 16052 244 322 6544 14 2 16052 284 432 3406 21 3 16052 461 519 33416 3 4 16052 228 519 26876 13 5 16052 72 435 388 17 6 16052 319 127 27702 11 7 16052 282 2002 33882 4 8 16052 524 2021 2525 10 9 16052 387 351 2397 12 10 16052 388 526 32507 13
The output should look like this.
> sample2 start.station.id 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 1 72 44 1 42 22 9 33 39 47 12 30 39 52 43 45 40 62 9 35 24 43 65 59 58 34 2 79 21 11 2 42 5 18 57 64 32 47 61 43 65 38 46 61 48 29 58 22 35 4 50 31 3 82 19 44 7 52 14 19 3 30 25 60 33 60 48 54 25 24 42 62 13 51 23 43 54 7 4 83 45 60 64 5 0 3 54 16 48 67 49 20 59 21 24 38 42 62 38 24 1 35 16 4 5 116 27 62 64 44 55 65 23 13 36 0 62 54 61 6 16 7 58 41 29 1 34 58 35 67 6 119 45 30 41 26 7 39 16 55 28 53 42 9 5 31 18 16 14 37 17 14 16 17 23 50 7 120 3 2 7 53 21 33 31 48 19 50 35 47 8 17 30 9 49 4 48 28 52 9 57 55 8 127 33 44 47 42 6 46 39 30 39 28 19 57 53 41 45 55 9 27 42 19 43 24 37 55 9 137 53 11 60 1 66 37 16 5 2 58 0 46 33 0 60 54 25 66 65 40 36 47 58 40 10 143 61 1 50 62 57 33 12 15 27 19 65 48 12 55 64 14 22 13 12 57 45 13 66 56 66 56
I was advised to use a formula similar to :
matrix <- test %>%
group_by(start.station.id, hour)%>%
summarise(sum = nrow) %>%
spread(hour, nrow)
but do not know how to code it properly
回答1:
Using data.table
:
library(data.table) #1.9.6+
setDT(test)
dcast(test[ , .N, by = .(start.station.id, hour)],
start.station.id ~ hour, value.var = "N")
Alternatively (slower, though cleaner):
dcast(test, start.station.id ~ hour, fun.aggregate = length, value.var = "hour")
Testing on some fake data:
set.seed(10932)
NN <- 1e6
test <- data.table(start.station.id = sample(1000, NN, T),
hour = sample(24, NN, T))
library(microbenchmark)
microbenchmark(times = 100L,
preagg = dcast(test[ , .N, by = .(start.station.id, hour)],
start.station.id ~ hour, value.var = "N"),
postagg = dcast(test, start.station.id ~ hour,
fun.aggregate = length, value.var = "hour"))
Unit: milliseconds
expr min lq mean median uq max neval
preagg 55.83240 59.88939 66.56289 61.37408 64.37049 166.8902 100
postagg 91.16012 93.68588 101.17297 96.04823 101.20717 203.4270 100
The reason the first is faster is that the operation test[ , .N, by = vars]
has been optimized in data.table
.
来源:https://stackoverflow.com/questions/35155603/groupby-and-transform-two-columns-in-df-into-matrix-r