问题
I have a data set that looks something like this:
id1 id2 size
1 5400 5505 7
2 5033 5458 1
3 5452 2873 24
4 5452 5213 2
5 5452 4242 26
6 4823 4823 4
7 5505 5400 11
Where id1
and id2
are unique nodes in a graph, and size
is a value assigned to the directed edge connecting them from id1
to id2
. This data set is fairly large (a little over 2 million rows). What I would like to do is sum the size column, grouped by unordered node pairs of id1
and id2
. For example, in the first row, we have id1=5400
and id2=5505
. There exists another row in the data frame where id1=5505
and id2=5400
. In the grouped data, the sum of the size columns for these two rows would be added to a single row. So in other words I want to summarize the data where I\'m grouping on an (unordered) set of (id1,id2). I\'ve found a way to do this using apply
with a custom function that checks for the reversed column pair in the full data set, but this works excruciatingly slow. Does anyone know of a way to do this another way, perhaps with plyr
or with something in the base packages that would be more efficient?
回答1:
One way is to create extra columns with pmax
and pmin
of id1
and id2
as follows. I'll use data.table
solution here.
require(data.table)
DT <- data.table(DF)
# Following mnel's suggestion, g1, g2 could be used directly in by
# and it could be even shortened by using `id1` and id2` as their names
DT.OUT <- DT[, list(size=sum(size)),
by=list(id1 = pmin(id1, id2), id2 = pmax(id1, id2))]
# id1 id2 size
# 1: 5400 5505 18
# 2: 5033 5458 1
# 3: 5452 2873 24
# 4: 5452 5213 2
# 5: 5452 4242 26
# 6: 4823 4823 4
回答2:
an alternate method:
R> library(igraph)
R> DF
id1 id2 size
1 5400 5505 7
2 5033 5458 1
3 5452 2873 24
4 5452 5213 2
5 5452 4242 26
6 4823 4823 4
7 5505 5400 11
R> g <- graph.data.frame(DF, directed=F)
R> g <- simplify(g, edge.attr.comb="sum", remove.loops=FALSE)
R> DF <- get.data.frame(g)
R> DF
id1 id2 size
1 5400 5505 18
2 5033 5458 1
3 5452 2873 24
4 5452 5213 2
5 5452 4242 26
6 4823 4823 4
回答3:
My method with aggregate{stats}
function :
> df
id1 id2 size
1 5400 5505 7
2 5033 5458 1
3 5452 2873 24
4 5452 5213 2
5 5452 4242 26
6 4823 4823 4
7 5505 5400 11
> df[1:2] <- t(apply(df[1:2], 1, sort))
> aggregate(size ~ id1 + id2, data=df, FUN=sum)
id1 id2 size
1 4823 4823 4
2 2873 5452 24
3 4242 5452 26
4 5213 5452 2
5 5033 5458 1
6 5400 5505 18
来源:https://stackoverflow.com/questions/15487151/aggregate-a-data-frame-based-on-unordered-pairs-of-columns