Question
I have the following data frames in R:
Id Class
@a 64
@b 7
@c 98
And the second data frame:
SOURCE TARGET
@d @b
@c @a
This describes the nodes and the edges in a social network. The users (all with @ in front) each belong to a specific community, whose number is listed in the column Class. To analyse the connections between the columns I want to merge these data frames and create a new data frame looking like this:
SOURCE TARGET SOURCE.Class TARGET.Class
@a @i 56 2
@f @k 90 49
When I try merge(), R stops responding and I have to terminate it. The data frames have 20000 rows (node file) and 30000 rows (edge file).
Then I want to know how many records in a given source class have the same target class and percentage of connections between classes.
I will be so happy if someone could help me since I'm very new to R.
EDIT:
I think I managed to create the columns with this code, using match() instead of merge() (rt_nodes contains the columns "id", "class" and rt_edges contains the columns "source", "target"):
#match source in rt_edges with id in rt_node
match(rt_edges$Source,rt_nodes$id)
#match target in rt_edges with id in rt_node
match(rt_edges$Target,rt_nodes$id)
#create source_class
rt_nodes$modularity_class[match(rt_edges$Source,rt_nodes$id)]
rt_edges$Source_Class=rt_nodes$modularity_class[match(rt_edges$Source,rt_nodes$id)]
#create target_class
rt_nodes$modularity_class[match(rt_edges$Target,rt_nodes$id)]
rt_edges$Target_Class=rt_nodes$modularity_class[match(rt_edges$Target,rt_nodes$id)]
Now I just need to figure out how I can find the percentage of connections in each class and the percentage of connections with other classes. Any tips on how to do that?
Answer 1:
Question 1: Merge
This requires two separate join operations: an initial join of rt_edges with rt_nodes on Target and a subsequent join of the intermediate result with rt_nodes on Source. In addition, all rows of rt_edges should appear in the result.
The approach below uses data.table. (I've adopted the variable and column names the OP used in the edited code of the question, but note that these are inconsistent with the sample data given by the OP.)
Reading data
library(data.table)
rt_nodes <- fread(
"id Class
@a 64
@b 7
@c 98
@d 23
@f 59")
rt_edges <- fread(
"Source Target
@d @b
@c @a
@a @e")
Note that additional rows have been added to the sample data provided by the OP to demonstrate the effect of
- a node (@f) not involved in an edge and
- an edge (@a -> @e) where one id is missing from rt_nodes.
Twofold join
By default, joins in data.table are right joins. Therefore, rt_edges appears on the right side.
result <- rt_nodes[rt_nodes[rt_edges, on = c(id = "Target")], on = c(id = "Source")]
# rename columns
setnames(result, c("Source", "Source.Class", "Target", "Target.Class"))
result
# Source Source.Class Target Target.Class
#1: @d 23 @b 7
#2: @c 98 @a 64
#3: @a 64 @e NA
All three edges appear in the result. The NA indicates that @e is missing from rt_nodes.
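As a cross-check, the same lookup can be sketched with data.table's update join (`:=` inside a join), which adds the class columns to a copy of rt_edges by reference instead of building a new table. This is an alternative to the twofold join above, not the code used in this answer; the names `edges` and `unmatched` are made up for illustration, and the data are the sample rows from above.

```r
library(data.table)

# Sample data as above, built directly instead of via fread()
rt_nodes <- data.table(id = c("@a", "@b", "@c", "@d", "@f"),
                       Class = c(64L, 7L, 98L, 23L, 59L))
rt_edges <- data.table(Source = c("@d", "@c", "@a"),
                       Target = c("@b", "@a", "@e"))

# Work on a copy so rt_edges itself stays unchanged
edges <- copy(rt_edges)

# Update joins: look up the class of each endpoint in rt_nodes.
# Edges whose endpoint is not found in rt_nodes keep NA.
edges[rt_nodes, on = c(Source = "id"), Source.Class := i.Class]
edges[rt_nodes, on = c(Target = "id"), Target.Class := i.Class]

# Edges with at least one endpoint missing from rt_nodes
unmatched <- edges[is.na(Source.Class) | is.na(Target.Class)]
unmatched
```

With the sample data, `unmatched` should contain only the edge @a -> @e. Filtering such rows out before computing shares avoids NA results in the aggregation.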
Question 2
The OP has included a second question (and has also created a new post in the meantime):
Then I want to know how many records in a given source class have the same target class and percentage of connections between classes.
result[, .(.N, share_of_occurrence_in_Target.Class = sum(Source.Class == Target.Class)/.N),
by = Source.Class]
# Source.Class N share_of_occurrence_in_Target.Class
#1: 23 1 0
#2: 98 1 0
#3: 64 1 NA
The counts are 1 and the shares are 0 here because the sample data don't include enough cases of matching classes. However, the code has been verified to work with the data provided in the other post of the OP.
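For the "percentage of connections between classes" part, one possible sketch is to count every (Source.Class, Target.Class) pair and express each count as a percentage of its source class and of all edges. The data below are invented purely for illustration (they are not the OP's data), and the names `pairs`, `pct_of_source` and `pct_of_all` are assumptions of this sketch. Edges with an unknown target class are dropped first.

```r
library(data.table)

# Hypothetical edge list with classes already attached (illustration only)
result <- data.table(
  Source.Class = c(1L, 1L, 1L, 2L, 2L, 3L),
  Target.Class = c(1L, 1L, 2L, 2L, 3L, NA)
)

# Number of edges for each pair of classes, ignoring unknown targets
pairs <- result[!is.na(Target.Class), .N, by = .(Source.Class, Target.Class)]

# Percentage within each source class (sums to 100 per Source.Class)
pairs[, pct_of_source := 100 * N / sum(N), by = Source.Class]

# Percentage of all remaining edges (sums to 100 overall)
pairs[, pct_of_all := 100 * N / sum(N)]

pairs
```

Rows where Source.Class equals Target.Class give the within-class share; all other rows describe connections between different classes.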
Source: https://stackoverflow.com/questions/43275478/merge-data-files