Merge data files

♀尐吖头ヾ submitted on 2019-12-12 04:58:30

Question


I have the following data frames in R:

Id   Class
@a    64
@b    7
@c    98 

And the second data frame:

SOURCE    TARGET 
@d        @b
@c        @a 

These describe the nodes and the edges of a social network. Each user (all prefixed with @) belongs to a specific community, whose number is listed in the Class column. To analyse the connections between communities, I want to merge these data frames and create a new data frame that looks like this:

SOURCE    TARGET    SOURCE.Class    TARGET.Class 
@a        @i        56               2
@f        @k        90               49 

When I try merge(), R stops responding and I need to terminate it. The data frames contain 20000 rows (node file) and 30000 rows (edge file).

Then I want to know how many records in a given source class have the same target class and percentage of connections between classes.

I will be so happy if someone could help me since I'm very new to R.

EDIT: I think I managed to create the columns with the code below, using match() instead of merge() (rt_nodes contains the columns "id" and "class"; rt_edges contains the columns "source" and "target"):

# match Source in rt_edges with id in rt_nodes
source_idx <- match(rt_edges$Source, rt_nodes$id)

# match Target in rt_edges with id in rt_nodes
target_idx <- match(rt_edges$Target, rt_nodes$id)

# create Source_Class
rt_edges$Source_Class <- rt_nodes$modularity_class[source_idx]

# create Target_Class
rt_edges$Target_Class <- rt_nodes$modularity_class[target_idx]

Now I just need to figure out how I can find the percentage of connections in each class and the percentage of connections with other classes. Any tips on how to do that?


Answer 1:


Question 1: Merge

This requires two separate join operations: an initial join of rt_edges with rt_nodes on Target, followed by a join of the intermediate result with rt_nodes on Source. In addition, all rows of rt_edges should appear in the result.
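For comparison, the same two-step lookup can be sketched in base R with merge(). This is not the approach used in this answer, and the data frames below are toy values that only mirror the column names. One possible reason for the hang described in the question (an assumption, since the original merge() call is not shown) is a call without explicit key columns: when the two data frames share no column names, merge() returns the Cartesian product, which for 20000 x 30000 rows would be far too large. Naming the keys with by.x and by.y avoids that.

```r
# Toy data mirroring the column names used in this answer (values are illustrative)
rt_nodes <- data.frame(id = c("@a", "@b", "@c", "@d", "@f"),
                       Class = c(64, 7, 98, 23, 59),
                       stringsAsFactors = FALSE)
rt_edges <- data.frame(Source = c("@d", "@c", "@a"),
                       Target = c("@b", "@a", "@e"),
                       stringsAsFactors = FALSE)

# First join: look up the class of each Target; all.x = TRUE keeps every edge
step1 <- merge(rt_edges, rt_nodes, by.x = "Target", by.y = "id", all.x = TRUE)
names(step1)[names(step1) == "Class"] <- "Target.Class"

# Second join: look up the class of each Source
step2 <- merge(step1, rt_nodes, by.x = "Source", by.y = "id", all.x = TRUE)
names(step2)[names(step2) == "Class"] <- "Source.Class"

# Reorder columns; note that merge() sorts rows by the key column
base_result <- step2[, c("Source", "Target", "Source.Class", "Target.Class")]
print(base_result)
```

With all.x = TRUE, the edge whose Target is not in rt_nodes is kept and gets NA as its Target.Class.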

The approach below uses data.table. (I've adopted the variable and column names the OP used in the edited code of the question, but note that these are inconsistent with the sample data given by the OP.)

Reading data

library(data.table)
rt_nodes <- fread(
  "id   Class
  @a    64
  @b    7
  @c    98
  @d    23
  @f    59")
rt_edges <- fread(
  "Source    Target 
  @d        @b
  @c        @a
  @a        @e")

Note that additional rows have been added to the sample data provided by the OP to demonstrate the effect of

  • a node (@f) not involved in an edge and
  • an edge (@a -> @e) where one id is missing from rt_nodes.

Twofold join

By default, a data.table join of the form X[Y, on = ...] is a right join: every row of Y is kept. Therefore, rt_edges appears on the right side, i.e., inside the brackets.
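The right-join behaviour can be seen in a minimal sketch. The tables X and Y are hypothetical and only serve to illustrate the point; nomatch = NULL is the documented way to request an inner join in recent data.table versions (older versions use nomatch = 0).

```r
library(data.table)

# Hypothetical lookup table X and query table Y
X <- data.table(id = c("@a", "@b"), Class = c(64, 7))
Y <- data.table(Target = c("@b", "@e"))

# X[Y, on = ...]: every row of Y is kept; rows without a match in X get NA
right_join <- X[Y, on = c(id = "Target")]

# nomatch = NULL drops the unmatched rows instead (an inner join)
inner_join <- X[Y, on = c(id = "Target"), nomatch = NULL]

print(right_join)
print(inner_join)
```

Here right_join has two rows (the second with Class NA for @e), whereas inner_join keeps only the row for @b.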

result <- rt_nodes[rt_nodes[rt_edges, on = c(id = "Target")], on = c(id = "Source")]

# rename columns
setnames(result, c("Source", "Source.Class", "Target", "Target.Class"))

result
#   Source Source.Class Target Target.Class
#1:     @d           23     @b            7
#2:     @c           98     @a           64
#3:     @a           64     @e           NA

All three edges appear in the result. The NA indicates that @e is missing from rt_nodes.
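To list such problem cases directly rather than spotting NAs in the joined result, an anti-join can be used. The following is a sketch on the same sample data; the names missing_target, missing_source and unused_nodes are made up for illustration.

```r
library(data.table)

rt_nodes <- data.table(id = c("@a", "@b", "@c", "@d", "@f"),
                       Class = c(64, 7, 98, 23, 59))
rt_edges <- data.table(Source = c("@d", "@c", "@a"),
                       Target = c("@b", "@a", "@e"))

# Anti-join: edges whose Target has no matching id in rt_nodes
missing_target <- rt_edges[!rt_nodes, on = c(Target = "id")]

# Same check for the Source column
missing_source <- rt_edges[!rt_nodes, on = c(Source = "id")]

# Nodes that do not take part in any edge
unused_nodes <- rt_nodes[!(id %in% c(rt_edges$Source, rt_edges$Target))]

print(missing_target)
print(unused_nodes)
```

For the sample data, missing_target contains the edge @a -> @e, missing_source is empty, and unused_nodes contains @f.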

Question 2

The OP has included a second question (and has also created a new post in the meantime)

Then I want to know how many records in a given source class have the same target class and percentage of connections between classes.

result[, .(.N, share_of_occurrence_in_Target.Class = sum(Source.Class == Target.Class)/.N), 
       by = Source.Class]
#   Source.Class N share_of_occurrence_in_Target.Class
#1:           23 1                                   0
#2:           98 1                                   0
#3:           64 1                                  NA

The counts are 1 and the shares are 0 here because the sample data don't include enough cases of matching classes; the NA in the last row arises because the Target.Class of the edge @a -> @e is unknown. However, the code has been verified to work with the data provided in the other post of the OP.
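To see non-trivial shares, and to get the percentage of connections between each pair of classes, a sketch along the following lines may help. The class labels below are invented for illustration and are not the OP's data; the column names pct_of_source and pct_same are likewise hypothetical.

```r
library(data.table)

# Hypothetical joined edge list; class labels are made up for illustration
result <- data.table(
  Source.Class = c(1, 1, 1, 2, 2, 3),
  Target.Class = c(1, 1, 2, 2, 3, NA)
)

# Number of edges for each (Source.Class, Target.Class) pair
pairs <- result[, .N, by = .(Source.Class, Target.Class)]

# Percentage of each target class within its source class
pairs[, pct_of_source := 100 * N / sum(N), by = Source.Class]

# Percentage of within-class edges per source class;
# na.rm = TRUE ignores edges whose target class is unknown
same_class <- result[, .(n_edges = .N,
                         pct_same = 100 * mean(Source.Class == Target.Class,
                                               na.rm = TRUE)),
                     by = Source.Class]

print(pairs)
print(same_class)
```

Note that a source class whose edges all have an unknown target class (class 3 here) yields NaN for pct_same, since there is nothing left to average after the NAs are removed.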



Source: https://stackoverflow.com/questions/43275478/merge-data-files
