Fast split of a data.frame into a list of data.frames in R

Question

I have a data.frame with 93 million rows and 3 numeric variables. The third variable, "component", groups rows by an id. The data is the edge list of a huge graph, and the component number indicates which rows belong to the same connected component. There are about 83 million such components.

I am now trying to split the data frame into a list of 83 million data.frames. I do this in order to apply some igraph functions to each component.
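To make the intended result concrete, here is a minimal base R sketch on toy data. The column names `edge_a` and `edge_b` and all the values are made up for illustration; only the `component` column mirrors the real data.

```r
# Toy edge list: two endpoint columns plus a component id (hypothetical values)
edges <- data.frame(
  edge_a    = c(1, 1, 2, 5, 6, 7),
  edge_b    = c(10, 11, 10, 20, 20, 30),
  component = c(1, 1, 1, 2, 2, 3)
)

# split() returns a named list with one data.frame per component id
parts <- split(edges, edges$component)

# Number of edges in each component
component_sizes <- vapply(parts, nrow, integer(1))
print(component_sizes)
```

Each list element is a full data.frame with its own attributes, which is probably a large part of why the list grows so much bigger than the original object when most components hold a single row.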

This SO answer indicates that split() is the solution for this.

# library() attaches one package per call
library(dplyr)
library(data.table)
library(igraph)

# d6a: data.frame with edge A, edge B, component; 93 million rows, 83 million components, object.size = 2.4 GB
d6b <- d6a %>% split(f = d6a$component )
# This takes 7.1 hours to run and creates a 94.8 GB object

# Then try to run igraph on each element of the list
d6b %>% lapply(graph_from_data_frame,directed = TRUE) -> g6a
# The code above ran for 20 hours without finishing

Is there a faster way to do this? Is there another structure that does not become so large?

EDIT: based on Gregor's comment below, I changed the workflow:

# Selecting only the non-trivial components
# removing all 1:n or n:1 (including the 70 million 1:1)
d6a %>% group_by(component) %>% 
  mutate(N_edges=n(),
         N_cpf=n_distinct(cpf),
         N_pis=n_distinct(pis)) -> d6b #takes 1h
d6b_dt <- data.table(d6b) # takes 11min
d6b_dtf <- d6b_dt[N_cpf>1 & N_pis>1] # 5s
setkey(d6b_dtf, component) #1s
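The same filter can probably be done in data.table alone, which would skip the grouped dplyr `mutate()` and the conversion step. This is a minimal sketch on made-up toy data, not a benchmark. The columns `cpf`, `pis` and `component` follow the code above, while the object names `edges` and `nontrivial` are hypothetical.

```r
library(data.table)

# Toy edge list (hypothetical values): component 1 is n:m, 2 is n:1, 3 is 1:1
edges <- data.table(
  cpf       = c(1, 1, 2, 5, 6, 7),
  pis       = c(10, 11, 10, 20, 20, 30),
  component = c(1, 1, 1, 2, 2, 3)
)

# Add per-component counts by reference, without copying the table
edges[, `:=`(
  N_edges = .N,
  N_cpf   = uniqueN(cpf),
  N_pis   = uniqueN(pis)
), by = component]

# Keep only components with more than one cpf and more than one pis
nontrivial <- edges[N_cpf > 1 & N_pis > 1]
setkey(nontrivial, component)
print(nontrivial)
```

On the real data, `setDT(d6a)` would convert the existing data.frame in place instead of building a new `data.table()` copy.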

Then I tried to implement the suggestion:

d6b_dtf %>% group_by(component) %>% select(cpf,pis) %>% 
  do(graph_from_data_frame, directed = TRUE) -> g_d6b_dtf

I get the following error message:

Adding missing grouping variables: `component`
Error: Arguments to do() must either be all named or all unnamed
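As far as I can tell, the error arises because `do()` receives one unnamed argument (the function) and one named argument (`directed = TRUE`). `do()` expects expressions that refer to the current group as `.`, not a function plus extra arguments. The first message also shows that `component` is added back as a column, so the two edge columns have to be picked inside the expression. Here is a minimal sketch of a form that `do()` accepts, on made-up toy data. The names `edges` and `graphs` are hypothetical, and `do()` is marked as superseded in recent dplyr versions.

```r
library(dplyr)
library(igraph)

# Toy edge list (hypothetical values) with two non-trivial components
edges <- data.frame(
  cpf       = c(1, 1, 2, 5, 6),
  pis       = c(10, 11, 10, 20, 21),
  component = c(1, 1, 1, 2, 2)
)

# One named expression; `.` is the subset of rows for the current group.
# Only cpf and pis are passed on, so `component` is not read as an endpoint.
graphs <- edges %>%
  group_by(component) %>%
  do(g = graph_from_data_frame(as.data.frame(.)[, c("cpf", "pis")],
                               directed = TRUE))

# `graphs` has one row per component, with the igraph objects in list column g
print(vcount(graphs$g[[1]]))
print(ecount(graphs$g[[1]]))
```

This only addresses the error message. Whether it is fast enough for millions of components is a separate question.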

Source: https://stackoverflow.com/questions/39239930/fast-split-of-data-frame-into-list-of-data-frames-in-r

Tags

list

split

igraph