Question
I have a data.frame with 93 million rows and 3 numeric variables. The third variable, "component", groups rows by an id. The data is the edge list of a huge graph, and the component number indicates which rows belong to the same connected component. There are about 83 million such components.
I am now trying to split the data frame into a list of 83 million data.frames, so that I can apply some igraph functions to each component.
This SO answer indicates that split() is the solution for this.
# library() attaches one package per call
library(dplyr)
library(data.table)
library(igraph)
# d6a: data.frame with edge A, edge B, component; 93 million rows, 83 million components, object.size = 2.4 GB
d6b <- d6a %>% split(f = d6a$component )
# This takes 7.1 hours to run and creates a 94.8 GB object
# Then try to run igraph on each element of the list
d6b %>% lapply(graph_from_data_frame,directed = TRUE) -> g6a
# The code above ran for 20 hours without finishing
Is there a faster way to do this? Is there another structure that does not become so large?
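One direction that avoids materializing millions of small data.frames is to build a single graph from the whole edge list and let igraph separate the components. The following is only a minimal sketch on a made-up toy edge list (the `edges` object and its values are hypothetical; the column names `cpf` and `pis` are borrowed from the EDIT below). It assumes the ids in the two columns identify vertices unambiguously, and it has not been tested at the 93-million-row scale.

```r
library(igraph)

# Hypothetical toy edge list: cpf and pis are the two endpoints of each edge
edges <- data.frame(
  cpf = c("a", "a", "b", "c", "d"),
  pis = c("x", "y", "x", "z", "w")
)

# One graph for the whole edge list instead of one data.frame per component
g <- graph_from_data_frame(edges, directed = TRUE)

# Weakly connected components of the full graph
comp <- components(g, mode = "weak")

# Keep only the components with at least 3 vertices as separate graphs;
# this drops the trivial 1:1 pairs
big <- decompose(g, mode = "weak", min.vertices = 3)
```

Here `comp$membership` gives a component id per vertex, and `big` is a list of igraph objects, one per non-trivial component.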
EDIT: based on Gregor's comment below, I changed the workflow:
# Selecting only the non-trivial components
# removing all 1:n or n:1 (including the 70 million 1:1)
d6a %>% group_by(component) %>%
  mutate(N_edges=n(),
         N_cpf=n_distinct(cpf),
         N_pis=n_distinct(pis)) -> d6b #takes 1h
d6b_dt <- data.table(d6b) # takes 11min
d6b_dtf <- d6b_dt[N_cpf>1 & N_pis>1] # 5s
setkey(d6b_dtf, component) #1s
Then I tried to implement the suggestion:
d6b_dtf %>% group_by(component) %>% select(cpf,pis) %>%
do(graph_from_data_frame, directed = TRUE) -> g_d6b_dtf
I get the following error message:
Adding missing grouping variables: `component`
Error: Arguments to do() must either be all named or all unnamed
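The error seems to come from mixing an unnamed argument (`graph_from_data_frame`) with a named one (`directed = TRUE`) in the `do()` call. `do()` expects expressions in which `.` stands for the current group, rather than a function followed by its arguments. A minimal sketch of that calling pattern, on a hypothetical toy data frame `toy` standing in for `d6b_dtf`, might look like this; it only illustrates the syntax and says nothing about how it performs on the full data:

```r
library(dplyr)
library(igraph)

# Hypothetical toy edge list with the same column names as d6b_dtf
toy <- data.frame(
  cpf = c("a", "a", "b", "c", "c"),
  pis = c("x", "y", "x", "z", "w"),
  component = c(1, 1, 1, 2, 2)
)

# A single named expression; `.` is the data for the current group
res <- toy %>%
  group_by(component) %>%
  do(g = graph_from_data_frame(.[, c("cpf", "pis")], directed = TRUE))
```

The result has one row per component, with the graphs stored in the list column `g` (for example `res$g[[1]]`).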
Source: https://stackoverflow.com/questions/39239930/fast-split-of-data-frame-into-list-of-data-frames-in-r