How to bind data.table without increasing the memory consumption?

Backend · open · 3 answers · 489 views
广开言路 · 2020-12-21 08:28

I have a few huge data.tables dt_1, dt_2, ..., dt_N with the same columns, and I want to bind them together into a single data.table. If I use

dt          
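The snippet above is cut off, but a direct bind presumably looks something like the following sketch (assuming data.table's rbindlist; the table names are stand-ins). The problem it illustrates is that the originals and the combined table then coexist in memory:

```r
library(data.table)

# hypothetical small stand-ins for the huge tables
dt_1 <- data.table(a = 1:3, b = letters[1:3])
dt_2 <- data.table(a = 4:6, b = letters[4:6])

# rbindlist copies all rows into a brand-new table...
dt <- rbindlist(list(dt_1, dt_2))

# ...while dt_1 and dt_2 still occupy their own memory
nrow(dt)  # 6
```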


        
3 Answers
  • 2020-12-21 09:05

    I guess <<- and get can help you with this.

    UPDATE: <<- is not necessary.

    df1 <- data.frame(x1=1:4, x2=letters[1:4], stringsAsFactors=FALSE)
    df2 <- df1
    df3 <- df1
    
    dt.lst <- c("df2", "df3")
    
    for (i in dt.lst) {
      df1 <- rbind(df1, get(i))  # append the next piece
      rm(list=i)                 # then free it immediately
    }
    
    df1
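    Since the question is about data.table specifically, the same remove-as-you-go idea can also be written without the loop, using rbindlist and mget (a sketch; mget looks the tables up by name at the top level, so they can be removed right after the single bind):

    ```r
    library(data.table)

    d1 <- data.table(x1 = 1:4, x2 = letters[1:4])
    d2 <- copy(d1)
    d3 <- copy(d1)

    nms <- c("d1", "d2", "d3")
    dt <- rbindlist(mget(nms))  # one pass, one combined copy
    rm(list = nms)              # drop the originals immediately
    invisible(gc())             # let R return the freed memory

    nrow(dt)  # 12
    ```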
    
  • 2020-12-21 09:20

    You can remove your data.tables after you've bound them; the doubled memory usage is caused by the new data.table consisting of copies of the originals.

    Illustration:

    library(data.table)

    # create some data
    nobs <- 10000
    d1 <- d2 <- d3 <- data.table(a=rnorm(nobs), b=rnorm(nobs))
    dt <- rbindlist(list(d1,d2,d3))
    

    Then we can look at the memory usage per object:

    sort( sapply(ls(),function(x){object.size(get(x))}))
      nobs     d1     d2     d3     dt 
        48 161232 161232 161232 481232 
    

    If memory is so tight that the separate data.tables and the combined data.table cannot coexist, we can use (shocking, but IMHO this case warrants it, as there are a small number of data.tables and it stays easily readable and understandable) a for-loop and get to build the combined data.table while deleting the individual ones at the same time:

    mydts <- c("d1","d2","d3") #vector of datatable names
    
    dt<- data.table() #empty datatable to bind objects to
    
    for(d in mydts){
      dt <- rbind(dt, get(d))
      rm(list=d)
      gc() #garbage collection
    }
    
  • 2020-12-21 09:22

    Another approach, using a temporary file to 'bind':

    library(data.table)

    nobs <- 10000
    d1 <- d2 <- d3 <- data.table(a=rnorm(nobs), b=rnorm(nobs))
    ll<-c('d1','d2','d3')
    tmp<-tempfile()
    
    # Write all, writing header only for the first one
    for(i in seq_along(ll)) {
      write.table(get(ll[i]),tmp,append=(i!=1),row.names=FALSE,col.names=(i==1))
    }
    
    # 'Clean up' the original objects from memory (gc will reclaim it if needed when loading the file)
    rm(list=ll)
    
    # Read the file in the new object
    dt<-fread(tmp)
    
    # Remove the file
    unlink(tmp)
    

    This is obviously slower than the rbind method, but under memory contention it won't be slower than forcing the system to swap out memory pages.
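    For the same temp-file trick, data.table's own fwrite/fread are typically much faster than write.table; a sketch of the equivalent loop (like above, the header is written only for the first table):

    ```r
    library(data.table)

    nobs <- 10000
    d1 <- d2 <- d3 <- data.table(a = rnorm(nobs), b = rnorm(nobs))
    ll <- c("d1", "d2", "d3")
    tmp <- tempfile(fileext = ".csv")

    # write the header for the first table, then append without headers
    for (i in seq_along(ll)) fwrite(get(ll[i]), tmp, append = (i != 1))

    rm(list = ll)
    dt <- fread(tmp)
    unlink(tmp)

    nrow(dt)  # 30000
    ```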

    Of course, if your original objects were loaded from files in the first place, prefer concatenating the files before loading them into R, using a tool better suited to working with files (cat, awk, etc.).
