Can memmap pandas series. What about a dataframe?

后端未结

关注

 2  805

It seems that I can memmap the underlying data for a python series by creating a mmap\'d ndarray and using it to initialize the Series.

        def assert_re


                      
              相关标签:


      
      
        
          2条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  臣服心动        
                
              
                            
                2020-12-05 06:06
              
            
            
                                                                       
If you change your DataFrame constructor to add the parameter copy=False you will have the behavior you want.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

Edit:  Also, you want to use the underlying ndarray (rather than the pandas series).
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  被撕碎了的回忆        
                
              
                            
                2020-12-05 06:19
              
            
            
                                                                       
OK ... after a lot of digging here's what's going on.
Pandas' DataFrame uses the BlockManager class to organize the data internally. Contrary to the docs, DataFrame is NOT a collection of series but a collection of similarly dtyped matrices. BlockManger groups all the float columns together, all the int columns together, etc..., and their memory (from what I can tell) is kept together.

It can do that without copying the memory ONLY if a single ndarray matrix (a single type) is provided. Note, BlockManager (in theory) also supports not-copying mixed type data in its construction as it may not be necessary to copy this input into same-typed chunked. However, the DataFrame constructor doesn't make a copy ONLY if a single matrix is the data parameter.

In short, if you have mixed types or multiple arrays as input to the constructor, or a provide a dict with a single array, you are out of luck in Pandas, and DataFrame's default BlockManager will copy your data.

In any case, one way to work around this is to force BlockManager to not consolidate-by-type, but to keep each column as a separate 'block'. So, with monkey-patching magic...

        from pandas.core.internals import BlockManager
        class BlockManagerUnconsolidated(BlockManager):
            def __init__(self, *args, **kwargs):
                BlockManager.__init__(self, *args, **kwargs)
                self._is_consolidated = False
                self._known_consolidated = False

            def _consolidate_inplace(self): pass
            def _consolidate(self): return self.blocks


        def df_from_arrays(arrays, columns, index):
            from pandas.core.internals import make_block
            def gen():
                _len = None
                p = 0
                for a in arrays:
                    if _len is None:
                        _len = len(a)
                        assert len(index) == _len
                    assert _len == len(a)
                    yield make_block(values=a.reshape((1,_len)), placement=(p,))
                    p+=1

            blocks = tuple(gen())
            mgr = BlockManagerUnconsolidated(blocks=blocks, axes=[columns, index])
            return pd.DataFrame(mgr, copy=False)


It would be better if DataFrame or BlockManger had a consolidate=False (or assumed this behavior) if copy=False was specified.

To test:

    def assert_readonly(iloc):
       try:
           iloc[0] = 999 # Should be non-editable
           raise Exception("MUST BE READ ONLY (1)")
       except ValueError as e:
           assert "read-only" in e.message

    # Original ndarray
    n = 1000
    _arr = np.arange(0,1000, dtype=float)

    # Convert it to a memmap
    mm = np.memmap(filename, mode='w+', shape=_arr.shape, dtype=_arr.dtype)
    mm[:] = _arr[:]
    del _arr
    mm.flush()
    mm.flags['WRITEABLE'] = False  # Make immutable!

        df = df_from_arrays(
            [mm, mm, mm],
            columns=['a', 'b', 'c'],
            index=range(len(mm)))
        assert_read_only(df["a"].iloc)
        assert_read_only(df["b"].iloc)
        assert_read_only(df["c"].iloc)


It seems a little questionable to me whether there's really practical benefits to BlockManager requiring similarly typed data to be kept together -- most of the operations in Pandas are label-row-wise, or per column -- this follows from a DataFrame being a structure of heterogeneous columns that are usually only associated by their index. Though feasibly they're keeping one index per 'block', gaining benefit if the index keeps offsets into the block (if this was the case, then they should groups by sizeof(dtype), which I don't think is the case). 
Ho hum...

There was some discussion about a PR to provide a non-copying constructor, which was abandoned.

It looks like there's sensible plans to phase out BlockManager, so your mileage many vary.

Also see Pandas under the hood, which helped me a lot.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复