Dataframe Slice does not remove Index Values

悲哀的现实 2020-12-03 23:59

I recently had this issue with a large DataFrame and its associated MultiIndex. This simplified example demonstrates the issue.

import pandas as pd
import numpy as np
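
A minimal setup along these lines (the names idx and df_slice here are assumptions, chosen to match the answers below) reproduces the issue:

# Build a small DataFrame with a two-level MultiIndex.
idx = pd.MultiIndex.from_product([['A', 'B'], [5, 6]])
df = pd.DataFrame(np.random.randint(1, 100, (4, 1)), index=idx, columns=['P'])

# Slice out the rows whose second index level equals 5.
df_slice = df.loc[(slice(None), 5), :]

# The slice holds only two rows, yet its MultiIndex still carries the
# unused value 6 in its levels:
print(df_slice.index.levels)   # FrozenList([['A', 'B'], [5, 6]])
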
3 Answers
  • 2020-12-04 00:08

    You can rebuild the MultiIndex from the distinct tuples that remain after the slice, which drops the unused level values, by

    df_slice.index = pd.MultiIndex.from_tuples(df_slice.index.unique(), names=idx.names)
    

    which yields the index

    MultiIndex(levels=[[u'A', u'B'], [5]],
               labels=[[0, 1], [0, 0]])
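
    A side note: newer pandas versions (0.20 and later) also provide MultiIndex.remove_unused_levels(), which drops the unused level values without rebuilding the index from tuples. A minimal sketch, assuming the df_slice from the question:

    # Returns a new MultiIndex whose levels contain only the values that
    # still appear in the sliced index; the rows themselves are unchanged.
    df_slice.index = df_slice.index.remove_unused_levels()
    print(df_slice.index.levels)   # FrozenList([['A', 'B'], [5]])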
    
  • 2020-12-04 00:11

    I understand your concern, but I believe you have to look at what is happening inside pandas at a lower level.

    First, note that indexes are meant to be immutable. You can read more about this in the documentation -> http://pandas.pydata.org/pandas-docs/stable/indexing.html#setting-metadata

    When you create a DataFrame object, let's name it df, and you want to access its rows, what you essentially do is pass a boolean Series that pandas matches against the corresponding index.

    Follow this example:

    index = pd.MultiIndex.from_product([['A','B'],[5,6]])
    df = pd.DataFrame(data=np.random.randint(1,100,(4)), index=index, columns=["P"])
    
          P
    A 5   5
      6  51
    B 5  93
      6  76
    

    Now, let's say we want to select the rows with P > 90. How would you do that? df[df["P"] > 90], right? But look at what df["P"] > 90 actually returns.

    A  5    False
       6    False
    B  5     True
       6    False
    Name: P, dtype: bool
    

    As you can see, it returns a boolean Series aligned with the original index. Why? Because pandas needs to map which index values correspond to a True value, so it can select the proper rows. So basically, during your slice operations you will always carry this index along, because it is the mapping element for the object.

    However, hope is not lost. Depending on your application, if you believe the index is actually taking up a huge portion of your memory, you can spend a little time doing the following:

    def df_sliced_index(df):
        # Rebuild the DataFrame row by row so that the MultiIndex is
        # recreated from only the tuples that actually remain.
        new_index = []
        rows = []
        for ind, row in df.iterrows():
            new_index.append(ind)
            rows.append(row)
        return pd.DataFrame(data=rows, index=pd.MultiIndex.from_tuples(new_index))
    
    df_sliced_index(df[df['P'] > 90]).index
    

    Which yields what I believe is the desired output:

    MultiIndex(levels=[[u'B'], [5]], labels=[[0], [0]])
    

    But if the data is large enough that the size of the index worries you, I wonder how much this row-by-row approach may cost you in terms of time.
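
    If the time cost is a concern, a sketch of a vectorized alternative (the helper name df_sliced_index_fast is just an illustration) is to let pandas rebuild the MultiIndex from the surviving index tuples in a single from_tuples call instead of iterating with iterrows:

    def df_sliced_index_fast(df):
        # Same idea as above, but rebuilds the index in one from_tuples call
        # instead of collecting rows one by one with iterrows.
        out = df.copy()
        out.index = pd.MultiIndex.from_tuples(list(out.index), names=df.index.names)
        return out

    df_sliced_index_fast(df[df['P'] > 90]).index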

  • 2020-12-04 00:25

    My preferred way to do this is

    old_idx = df_slice.index
    new_idx = pd.MultiIndex.from_tuples(old_idx.to_series(), names=old_idx.names)
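
    A quick usage sketch, assuming the df_slice from the question: assign the rebuilt index back and check its levels.

    df_slice.index = new_idx
    print(df_slice.index.levels)   # FrozenList([['A', 'B'], [5]])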
    