How to remove curly braces, apostrophes and square brackets from dictionaries in a Pandas dataframe (Python)


I have the following data in a csv file:

from StringIO import StringIO
import pandas as pd

the_data = """
ABC,2016-6-9 0:00,95,{'//PurpleCar': [115L],


                      
3 answers

慢半拍i (2020-12-21 23:45):

This should do the trick:

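from io import StringIO
import pandas as pd

# '|' never occurs in the data, so every line is read as one string column.
# (the_data is the CSV text from the question.)
s = pd.read_csv(StringIO(the_data), sep='|', header=None).squeeze('columns')

# Split once and reuse the pieces for the fixed and the dict-like parts.
parts = s.str.split(',')
left = parts.str[:3].apply(pd.Series)
left.columns = ['Company', 'Date', 'Volume']

# regex=True must be passed explicitly in pandas 2.0 and later.
right = parts.str[3:].str.join(',') \
         .str.replace(r'[\[\]\{\}\']', '', regex=True) \
         .str.replace(r'(:\s+\d+)L', r'\1', regex=True) \
         .str.split(',', expand=True)
right.columns = ['Car{}'.format(i) for i in range(1, 5)]

pd.concat([left, right], axis=1)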
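
To see what the two substitutions do, here is a minimal sketch that applies the same patterns to a single string with the standard re module. The sample string is a hypothetical fragment modelled on the first row of the question, not data taken from the original file.

```python
import re

# Hypothetical dict-like tail of one row (assumed shape, for illustration only).
raw = "{'//PurpleCar': [115L], '//YellowCar': [403L]}"

# Step 1: drop square brackets, curly braces and apostrophes.
no_punct = re.sub(r"[\[\]{}']", "", raw)

# Step 2: drop the Python 2 long suffix "L" that follows ": <digits>".
cleaned = re.sub(r"(:\s+\d+)L", r"\1", no_punct)

print(cleaned)  # //PurpleCar: 115, //YellowCar: 403
```

The pandas version does the same thing column-wise through .str.replace.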



                                                                        
                                                        
            
            
              
                
抹茶落季 (2020-12-21 23:50):

Edit: The file actually seems to be a properly escaped CSV, so the custom parsing in this part is not needed.

As @Blckknght points out in the comments, the file is not a valid CSV. My answer therefore makes the following assumptions:

1. You don't control the data and thus can't properly escape the commas.
2. The first three columns never contain a comma.
3. The fourth (last) column follows the syntax of a Python dict.
4. Each value in the dict is a list containing exactly one item.


First, some imports

import ast
import pandas as pd


Because no CSV escaping has to be handled (assumptions #1 and #2), we simply split each row on its first three commas, so the dict string stays in one piece.


rows = (line.split(",", 3) for line in the_data.splitlines() if line.strip() != "")

fixed_columns = pd.DataFrame.from_records(rows, columns=["Company", "Date", "Value", "Cars_str"])
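
The effect of the maxsplit argument can be checked on one line. This is a small sketch; the sample line is a hypothetical row in the assumed format, shortened to two cars.

```python
# Hypothetical input line (assumed format, shortened for illustration).
line = "ABC,2016-6-9 0:00,95,{'//PurpleCar': [115L], '//YellowCar': [403L]}"

# Split on the first three commas only; commas inside the dict are left alone.
company, date, value, cars_str = line.split(",", 3)

print(company)   # ABC
print(date)      # 2016-6-9 0:00
print(value)     # 95
print(cars_str)  # {'//PurpleCar': [115L], '//YellowCar': [403L]}
```

Without the limit, the dict string would be cut at every comma it contains.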




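If the file is a properly escaped CSV (see the edit at the top), pd.read_csv can parse it directly instead:

fixed_columns = pd.read_csv(..., names=["Company", "Date", "Value", "Cars_str"])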


The first three columns are fixed, so we leave them as they are. The last column can be parsed with ast.literal_eval because it is a dict (assumption #3). In my opinion this is more readable than a regex and more flexible if the format changes. It also fails loudly, so you will notice a format change earlier.

cars = fixed_columns["Cars_str"].apply(ast.literal_eval)
del fixed_columns["Cars_str"]
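
One caveat: the sample row in the question shows values such as 115L, which are Python 2 long literals. Under Python 3, ast.literal_eval rejects that suffix, so a minimal sketch might strip it before parsing. The helper name parse_cars and the sample string are hypothetical, and the pattern assumes the suffix only appears right before a comma or a closing bracket.

```python
import ast
import re

def parse_cars(cars_str):
    """Parse a dict-like string, tolerating Python 2 long literals such as 115L."""
    # Remove an "L" only when it follows digits and precedes ',' or ']',
    # so letters inside the quoted keys are not touched.
    without_suffix = re.sub(r"(\d+)L(?=\s*[,\]])", r"\1", cars_str)
    return ast.literal_eval(without_suffix)

# Hypothetical sample in the assumed format.
sample = "{'//PurpleCar': [115L], '//BlueCar': [16L]}"
print(parse_cars(sample))  # {'//PurpleCar': [115], '//BlueCar': [16]}
```

With such a helper, the apply call would take parse_cars in place of ast.literal_eval.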


This next part rather answers your other question.

We prepare functions that process the keys and values of the dict and that fail as soon as one of our assumptions about the dict's content does not hold.

def get_single_item(list_that_always_has_single_item):
    v, = list_that_always_has_single_item
    return v

def extract_car_name(car_str):
    assert car_str.startswith("//"), car_str
    return car_str[2:]
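
A quick sketch of how these helpers behave on good and bad input. The functions are repeated so the snippet runs on its own, and the sample values are made up.

```python
def get_single_item(list_that_always_has_single_item):
    # Tuple unpacking raises ValueError unless there is exactly one item.
    v, = list_that_always_has_single_item
    return v

def extract_car_name(car_str):
    # The assertion message is the offending string, which eases debugging.
    assert car_str.startswith("//"), car_str
    return car_str[2:]

print(get_single_item([115]))           # 115
print(extract_car_name("//PurpleCar"))  # PurpleCar

# Violating assumption #4 (exactly one item per list) is reported immediately.
try:
    get_single_item([1, 2])
except ValueError as exc:
    print("rejected:", exc)
```

Failing early like this is what makes a silent format change visible.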


We apply the functions and build one pd.Series per row, which allows us to...

dynamic_columns = cars.apply(
    lambda x: pd.Series({
            extract_car_name(k): get_single_item(v) 
            for k, v in x.items()
    }))    


...add the columns to the dataframe

result = pd.concat([fixed_columns, dynamic_columns], axis=1)
result
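
As an alternative sketch (not the approach used above), the same wide table can be built from a list of plain dicts, which avoids creating one pd.Series per row. The two sample rows are hypothetical and the key and value handling is inlined for brevity.

```python
import pandas as pd

# Hypothetical parsed dicts, one per row, in the assumed format.
cars = pd.Series([
    {"//PurpleCar": [115], "//BlueCar": [16]},
    {"//PurpleCar": [32], "//PinkCar": [4]},
])

# Strip the leading "//" from each key and take the single list item.
records = [{k[2:]: v[0] for k, v in d.items()} for d in cars]

# Keys missing from a row become NaN in the resulting frame.
dynamic_columns = pd.DataFrame(records, index=cars.index)
print(dynamic_columns)
```

Cars that do not occur in a row show up as NaN, just as in the table below.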


Finally, we get the table:

  Company            Date Value  BlackCar  BlueCar  NPO-GreenCar  PinkCar  \
0     ABC   2016-6-9 0:00    95       NaN     16.0           NaN      NaN   
1     ABC  2016-6-10 0:00     0       NaN     90.0           NaN      NaN   
2     ABC  2016-6-11 0:00     0       NaN     31.0           NaN      NaN   
3     ABC  2016-6-12 0:00     0       NaN   8888.0           NaN      NaN   
4     ABC  2016-6-13 0:00     0       NaN      4.0           NaN      NaN   
5     DEF  2016-6-16 0:00     0      15.0      NaN           0.0      4.0   
6     DEF  2016-6-17 0:00     0      15.0      NaN           0.0      4.0   
7     DEF  2016-6-18 0:00     0      15.0      NaN           0.0      4.0   
8     DEF  2016-6-19 0:00     0      15.0      NaN           0.0      4.0   
9     DEF  2016-6-20 0:00     0      15.0      NaN           0.0      4.0   

   PurpleCar  WhiteCar-XYZ  YellowCar  
0      115.0           0.0      403.0  
1      219.0           0.0      381.0  
2      817.0           0.0       21.0  
3       80.0           0.0     2011.0  
4       32.0           0.0       15.0  
5       32.0           NaN        NaN  
6       32.0           NaN        NaN  
7       32.0           NaN        NaN  
8       32.0           NaN        NaN  
9       32.0           NaN        NaN  

                                                                        
                                                        
            
            
              
                
渐次进展 (2020-12-21 23:58):

I think it's better to convert each string into two columns:

from io import StringIO
import pandas as pd


df = pd.read_csv(StringIO(the_data), sep=',', header=None)
df.columns = ['Company','Date','Volume','Car1','Car2','Car3','Car4']

cars = ["Car1", "Car2", "Car3", "Car4"]
pattern = r"//(?P<color>.+?)':.*?(?P<value>\d+)"
df2 = pd.concat([df[col].str
                    .extract(pattern)
                    .assign(value=lambda self: pd.to_numeric(self["value"]))
                    for col in cars],
                axis=1, keys=cars)
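
To illustrate what the named groups capture, here is a small sketch that runs the same pattern on single cells with the standard re module. The cell strings are hypothetical examples of what one CSV field might look like after splitting on commas.

```python
import re

pattern = r"//(?P<color>.+?)':.*?(?P<value>\d+)"

# Hypothetical cells (assumed shape of the first and last car fields).
first_cell = "{'//PurpleCar': [115L]"
last_cell = " '//WhiteCar-XYZ': [0L]}"

for cell in (first_cell, last_cell):
    match = re.search(pattern, cell)
    # "color" is the text between "//" and the closing quote,
    # "value" is the first run of digits after the colon.
    print(match.group("color"), int(match.group("value")))
```

Series.str.extract turns each named group into a column of the same name, which is where the color and value columns come from.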


The result:

        Car1             Car2           Car3                Car4      
       color value      color value    color value         color value
0  PurpleCar   115  YellowCar   403  BlueCar    16  WhiteCar-XYZ     0
1  PurpleCar   219  YellowCar   381  BlueCar    90  WhiteCar-XYZ     0
2  PurpleCar   817  YellowCar    21  BlueCar    31  WhiteCar-XYZ     0
3  PurpleCar    80  YellowCar  2011  BlueCar  8888  WhiteCar-XYZ     0
4  PurpleCar    32  YellowCar    15  BlueCar     4  WhiteCar-XYZ     0
5  PurpleCar    32   BlackCar    15  PinkCar     4  NPO-GreenCar     0
6  PurpleCar    32   BlackCar    15  PinkCar     4  NPO-GreenCar     0
7  PurpleCar    32   BlackCar    15  PinkCar     4  NPO-GreenCar     0
8  PurpleCar    32   BlackCar    15  PinkCar     4  NPO-GreenCar     0
9  PurpleCar    32   BlackCar    15  PinkCar     4  NPO-GreenCar     0

                                                                        
                                                        
            
            
              
                