How to remove curly braces, apostrophes and square brackets from dictionaries in a Pandas dataframe (Python)

渐次进展 2020-12-21 23:08

I have the following data in a csv file:

from StringIO import StringIO
import pandas as pd

the_data = """
ABC,2016-6-9 0:00,95,{'//PurpleCar': [115L],


        
3 Answers
  • 2020-12-21 23:45

    This should do the trick:

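    # sep='|' keeps each raw line as one string (assuming the data contains no '|')
    s = pd.read_csv(StringIO(the_data), sep='|', header=None).squeeze('columns')
    
    # The first three comma-separated fields are the fixed columns
    left = s.str.split(',').str[:3].apply(pd.Series)
    left.columns = ['Company', 'Date', 'Volume']
    
    # Rejoin the rest, strip brackets, braces and quotes, then drop the Python 2 'L' suffix
    right = s.str.split(',').str[3:].str.join(',') \
             .str.replace(r'[\[\]\{\}\']', '', regex=True) \
             .str.replace(r'(:\s+\d+)L', r'\1', regex=True) \
             .str.split(',', expand=True)
    right.columns = ['Car{}'.format(i) for i in range(1, 5)]
    
    pd.concat([left, right], axis=1)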
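
As a rough illustration of what the two `str.replace` calls in this answer do, the same patterns can be applied to a single string with the standard `re` module. The sample cell is an assumption: it is rebuilt from the first row of the result tables further down in this thread, not copied from the original file.

```python
import re

# Hypothetical cell text, rebuilt from the first row of the result tables in this thread.
raw = "{'//PurpleCar': [115L], '//YellowCar': [403L], '//BlueCar': [16L], '//WhiteCar-XYZ': [0L]}"

# Step 1: drop square brackets, curly braces and apostrophes.
no_brackets = re.sub(r"[\[\]{}']", "", raw)

# Step 2: drop the Python 2 long suffix 'L' that follows ': <digits>'.
no_suffix = re.sub(r"(:\s+\d+)L", r"\1", no_brackets)

# Step 3: one piece per car; every piece after the first keeps a leading space.
parts = no_suffix.split(",")

print(no_suffix)
print(parts)
```

The `//` prefix and the leading spaces survive this cleanup, so an extra stripping step may be wanted before the values are used.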
    

  • 2020-12-21 23:50

    Edit: The file actually seems to be an escaped CSV, so we don't need custom parsing for this part.

    As @Blckknght points out in a comment, the file is not a valid CSV, so I'll make some assumptions in my answer:

    1. You don't control the data and thus can't properly escape the commas.
    2. The first three columns never contain a comma.
    3. The last (fourth) column follows the syntax of a Python dict.
    4. Each value in the dict is a list that contains exactly one item.

    First, some imports

    import ast
    import pandas as pd
    

    We split each row on its first three commas only, since we don't need to deal with any sort of CSV escaping (assumptions #1 and #2).

    rows = (line.split(",", 3) for line in the_data.splitlines() if line.strip() != "")
    
    fixed_columns = pd.DataFrame.from_records(rows, columns=["Company", "Date", "Value", "Cars_str"])
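
A quick sketch of why the `3` passed to `split` matters. The row is partly hypothetical: its first fields match the sample in the question, while the rest of the dict is filled in from the result table for illustration only.

```python
# Hypothetical row: the start matches the question's sample, the tail is illustrative.
line = "ABC,2016-6-9 0:00,95,{'//PurpleCar': [115L], '//YellowCar': [403L]}"

# maxsplit=3 splits on the first three commas only, so the dict text stays whole.
fields = line.split(",", 3)
print(fields)

# Without maxsplit the dict text is also cut at its inner comma.
unbounded = line.split(",")
print(len(unbounded))
```

With the limit in place the fourth field still contains its inner commas, which is what lets it be parsed as a single dict later.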
    

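    Alternatively, following the edit above, if the file really is a properly escaped CSV, pandas can read it directly:

    fixed_columns = pd.read_csv(..., names=["Company", "Date", "Value", "Cars_str"])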
    

    The first three columns are fixed, so we leave them as they are. The last column can be parsed with ast.literal_eval because it is a dict (assumption #3). IMO this is more readable than a regex and more flexible if the format changes. You will also detect a format change earlier, because malformed input raises an error.

    cars = fixed_columns["Cars_str"].apply(ast.literal_eval)
    del fixed_columns["Cars_str"]
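
One caveat worth spelling out: the `L` suffix on the numbers in the question's sample (`115L`) is Python 2 long-integer syntax, which `ast.literal_eval` rejects on Python 3. Below is a minimal sketch of a possible workaround, assuming Python 3 and assuming no car name contains a digit immediately followed by `L`. The function `parse_cars` and the sample cell are hypothetical and not part of the original answer.

```python
import ast
import re

# Hypothetical cell text in the format shown in the question.
cell = "{'//PurpleCar': [115L], '//YellowCar': [403L]}"

def parse_cars(cell_text):
    """Strip Python 2 long suffixes, then parse the text as a Python literal."""
    # Assumption: an 'L' directly after a digit is always a long suffix.
    cleaned = re.sub(r"(?<=\d)L\b", "", cell_text)
    return ast.literal_eval(cleaned)

# On Python 3 the raw text is not a valid literal.
try:
    ast.literal_eval(cell)
    raw_parses = True
except SyntaxError:
    raw_parses = False

parsed = parse_cars(cell)
print(raw_parses, parsed)
```

If this route is taken, `cars = fixed_columns["Cars_str"].apply(parse_cars)` would stand in for the `apply(ast.literal_eval)` call.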
    

    The next part is really an answer to your other question.

    We prepare functions to process the keys and values of the dict so that they fail if our assumptions about the content of the dict do not hold.

    def get_single_item(list_that_always_has_single_item):
        v, = list_that_always_has_single_item
        return v
    
    def extract_car_name(car_str):
        assert car_str.startswith("//"), car_str
        return car_str[2:]
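
A small sketch of how these two helpers behave when an assumption breaks. The helpers are repeated so the snippet runs on its own; `fails_with` is a hypothetical function added only for this demonstration.

```python
def get_single_item(list_that_always_has_single_item):
    # Tuple unpacking raises ValueError unless there is exactly one item.
    v, = list_that_always_has_single_item
    return v

def extract_car_name(car_str):
    # Raises AssertionError when the expected '//' prefix is missing.
    assert car_str.startswith("//"), car_str
    return car_str[2:]

def fails_with(exc_type, func, arg):
    """Return True if func(arg) raises exc_type (hypothetical demo helper)."""
    try:
        func(arg)
    except exc_type:
        return True
    return False

print(get_single_item([115]))
print(extract_car_name("//PurpleCar"))
print(fails_with(ValueError, get_single_item, [1, 2]))
print(fails_with(ValueError, get_single_item, []))
print(fails_with(AssertionError, extract_car_name, "PurpleCar"))
```

Keep in mind that `assert` statements are skipped when Python runs with `-O`, so raising an explicit exception may be the safer choice if the check has to hold in every environment.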
    

    We apply the functions and construct pd.Series which allow us to...

    dynamic_columns = cars.apply(
        lambda x: pd.Series({
                extract_car_name(k): get_single_item(v) 
                for k, v in x.items()
        }))    
    

    ...add the columns to the dataframe

    result = pd.concat([fixed_columns, dynamic_columns], axis=1)
    result
    

    Finally, we get the table:

      Company            Date Value  BlackCar  BlueCar  NPO-GreenCar  PinkCar  \
    0     ABC   2016-6-9 0:00    95       NaN     16.0           NaN      NaN   
    1     ABC  2016-6-10 0:00     0       NaN     90.0           NaN      NaN   
    2     ABC  2016-6-11 0:00     0       NaN     31.0           NaN      NaN   
    3     ABC  2016-6-12 0:00     0       NaN   8888.0           NaN      NaN   
    4     ABC  2016-6-13 0:00     0       NaN      4.0           NaN      NaN   
    5     DEF  2016-6-16 0:00     0      15.0      NaN           0.0      4.0   
    6     DEF  2016-6-17 0:00     0      15.0      NaN           0.0      4.0   
    7     DEF  2016-6-18 0:00     0      15.0      NaN           0.0      4.0   
    8     DEF  2016-6-19 0:00     0      15.0      NaN           0.0      4.0   
    9     DEF  2016-6-20 0:00     0      15.0      NaN           0.0      4.0   
    
       PurpleCar  WhiteCar-XYZ  YellowCar  
    0      115.0           0.0      403.0  
    1      219.0           0.0      381.0  
    2      817.0           0.0       21.0  
    3       80.0           0.0     2011.0  
    4       32.0           0.0       15.0  
    5       32.0           NaN        NaN  
    6       32.0           NaN        NaN  
    7       32.0           NaN        NaN  
    8       32.0           NaN        NaN  
    9       32.0           NaN        NaN  
    
  • 2020-12-21 23:58

    I think it's better to convert each string into two columns:

    from io import StringIO
    import pandas as pd
    
    
    df = pd.read_csv(StringIO(the_data), sep=',', header=None)
    df.columns = ['Company','Date','Volume','Car1','Car2','Car3','Car4']
    
    cars = ["Car1", "Car2", "Car3", "Car4"]
    pattern = r"//(?P<color>.+?)':.*?(?P<value>\d+)"
    df2 = pd.concat([df[col].str
                        .extract(pattern)
                        .assign(value=lambda self: pd.to_numeric(self["value"]))
                        for col in cars],
                    axis=1, keys=cars)
    

    The result:

            Car1             Car2           Car3                Car4      
           color value      color value    color value         color value
    0  PurpleCar   115  YellowCar   403  BlueCar    16  WhiteCar-XYZ     0
    1  PurpleCar   219  YellowCar   381  BlueCar    90  WhiteCar-XYZ     0
    2  PurpleCar   817  YellowCar    21  BlueCar    31  WhiteCar-XYZ     0
    3  PurpleCar    80  YellowCar  2011  BlueCar  8888  WhiteCar-XYZ     0
    4  PurpleCar    32  YellowCar    15  BlueCar     4  WhiteCar-XYZ     0
    5  PurpleCar    32   BlackCar    15  PinkCar     4  NPO-GreenCar     0
    6  PurpleCar    32   BlackCar    15  PinkCar     4  NPO-GreenCar     0
    7  PurpleCar    32   BlackCar    15  PinkCar     4  NPO-GreenCar     0
    8  PurpleCar    32   BlackCar    15  PinkCar     4  NPO-GreenCar     0
    9  PurpleCar    32   BlackCar    15  PinkCar     4  NPO-GreenCar     0
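
To see what the named groups in this answer's pattern capture, it can be tried on individual strings with the standard `re` module, which should agree with `str.extract` in taking the first match. The two cells are hypothetical examples of what the `Car1` and `Car4` columns might hold after the line is split on commas.

```python
import re

# Same pattern as in the answer: lazy 'color' up to the closing quote, then the first run of digits.
pattern = r"//(?P<color>.+?)':.*?(?P<value>\d+)"

# Hypothetical cells as they might look after splitting a line on commas.
cells = ["{'//PurpleCar': [115L]", " '//WhiteCar-XYZ': [0L]}"]

extracted = []
for cell in cells:
    match = re.search(pattern, cell)
    # Convert the captured digits to int, like pd.to_numeric does in the answer.
    extracted.append((match.group("color"), int(match.group("value"))))

print(extracted)
```

Because `\d+` stops at the first non-digit, the trailing `L` is never captured and needs no separate cleanup.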
    