Splitting dictionary/list inside a Pandas Column into Separate Columns

南方客 2020-11-22 02:50

I have data saved in a PostgreSQL database. I am querying this data using Python 2.7 and turning it into a Pandas DataFrame. However, the last column of this dat

12 Answers
  •  天涯浪人
    2020-11-22 03:44

    >>> df
    
       Station ID                        Pollutants
    0        8809  {"a": "46", "b": "3", "c": "12"}
    1        8810   {"a": "36", "b": "5", "c": "8"}
    2        8811              {"b": "2", "c": "7"}
    3        8812                       {"c": "11"}
    4        8813            {"a": "82", "c": "15"}
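
    To reproduce the frame above, a minimal sketch follows. It assumes the Pollutants cells are real Python dicts. If the database driver returns JSON text instead (the question does not say which), the cells need parsing first, as the last lines show. The names as_text and parsed are illustrative only.

```python
import json

import pandas as pd

# Sample data mirroring the frame shown above; cells are Python dicts.
df = pd.DataFrame({
    "Station ID": [8809, 8810, 8811, 8812, 8813],
    "Pollutants": [
        {"a": "46", "b": "3", "c": "12"},
        {"a": "36", "b": "5", "c": "8"},
        {"b": "2", "c": "7"},
        {"c": "11"},
        {"a": "82", "c": "15"},
    ],
})

# Expand the dicts into columns; keys missing from a row become NaN.
expanded = df.join(pd.DataFrame(df["Pollutants"].tolist())).drop(columns="Pollutants")
print(expanded)

# Assumption: if the column holds JSON strings rather than dicts,
# convert each cell with json.loads before expanding.
as_text = df["Pollutants"].map(json.dumps)   # simulate a text column
parsed = as_text.map(json.loads)             # back to dicts
```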
    

    Speed comparison on a large dataset of 10 million rows:

    >>> df = pd.concat([df]*100000).reset_index(drop=True)
    >>> df = pd.concat([df]*20).reset_index(drop=True)
    >>> print(df.shape)
    (10000000, 2)
    
    def apply_drop(df):
        return df.join(df['Pollutants'].apply(pd.Series)).drop('Pollutants', axis=1)  
    
    def json_normalise_drop(df):
        return df.join(pd.json_normalize(df.Pollutants)).drop('Pollutants', axis=1)  
    
    def tolist_drop(df):
        return df.join(pd.DataFrame(df['Pollutants'].tolist())).drop('Pollutants', axis=1)  
    
    def values_tolist_drop(df):
        return df.join(pd.DataFrame(df['Pollutants'].values.tolist())).drop('Pollutants', axis=1)
    
    def pop_tolist(df):
        return df.join(pd.DataFrame(df.pop('Pollutants').tolist()))
    
    def pop_values_tolist(df):
        return df.join(pd.DataFrame(df.pop('Pollutants').values.tolist()))
    
    
    >>> %timeit apply_drop(df.copy())
    1 loop, best of 3: 53min 20s per loop
    >>> %timeit json_normalise_drop(df.copy())
    1 loop, best of 3: 54.9 s per loop
    >>> %timeit tolist_drop(df.copy())
    1 loop, best of 3: 6.62 s per loop
    >>> %timeit values_tolist_drop(df.copy())
    1 loop, best of 3: 6.63 s per loop
    >>> %timeit pop_tolist(df.copy())
    1 loop, best of 3: 5.99 s per loop
    >>> %timeit pop_values_tolist(df.copy())
    1 loop, best of 3: 5.94 s per loop
    
    +---------------------+-----------+
    | apply_drop          | 53min 20s |
    | json_normalise_drop |    54.9 s |
    | tolist_drop         |    6.62 s |
    | values_tolist_drop  |    6.63 s |
    | pop_tolist          |    5.99 s |
    | pop_values_tolist   |    5.94 s |
    +---------------------+-----------+
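
    %timeit is an IPython magic, so the benchmark above will not run in a plain Python script. A rough equivalent with the standard-library timeit module might look like the sketch below. It uses a much smaller frame (2,000 rows, an arbitrary size picked so it finishes quickly) and repeats only three of the approaches. Absolute timings depend on the machine and the pandas version, so no expected numbers are given.

```python
import timeit

import pandas as pd

base = pd.DataFrame({
    "Station ID": [8809, 8810, 8811, 8812, 8813],
    "Pollutants": [
        {"a": "46", "b": "3", "c": "12"},
        {"a": "36", "b": "5", "c": "8"},
        {"b": "2", "c": "7"},
        {"c": "11"},
        {"a": "82", "c": "15"},
    ],
})

# 5 rows * 400 = 2,000 rows (hypothetical size, far smaller than the 10 million above).
df = pd.concat([base] * 400).reset_index(drop=True)

def apply_drop(df):
    return df.join(df["Pollutants"].apply(pd.Series)).drop("Pollutants", axis=1)

def tolist_drop(df):
    return df.join(pd.DataFrame(df["Pollutants"].tolist())).drop("Pollutants", axis=1)

def pop_tolist(df):
    return df.join(pd.DataFrame(df.pop("Pollutants").tolist()))

timings = {}
for fn in (apply_drop, tolist_drop, pop_tolist):
    # Copy inside the timed call, as the original benchmark does,
    # because pop removes the column from its input frame.
    runs = timeit.repeat(lambda: fn(df.copy()), number=1, repeat=3)
    timings[fn.__name__] = min(runs)  # best of 3, like %timeit reports

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```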
    

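    df.join(pd.DataFrame(df.pop('Pollutants').values.tolist())) is the fastest of the six at 5.94 s, with pop_tolist only marginally behind at 5.99 s. Both are roughly 500 times faster than apply(pd.Series). Note that pop removes the Pollutants column from df in place, which is why every timing runs on df.copy().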
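
    One caveat: df.join aligns rows by index, and pd.DataFrame(list_of_dicts) gets a fresh 0..n-1 index. The benchmark is safe because it calls reset_index(drop=True) first, but on a frame with any other index (for example after filtering) the expanded columns will not line up. A minimal sketch of an index-preserving variant follows. The helper name expand_dict_column is hypothetical, not part of pandas.

```python
import pandas as pd

def expand_dict_column(df, column):
    """Return a copy of df with the dict-valued `column` expanded into columns."""
    out = df.copy()                      # leave the caller's frame untouched
    values = out.pop(column)             # remove the dict column from the copy
    # Reuse the original index so that join lines rows up correctly.
    expanded = pd.DataFrame(values.tolist(), index=out.index)
    return out.join(expanded)

# A frame whose index is not the default 0..n-1.
df = pd.DataFrame(
    {
        "Station ID": [8809, 8810, 8811],
        "Pollutants": [{"a": "46", "c": "12"}, {"b": "5"}, {"c": "7"}],
    },
    index=[10, 20, 30],
)

result = expand_dict_column(df, "Pollutants")
print(result)

# Without index=..., the expanded frame is indexed 0, 1, 2, so nothing
# matches the labels 10, 20, 30 and every new column is NaN.
naive = df.join(pd.DataFrame(df["Pollutants"].tolist()))
print(naive)
```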
