Importing financial data into Python Pandas using read_csv


I have a .csv with the following structure:

date_begin,date_end,name,name_code,active_accounts,transaction_amount,transaction_count
1/1/2008,1/31/2008,Name_1         


        
1 Answer

    You could use vectorized string methods to clean up the numeric columns (stripping thousands separators and any other non-numeric characters) after the call to read_csv:

    import pandas as pd
    import decimal
    D = decimal.Decimal

    # Combine the two date columns into a single date_begin_date_end column.
    # (Note: nested parse_dates lists and infer_datetime_format are deprecated
    # in newer pandas versions.)
    data = pd.read_csv('data', parse_dates=[[0, 1]], infer_datetime_format=True)

    # Strip the thousands separators and cast the count columns to integers.
    for col in ('active_accounts', 'transaction_count'):
        data[col] = data[col].str.replace(',', '', regex=False).astype(int)

    # Keep only digits, signs and the decimal point, then convert each string
    # to a decimal.Decimal (the column keeps object dtype).
    data['transaction_amount'] = (data['transaction_amount']
                                  .str.replace(r'[^-+\d.]', '', regex=True)
                                  .map(D))

    print(data.dtypes)
    # date_begin_date_end    object
    # name                   object
    # name_code               int64
    # active_accounts         int64
    # transaction_amount     object
    # transaction_count       int64
    # dtype: object
    
    print(data)
    

    yields

        date_begin_date_end    name  name_code  active_accounts  \
    0    1/1/2008 1/31/2008  Name_1       1001           123456   
    1    2/1/2008 2/29/2008  Name_1       1001            43210   
    2    3/1/2008 3/31/2008  Name_1       1001           485079   
    3  12/1/2008 12/31/2008  Name_1       1001            87543   
    4    1/1/2008 1/31/2008  Name_2       1002           268456   
    5    2/1/2008 2/29/2008  Name_2       1002            53210   
    
      transaction_amount  transaction_count  
    0         7890123.45              67890  
    1          987654.32             109876  
    2      1265789433.98             777888  
    3          432098987              87987  
    4          890123.45              97890  
    5          987654.32             109876  
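
    If you don't need exact decimal arithmetic, the cleaned column could also be cast to float64 to get a proper numeric dtype. A minimal sketch, assuming the data frame from the snippet above; transaction_amount_float is just an illustrative name:

    # transaction_amount currently holds decimal.Decimal objects (object dtype);
    # casting to float gives an ordinary numeric column instead.
    data['transaction_amount_float'] = data['transaction_amount'].astype(float)
    print(data['transaction_amount_float'].dtype)   # float64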
    

    PS. read_csv does have a converters parameter with which you could supply a function to parse the problematic columns. These functions are called once per string, so if you have a lot of rows that could require a lot of Python function calls. Handling the columns with vectorized string methods, as shown above, should be far quicker.

    import pandas as pd
    import re
    import decimal
    D = decimal.Decimal

    def make_parser(cls):
        # Build a converter that strips everything except digits, signs and
        # the decimal point, then casts what remains with `cls`.
        def parse_commas(text):
            return cls(re.sub(r'[^-+\d.]', '', text))
        return parse_commas

    to_int = make_parser(int)
    to_decimal = make_parser(D)

    # Columns 4, 5 and 6 are active_accounts, transaction_amount and
    # transaction_count, respectively.
    data = pd.read_csv('data', parse_dates=[[0, 1]], infer_datetime_format=True,
                       converters={4: to_int, 5: to_decimal, 6: to_int})
    
    print(data)
    

    yields

        date_begin_date_end    name  name_code  active_accounts  \
    0    1/1/2008 1/31/2008  Name_1       1001           123456   
    1    2/1/2008 2/29/2008  Name_1       1001            43210   
    2    3/1/2008 3/31/2008  Name_1       1001           485079   
    3  12/1/2008 12/31/2008  Name_1       1001            87543   
    4    1/1/2008 1/31/2008  Name_2       1002           268456   
    5    2/1/2008 2/29/2008  Name_2       1002            53210   
    
      transaction_amount  transaction_count  
    0         7890123.45              67890  
    1          987654.32             109876  
    2      1265789433.98             777888  
    3          432098987              87987  
    4          890123.45              97890  
    5          987654.32             109876  
    

    and the values in the transaction_amount column are decimal.Decimals:

    In [64]: data.loc[0, 'transaction_amount']
    Out[64]: Decimal('7890123.45')
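
    One more aside: if the numeric columns contain nothing but comma thousands separators (no currency symbols) and plain ints/floats are good enough, read_csv can strip the separators itself via its thousands argument. A minimal sketch, assuming the same file and column layout as above:

    import pandas as pd

    # Let the parser remove the thousands separators while reading; the
    # numeric columns then come back as int64/float64 rather than strings.
    data = pd.read_csv('data', thousands=',', parse_dates=[[0, 1]],
                       infer_datetime_format=True)
    print(data.dtypes)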
    