Split datetime64 column into a date and time column in pandas dataframe

后端 未结 1 1619
陌清茗
陌清茗 2020-12-16 04:47

If I have a dataframe with the first column being a datetime64 column. How do I split this column into 2 new columns, a date column and a time column. Here is my data and co

相关标签:
1条回答
  • 2020-12-16 05:27

    How to parse the CSV directly into the desired DataFrame:

    Pass a dict of functions to pandas.read_csv's converters keyword argument:

    import pandas as pd
    import datetime as DT
    nfp = pd.read_csv("NFP.csv", 
                      sep=r'[\s,]',              # 1
                      header=None, skiprows=1,
                      converters={               # 2
                          0: lambda x: DT.datetime.strptime(x, '%Y%m%d'),  
                          1: lambda x: DT.time(*map(int, x.split(':')))},
                      names=['Date', 'Time', 'Actual', 'Consensus', 'Previous'])
    
    print(nfp)
    

    yields

            Date      Time  Actual  Consensus  Previous
    0 2014-01-10  13:30:00   74000     196000    241000
    1 2013-12-06  13:30:00  241000     180000    200000
    2 2013-11-08  13:30:00  200000     125000    163000
    3 2013-10-22  12:30:00  163000     180000    193000
    4 2013-09-06  12:30:00  193000     180000    104000
    5 2013-08-02  12:30:00  104000     184000    188000
    6 2013-07-05  12:30:00  188000     165000    176000
    7 2013-06-07  12:30:00  176000     170000    165000
    8 2013-05-03  12:30:00  165000     145000    138000
    9 2013-04-05  12:30:00  138000     200000    268000
    
    1. sep=r'[\s,]' tells read_csv to split lines of the csv on the regex pattern r'[\s,]' -- a whitespace or a comma.
    2. The converters parameter tells read_csv to apply the given functions to certain columns. The keys (e.g. 0 and 1) refer to the column index, and the values are the functions to be applied.

    How to split the DataFrame after performing csv_read

    import pandas as pd
    nfp = pd.read_csv("NFP.csv", parse_dates=[0], infer_datetime_format=True)
    temp = pd.DatetimeIndex(nfp['DateTime'])
    nfp['Date'] = temp.date
    nfp['Time'] = temp.time
    del nfp['DateTime']
    
    print(nfp)
    

    Which is faster?

    It depends on the size of the CSV. (Thanks to Jeff for pointing this out.)

    For tiny CSVs, parsing the CSV into the desired form directly is faster than using a DatetimeIndex after parsing with parse_dates=[0]:

    def using_converter():
        nfp = pd.read_csv("NFP.csv", sep=r'[\s,]', header=None, skiprows=1,
                          converters={
                              0: lambda x: DT.datetime.strptime(x, '%Y%m%d'),
                              1: lambda x: DT.time(*map(int, x.split(':')))},
                          names=['Date', 'Time', 'Actual', 'Consensus', 'Previous'])
        return nfp
    
    def using_index():
        nfp = pd.read_csv("NFP.csv", parse_dates=[0], infer_datetime_format=True)
        temp = pd.DatetimeIndex(nfp['DateTime'])
        nfp['Date'] = temp.date
        nfp['Time'] = temp.time
        del nfp['DateTime']
        return nfp
    
    In [114]: %timeit using_index()
    100 loops, best of 3: 1.71 ms per loop
    
    In [115]: %timeit using_converter()
    1000 loops, best of 3: 914 µs per loop
    

    However, for CSVs of just a few hundred lines or more, using a DatetimeIndex is faster.

    N = 20
    filename = '/tmp/data'
    content = '''\
    DateTime,Actual,Consensus,Previous
    20140110 13:30:00,74000,196000,241000
    20131206 13:30:00,241000,180000,200000
    20131108 13:30:00,200000,125000,163000
    20131022 12:30:00,163000,180000,193000
    20130906 12:30:00,193000,180000,104000
    20130802 12:30:00,104000,184000,188000
    20130705 12:30:00,188000,165000,176000
    20130607 12:30:00,176000,170000,165000
    20130503 12:30:00,165000,145000,138000
    20130405 12:30:00,138000,200000,268000'''
    
    def setup(n):
        header, remainder = content.split('\n', 1)
        with open(filename, 'w') as f:
            f.write('\n'.join([header]+[remainder]*n))
    
    In [304]: setup(50)
    
    In [305]: %timeit using_converter()
    100 loops, best of 3: 9.78 ms per loop
    
    In [306]: %timeit using_index()
    100 loops, best of 3: 9.3 ms per loop
    

    Where can I look up this kind of information?

    1. Sometimes you can find examples in the Pandas Cookbook.
    2. Sometimes web searching or searching Stackoverflow suffices.
    3. Spending a weekend snowed in with nothing to do but reading the pandas documentation will surely help too.
    4. Install IPython. It has tab completion and if you type a ? after a function, it gives you the function's docstring. Those two features really help you introspect Python objects quickly. It also tells you in what file the function is defined (if defined in pure Python) -- which leads me to...
    5. Reading the source code

    Just keep at it. The more you know the easier it gets.

    If you give it your best shot and still can't find the answer, post a question on Stackoverflow. You'll hopefully get an answer quickly, and help others searching for the same thing.

    0 讨论(0)
提交回复
热议问题