Why is pandas.to_datetime slow for a non-standard time format such as '2014/12/31'?

Asked by 悲哀的现实 on 2020-12-01 03:15

I have a .csv file in the following format:

timestmp, p
2014/12/31 00:31:01:9200, 0.7
2014/12/31 00:31:12:1700, 1.9
...

and when I read it via pd.read_csv and parse the timestmp strings with pd.to_datetime, the conversion is very slow.
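
A minimal sketch of what explicit parsing could look like for this format, assuming the trailing colon-separated field is fractional seconds (the answers below explain why an explicit format helps):

import pandas as pd
from io import StringIO

csv_data = """timestmp, p
2014/12/31 00:31:01:9200, 0.7
2014/12/31 00:31:12:1700, 1.9
"""

df = pd.read_csv(StringIO(csv_data), skipinitialspace=True)

# Turn the non-standard ':9200' fractional field into '.9200' so the
# string matches a strftime pattern, then parse with an explicit format.
ts = df['timestmp'].str.replace(r':(\d+)$', r'.\1', regex=True)
df['timestmp'] = pd.to_datetime(ts, format='%Y/%m/%d %H:%M:%S.%f')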

3 Answers

  •  [愿得一人] 2020-12-01 04:00

    This question has already been sufficiently answered, but I wanted to add the results of some tests I ran while optimizing my own code.

    I was getting this format from an API: "Wed Feb 08 17:58:56 +0000 2017".

    Using the default pd.to_datetime(SERIES) with an implicit conversion, it was taking over an hour to process roughly 20 million rows (depending on how much free memory I had).

    That said, I tested three different explicit conversions against the implicit baseline:

    import pandas as pd

    # Input strings look like "Wed Feb 08 17:58:56 +0000 2017"; split() yields
    # [weekday, month, day, time, utc_offset, year].

    # explicit conversion of essential information only -- parse dt str: concat
    def format_datetime_1(dt_series):

        def get_split_date(strdt):
            split_date = strdt.split()
            # keep month, day, year, time; drop the weekday and UTC offset
            str_date = split_date[1] + ' ' + split_date[2] + ' ' + split_date[5] + ' ' + split_date[3]
            return str_date

        return pd.to_datetime(dt_series.apply(get_split_date), format='%b %d %Y %H:%M:%S')

    # explicit conversion to %c, the locale's "appropriate date and time
    # representation" -- parse dt str: del then join
    def format_datetime_2(dt_series):

        def get_split_date(strdt):
            split_date = strdt.split()
            del split_date[4]  # drop the UTC offset, leaving the %c layout
            return ' '.join(split_date)

        return pd.to_datetime(dt_series.apply(get_split_date), format='%c')

    # same %c target, built by concatenation instead of delete-and-join
    def format_datetime_3(dt_series):

        def get_split_date(strdt):
            split_date = strdt.split()
            str_date = split_date[0] + ' ' + split_date[1] + ' ' + split_date[2] + ' ' + split_date[3] + ' ' + split_date[5]
            return str_date

        return pd.to_datetime(dt_series.apply(get_split_date), format='%c')

    # implicit conversion (baseline): pandas infers the format itself
    def format_datetime_baseline(dt_series):
        return pd.to_datetime(dt_series)

    These were the results:

    # sample of 250k rows
    dt_series_sample = df['created_at'][:250000]
    
    %timeit format_datetime_1(dt_series_sample)        # best of 3: 1.56 s per loop
    %timeit format_datetime_2(dt_series_sample)        # best of 3: 2.09 s per loop
    %timeit format_datetime_3(dt_series_sample)        # best of 3: 1.72 s per loop
    %timeit format_datetime_baseline(dt_series_sample) # best of 3: 1min 9s per loop
    

    The first test results in an impressive 97.7% runtime reduction!

    Somewhat surprisingly, even the locale's "appropriate date and time representation" (%c) takes longer than the minimal explicit format, probably because %c is still semi-implicit.

    Conclusion: the more explicit you are, the faster it will run.
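
    As a footnote to the conclusion: a hedged alternative, assuming a pandas version whose strptime support includes the %z directive, is to hand the API string to pd.to_datetime directly and skip the per-row .apply entirely. The format string below is my own mapping of "Wed Feb 08 17:58:56 +0000 2017", not something from the answer above:

    import pandas as pd

    # '%a %b %d %H:%M:%S %z %Y' matches "Wed Feb 08 17:58:56 +0000 2017"
    # token for token; %z consumes the UTC offset, so no string surgery or
    # Python-level loop is needed. The result is timezone-aware.
    def format_datetime_direct(dt_series):
        return pd.to_datetime(dt_series, format='%a %b %d %H:%M:%S %z %Y')

    Whether this beats the concat approach depends on the pandas version, so treat it as an option to benchmark rather than a guaranteed win.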
