Question
Pandas is a great tool for a number of data tasks. Many of its functions are designed to be applied efficiently to whole columns rather than to individual cells or rows. One such function is to_datetime(), which I use as an example later in this question. However, a number of pandas functions, as best I can tell from the documentation, do not operate directly on dataframes. The specific one I am interested in is pandas.Timestamp.isocalendar(), but there is a slew of functions in the pandas.Timestamp class (and likely in other pandas classes as well) that fit this description and have minimal documentation. Is there a way to efficiently broadcast these functions to a full column's worth of data? If so, how would I do that?
Note: I know that I can use the apply() function, but it is demonstrably slower (~5x in my test) than what I have in mind. The apply() function is also not restricted to pandas functions, so I feel there must be a way to do this (otherwise, why have the pandas.Timestamp class at all, when datetime does the same things for single values?). See the code below for an example, in which I compare the pandas.to_datetime() function with applying the datetime.strptime() function to each row.
import pandas as pd
import datetime
from faker import Faker
import time
import copy
# Setting up fake dataframe:
Faker.seed(0)
fake = Faker()
observations=1000
dates=[fake.date_between(start_date=datetime.datetime(2020,1,1),end_date=datetime.datetime(2020,1,31)) for _ in range(observations)]
index=[x for x in range(observations)]
df=pd.DataFrame({'id' : index,'dates' : dates},columns=['id','dates'])
# Converting datetime object to string:
df['dates']=df['dates'].apply(lambda x: x.strftime('%Y-%m-%d'))
# Copy dataframe to run two time tests:
df2=copy.copy(df)
# Speed of the apply() function:
tic = time.perf_counter()
df['dates']=df['dates'].apply(lambda x: datetime.datetime.strptime(x,'%Y-%m-%d'))
toc = time.perf_counter()
print(f'pandas apply(lambda) completed in {toc-tic:0.4f} seconds')
# speed of the to_datetime() function:
tic = time.perf_counter()
df2['dates']=pd.to_datetime(df2['dates'],format='%Y-%m-%d')
toc = time.perf_counter()
print(f'pandas to_datetime() completed in {toc-tic:0.4f} seconds')
#Script returns:
#pandas apply(lambda) completed in 0.0107 seconds
#pandas to_datetime() completed in 0.0021 seconds
Answer 1:
Most time-related functions become available once the column has a datetime64[ns] dtype (which you get, for example, from date_range or to_datetime).
You can then use the dt accessor to apply datetime-like functions to the whole column efficiently:
df['dates'].dt.isocalendar()
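As a rough illustration of the dt accessor approach above, here is a minimal, self-contained sketch. The sample dates and the iso_week column name are made up for the example, and it assumes pandas 1.1 or later, where Series.dt.isocalendar() is available:

```python
import pandas as pd

# Hypothetical sample data: date strings like the question's 'dates' column.
df = pd.DataFrame(
    {"dates": ["2019-12-30", "2020-01-01", "2020-01-05", "2020-01-06"]}
)

# Parse the strings so the column gets a datetime64 dtype;
# the dt accessor only works on datetime-like columns.
df["dates"] = pd.to_datetime(df["dates"], format="%Y-%m-%d")

# Column-wide counterpart of calling Timestamp.isocalendar() on each row.
# The result is a DataFrame with 'year', 'week' and 'day' columns.
iso = df["dates"].dt.isocalendar()

# Attach the ISO week number to the original frame (column name is illustrative).
df["iso_week"] = iso["week"]
print(df)
```

If only one component is needed, selecting it from the result (for example iso["week"]) avoids any per-row apply() call.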
Source: https://stackoverflow.com/questions/65347513/how-do-i-efficiently-apply-pandas-timestamp-functions-to-a-full-dataframe-column