How to check for wrong datetime entries (python/pandas)?

杀马特。学长 韩版系。学妹 submitted on 2021-02-05 06:47:06

Question


I have an Excel dataset containing datetime values for hours worked, entered by employees. With the end of the year approaching, they want to report on it, but it is full of wrong entries, so I need to clean it.

Below are some examples of wrong entries.

What would be your approach when facing such datasets?

I first converted the date column to datetime using df['Shiftdatum'] = pd.to_datetime(df.Shiftdatum, format='%Y-%m-%d', errors='coerce')

In the sample data below, this produces a NaT:

How do I filter out these NaT's including the row's index?

[Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 NaT,
 Timestamp('2019-03-12 00:00:00')

Initial sample data:

{0: '2019-03-11 00:00:00',
 1: '2019-03-11 00:00:00',
 2: '2019-03-11 00:00:00',
 3: '2019-03-11 00:00:00',
 4: '2019-03-11 00:00:00',
 5: '2019-03-11 00:00:00',
 6: '2019-03-11 00:00:00',
 7: '2019-03-11 00:00:00',
 8: '2019-03-11 00:00:00',
 9: '2019-03-11 00:00:00',
 10: '2019-03-11 00:00:00',
 11: '2019-03-11 00:00:00',
 12: '2019-03-11 00:00:00',
 13: '2019-03-11 00:00:00',
 14: '2019-03-11 00:00:00',
 15: '2019-03-11 00:00:00',
 16: '33/11/2019',
 17: '2019-03-12 00:00:00',
 18: '2019-03-12 00:00:00',
 19: '2019-03-12 00:00:00'}

Answer 1:


IIUC,

you could handle this in a number of ways. One is to use pd.to_datetime(column, errors='coerce') and assign the result to a new column.

With the new column, you can then filter on NaT and get the unique invalid entries.

Let's say this was the result:

import pandas as pd

data = ['033-10-2019', '100-03-2019', '1003-03-2019', '03-10-2019']

df = pd.DataFrame({'date_time': data})
# Unparseable strings become NaT instead of raising an error
df['correct'] = pd.to_datetime(df['date_time'], errors='coerce')
print(df)
       date_time    correct
0   033-10-2019        NaT
1   100-03-2019        NaT
2  1003-03-2019        NaT
3    03-10-2019 2019-03-10

Next, grab the unique original strings in the date_time column whose parsed value is NaT:

errors = df.loc[df['correct'].isna(), 'date_time'].unique().tolist()
out: ['033-10-2019', '100-03-2019', '1003-03-2019']
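Since the question also asks for the row index of the bad entries, here is a minimal sketch that keeps it. The explicit format='%m-%d-%Y' is an assumption made for this illustration (it matches how '03-10-2019' is interpreted in the output above); the names bad, bad_entries and bad_index are hypothetical.

```python
import pandas as pd

data = ['033-10-2019', '100-03-2019', '1003-03-2019', '03-10-2019']
df = pd.DataFrame({'date_time': data})

# Assumed format (month-day-year); strings that do not match become NaT.
df['correct'] = pd.to_datetime(df['date_time'], format='%m-%d-%Y', errors='coerce')

# Boolean mask: True where parsing failed.
bad = df['correct'].isna()

# The offending raw strings, still labelled with their original row index.
bad_entries = df.loc[bad, 'date_time']
bad_index = bad_entries.index.tolist()

print(bad_entries)
print(bad_index)
```

Because df.loc with a boolean mask preserves the index, bad_entries can be written straight to a file or handed back to whoever has to correct the source spreadsheet.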

This is the boring bit: you'll need to go through the errors, work out the correct value for each, and put them into a dictionary:

correct_dict = {'033-10-2019' : '03-10-2019', '100-03-2019' : '03-10-2019', '1003-03-2019' : '10-03-2019'}

Then map the values back into your dataframe:

df['correct'] = df['correct'].fillna(pd.to_datetime(df['date_time'].map(correct_dict)))
print(df)
      date_time    correct
0   033-10-2019 2019-03-10
1   100-03-2019 2019-03-10
2  1003-03-2019 2019-10-03
3    03-10-2019 2019-03-10
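With a hand-written dictionary it is easy to miss an entry, so it may help to check coverage before and after mapping. The following is only a sketch under the same assumed format='%m-%d-%Y'; the names failed, missing, fixed and still_bad are hypothetical.

```python
import pandas as pd

data = ['033-10-2019', '100-03-2019', '1003-03-2019', '03-10-2019']
df = pd.DataFrame({'date_time': data})
df['correct'] = pd.to_datetime(df['date_time'], format='%m-%d-%Y', errors='coerce')

correct_dict = {'033-10-2019': '03-10-2019',
                '100-03-2019': '03-10-2019',
                '1003-03-2019': '10-03-2019'}

# Raw strings that failed to parse but have no correction yet.
failed = df.loc[df['correct'].isna(), 'date_time']
missing = sorted(set(failed) - set(correct_dict))

# Parse the corrected strings; rows without a correction stay NaT.
fixed = pd.to_datetime(df['date_time'].map(correct_dict),
                       format='%m-%d-%Y', errors='coerce')
df['correct'] = df['correct'].fillna(fixed)

# Rows that are still invalid after applying the corrections.
still_bad = df.index[df['correct'].isna()].tolist()

print(missing)
print(still_bad)
```

If either list is non-empty, the dictionary needs another entry or one of the corrected values is itself not a valid date.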

If you just want to remove the NaT values, you can use dropna with subset set to your column:

df = df.dropna(subset=['correct'])
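Note that dropna keeps the original index labels, so the removed rows can still be identified afterwards. A small sketch on made-up illustrative data (only '33/11/2019' is taken from the question; dropped_index and clean are hypothetical names):

```python
import pandas as pd

# Illustrative data, not the full dataset from the question.
df = pd.DataFrame({'date_time': ['2019-03-11', '33/11/2019', '2019-03-12']})
df['correct'] = pd.to_datetime(df['date_time'], format='%Y-%m-%d', errors='coerce')

# Record which rows are about to be dropped.
dropped_index = df.index[df['correct'].isna()].tolist()

# Drop them; the surviving rows keep their original labels.
clean = df.dropna(subset=['correct'])

print(dropped_index)
print(clean.index.tolist())
```

Call clean.reset_index(drop=True) afterwards if a gap-free index is preferred.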



Answer 2:


How do I filter out these NaT's including the row's index?

If the requirement is to find the invalid date entries, you can use Series.isna() after pd.to_datetime(), combined with Series.where():

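# d is the dictionary from the question
df = pd.DataFrame.from_dict(d, orient='index', columns=['Shiftdatum'])

# True where the value could not be parsed as a date
s = pd.to_datetime(df.Shiftdatum, format='%Y-%m-%d', errors='coerce').isna()
# Keep only the unparseable entries, with their original index
e = df.Shiftdatum.where(s).dropna()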

16    33/11/2019
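For reference, a self-contained variant of the same idea, using only rows 15 to 17 of the question's sample data. It swaps where().dropna() for boolean indexing with df.loc, and it assumes format='%Y-%m-%d %H:%M:%S' because the sample strings include a time part; depending on the pandas version, a date-only format may reject such strings, so treat the format as an assumption to adapt. The names parsed and invalid are hypothetical.

```python
import pandas as pd

# Subset of the question's sample data, keyed by the original row index.
d = {15: '2019-03-11 00:00:00',
     16: '33/11/2019',
     17: '2019-03-12 00:00:00'}
df = pd.DataFrame.from_dict(d, orient='index', columns=['Shiftdatum'])

# Assumed format including the time part; mismatches become NaT.
parsed = pd.to_datetime(df['Shiftdatum'], format='%Y-%m-%d %H:%M:%S',
                        errors='coerce')

# Select the raw values whose parse failed; the index is preserved.
invalid = df.loc[parsed.isna(), 'Shiftdatum']

print(invalid)
```

invalid.index then gives the row labels to look up in the original Excel sheet.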


Source: https://stackoverflow.com/questions/59236780/how-to-check-for-wrong-datetime-entries-python-pandas
