Question
I have an Excel dataset containing datetime values of hours worked, entered by employees. Now that the end of the year is near, they want to report on it, but it is full of wrong entries, so I need to clean it.
Below are some examples of wrong entries.
What would be your approach when facing such datasets?
I first converted the date column to datetime using df['Shiftdatum'] = pd.to_datetime(df.Shiftdatum, format='%Y-%m-%d', errors='coerce')
In the sample data below, the wrong entry shows up as NaT.
How do I filter out these NaT values, including the row's index?
[Timestamp('2019-03-11 00:00:00'),
Timestamp('2019-03-11 00:00:00'),
Timestamp('2019-03-11 00:00:00'),
Timestamp('2019-03-11 00:00:00'),
Timestamp('2019-03-11 00:00:00'),
Timestamp('2019-03-11 00:00:00'),
Timestamp('2019-03-11 00:00:00'),
Timestamp('2019-03-11 00:00:00'),
Timestamp('2019-03-11 00:00:00'),
Timestamp('2019-03-11 00:00:00'),
Timestamp('2019-03-11 00:00:00'),
Timestamp('2019-03-11 00:00:00'),
Timestamp('2019-03-11 00:00:00'),
Timestamp('2019-03-11 00:00:00'),
Timestamp('2019-03-11 00:00:00'),
Timestamp('2019-03-11 00:00:00'),
NaT,
Timestamp('2019-03-12 00:00:00')
Initial sample data:
{0: '2019-03-11 00:00:00',
1: '2019-03-11 00:00:00',
2: '2019-03-11 00:00:00',
3: '2019-03-11 00:00:00',
4: '2019-03-11 00:00:00',
5: '2019-03-11 00:00:00',
6: '2019-03-11 00:00:00',
7: '2019-03-11 00:00:00',
8: '2019-03-11 00:00:00',
9: '2019-03-11 00:00:00',
10: '2019-03-11 00:00:00',
11: '2019-03-11 00:00:00',
12: '2019-03-11 00:00:00',
13: '2019-03-11 00:00:00',
14: '2019-03-11 00:00:00',
15: '2019-03-11 00:00:00',
16: '33/11/2019',
17: '2019-03-12 00:00:00',
18: '2019-03-12 00:00:00',
19: '2019-03-12 00:00:00'}
Answer 1:
If I understand correctly, there are a number of ways to handle this. You could use pd.to_datetime(column, errors='coerce') and assign the result to a new column,
then filter that new column for NaT and collect the unique original values that failed to parse.
Let's say this was the data:
import pandas as pd
data = ['033-10-2019', '100-03-2019', '1003-03-2019', '03-10-2019']
df = pd.DataFrame({'date_time': data})
# entries that cannot be parsed become NaT instead of raising an error
df['correct'] = pd.to_datetime(df['date_time'], errors='coerce')
print(df)
      date_time    correct
0   033-10-2019        NaT
1   100-03-2019        NaT
2  1003-03-2019        NaT
3    03-10-2019 2019-03-10
Now grab the unique date_time values whose parsed result is NaT:
errors = df.loc[df['correct'].isna(), 'date_time'].unique().tolist()
out : ['033-10-2019', '100-03-2019', '1003-03-2019']
This is the boring bit: you'll need to go through the errors by hand and put the correct value for each one into a dictionary:
correct_dict = {'033-10-2019' : '03-10-2019', '100-03-2019' : '03-10-2019', '1003-03-2019' : '10-03-2019'}
Then map the corrected values back into your dataframe:
df['correct'] = df['correct'].fillna(pd.to_datetime(df['date_time'].map(correct_dict)))
print(df)
      date_time    correct
0   033-10-2019 2019-03-10
1   100-03-2019 2019-03-10
2  1003-03-2019 2019-10-03
3    03-10-2019 2019-03-10
If you just want to remove the rows with NaT values, you can use dropna with subset set to your column:
df = df.dropna(subset=['correct'])
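The steps in this answer can be collected into one small helper. This is only a minimal sketch: the function name parse_with_corrections and the explicit format '%m-%d-%Y' are assumptions made for illustration (the month-first format is inferred from the output above, where '03-10-2019' becomes 2019-03-10), not something taken from the original answer.

```python
import pandas as pd

# Assumption: dates are month-first, matching the parsed output shown above.
DATE_FORMAT = "%m-%d-%Y"


def parse_with_corrections(raw, corrections, fmt=DATE_FORMAT):
    """Parse date strings, falling back to manually corrected strings.

    raw         -- Series of date strings as entered
    corrections -- dict mapping a bad string to its hand-corrected string
    """
    # First pass: anything that does not match fmt becomes NaT.
    parsed = pd.to_datetime(raw, format=fmt, errors="coerce")
    # Second pass: parse the corrected strings (NaN where no correction exists).
    fixed = pd.to_datetime(raw.map(corrections), format=fmt, errors="coerce")
    # Fill the gaps of the first pass with the corrected values.
    return parsed.fillna(fixed)


raw = pd.Series(["033-10-2019", "100-03-2019", "1003-03-2019", "03-10-2019"])
corrections = {
    "033-10-2019": "03-10-2019",
    "100-03-2019": "03-10-2019",
    "1003-03-2019": "10-03-2019",
}

result = parse_with_corrections(raw, corrections)
# Entries that are still NaT have no (valid) correction yet.
still_bad = raw[result.isna()]
print(result)
print(still_bad.tolist())
```

Passing an explicit format keeps the parsing from guessing between day-first and month-first, and anything left in still_bad shows which entries still need an entry in the dictionary.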
Answer 2:
How do I filter out these NaT's including the row's index?
If the requirement is to find the invalid date entries, you can use Series.isna() after pd.to_datetime(), combined with Series.where():
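# d is the dictionary from the question
df = pd.DataFrame.from_dict(d, orient='index', columns=['Shiftdatum'])
# True where the value could not be parsed as a date
s = pd.to_datetime(df.Shiftdatum, format='%Y-%m-%d', errors='coerce').isna()
# keep only the original strings that failed to parse, with their index
e = df.Shiftdatum.where(s).dropna()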
16 33/11/2019
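To get the index labels of the bad rows directly, a boolean mask can be used on both the index and the rows. The following is a sketch under assumptions: the sample dictionary is rebuilt from the question, and the format '%Y-%m-%d %H:%M:%S' is chosen here because the sample strings include a time part (the question itself used '%Y-%m-%d', which stricter pandas versions may reject for these strings).

```python
import pandas as pd

# Rebuild the sample data from the question: one bad entry at index 16.
d = {i: "2019-03-11 00:00:00" for i in range(16)}
d[16] = "33/11/2019"
d.update({i: "2019-03-12 00:00:00" for i in range(17, 20)})

df = pd.DataFrame({"Shiftdatum": pd.Series(d)})

# Assumption: the valid entries all look like 'YYYY-MM-DD HH:MM:SS'.
parsed = pd.to_datetime(
    df["Shiftdatum"], format="%Y-%m-%d %H:%M:%S", errors="coerce"
)

invalid_mask = parsed.isna()                     # True where parsing failed
invalid_index = df.index[invalid_mask].tolist()  # index labels of bad rows
invalid_rows = df.loc[invalid_mask]              # original strings, for review

# Cleaned frame: parsed datetimes, with the bad rows dropped.
clean = df.assign(Shiftdatum=parsed).loc[~invalid_mask]

print(invalid_index)
print(invalid_rows)
```

Keeping invalid_rows separate from clean means the wrong entries can be handed back for correction instead of being silently discarded.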
Source: https://stackoverflow.com/questions/59236780/how-to-check-for-wrong-datetime-entries-python-pandas