Question
I'm trying to implement a condition: if a date has more than 2 incorrect values (2019-05-17 and 2019-05-20 in the example below), then every row for that date (all of its time blocks) should be removed.
Input
t_value C/IC
2019-05-17 00:00:00 0 incorrect
2019-05-17 01:00:00 0 incorrect
2019-05-17 02:00:00 0 incorrect
2019-05-17 03:00:00 4 correct
2019-05-17 04:00:00 5 correct
2019-05-18 01:00:00 0 incorrect
2019-05-18 02:00:00 6 correct
2019-05-18 03:00:00 7 correct
2019-05-19 04:00:00 0 incorrect
2019-05-19 09:00:00 0 incorrect
2019-05-19 11:00:00 8 correct
2019-05-20 07:00:00 2 correct
2019-05-20 08:00:00 0 incorrect
2019-05-20 09:00:00 0 incorrect
2019-05-20 07:00:00 0 incorrect
Desired Output
t_value C/IC
2019-05-18 01:00:00 0 incorrect
2019-05-18 02:00:00 6 correct
2019-05-18 03:00:00 7 correct
2019-05-19 04:00:00 0 incorrect
2019-05-19 09:00:00 0 incorrect
2019-05-19 11:00:00 8 correct
I'm not sure which time-based operation to use to get the desired result. Thanks.
Answer 1:
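import numpy as np
import pandas as pd
from io import StringIO
#read in data (assumes `data` holds the input table above as a string)
df = pd.read_csv(StringIO(data), sep=r'\s{2,}', engine='python')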
#give index a name
df.index.name = 'Date'
#convert to datetime
#and sort index
#usually safer to sort datetime index in Pandas
df.index = pd.to_datetime(df.index)
df = df.sort_index()
res = (df
#group by date and c/ic
.groupby([pd.Grouper(freq='1D',level='Date'),"C/IC"])
.size()
#get rows greater than 2 and incorrect
.loc[lambda x: x>2,"incorrect"]
#keep only the date index
.droplevel(-1)
.index
#datetime information trapped here
#and due to grouping, it is different from initial datetime
#as such, we convert to string
#and build another batch of dates
.astype(str)
.tolist()
)
res
['2019-05-17', '2019-05-20']
#build a numpy array of dates
idx = np.array(res, dtype='datetime64')
#exclude dates in idx and get final value
#aim is to get dates, irrespective of time
df.loc[~np.isin(df.index.date,idx)]
t_value C/IC
Date
2019-05-18 01:00:00 0 incorrect
2019-05-18 02:00:00 6 correct
2019-05-18 03:00:00 7 correct
2019-05-19 04:00:00 0 incorrect
2019-05-19 09:00:00 0 incorrect
2019-05-19 11:00:00 8 correct
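The grouping, string conversion, and `np.isin` steps above can also be collapsed into a single `groupby(...).transform('sum')`. The following is only a minimal sketch of that alternative, assuming the data sits on a `DatetimeIndex` as in this answer. The sample frame is rebuilt by hand from the question, and the names `rows`, `is_incorrect`, `per_day_incorrect`, and `filtered` are made up for illustration.

```python
import pandas as pd

# Sample data rebuilt from the question's input table.
rows = [
    ("2019-05-17 00:00:00", 0, "incorrect"),
    ("2019-05-17 01:00:00", 0, "incorrect"),
    ("2019-05-17 02:00:00", 0, "incorrect"),
    ("2019-05-17 03:00:00", 4, "correct"),
    ("2019-05-17 04:00:00", 5, "correct"),
    ("2019-05-18 01:00:00", 0, "incorrect"),
    ("2019-05-18 02:00:00", 6, "correct"),
    ("2019-05-18 03:00:00", 7, "correct"),
    ("2019-05-19 04:00:00", 0, "incorrect"),
    ("2019-05-19 09:00:00", 0, "incorrect"),
    ("2019-05-19 11:00:00", 8, "correct"),
    ("2019-05-20 07:00:00", 2, "correct"),
    ("2019-05-20 08:00:00", 0, "incorrect"),
    ("2019-05-20 09:00:00", 0, "incorrect"),
    ("2019-05-20 07:00:00", 0, "incorrect"),
]
df = pd.DataFrame(rows, columns=["Date", "t_value", "C/IC"])
df["Date"] = pd.to_datetime(df["Date"])
df = df.set_index("Date").sort_index()

# 1 for every incorrect row, 0 otherwise.
is_incorrect = df["C/IC"].eq("incorrect").astype(int)

# Sum the flags per calendar day (normalize() sets the time to midnight)
# and broadcast each day's total back onto all rows of that day.
per_day_incorrect = is_incorrect.groupby(df.index.normalize()).transform("sum")

# Keep only rows whose day has at most 2 incorrect values.
filtered = df[per_day_incorrect <= 2]
print(filtered)
```

Because `transform` returns a result aligned with the original rows, the comparison can be used directly as a boolean mask, with no need to convert the dates to strings and back.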
Answer 2:
Misunderstood the question, sorry.
Updated answer: you can find the dates to be removed by the following:
# assumes df.index is a DatetimeIndex
df['_date'] = df.index.date
incorrect_df = df[df['C/IC'] == 'incorrect']
# count incorrect rows per date
incorrect_count = incorrect_df.groupby('_date')['C/IC'].count()
dates_to_remove = set(incorrect_count[incorrect_count > 2].index)
# using a set makes the later membership test more efficient if the df is long
Then mask the dataframe accordingly:
mask = [x not in dates_to_remove for x in df['_date']]
res = df[mask]
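The same idea can be wrapped in a small reusable function with a configurable threshold. This is a hedged sketch rather than a definitive implementation: `drop_days_over_threshold` and its parameters `label_col`, `bad_label`, and `max_bad` are hypothetical names, not part of pandas, and the function assumes `df` has a `DatetimeIndex`. It does not add a helper column to the frame.

```python
import pandas as pd


def drop_days_over_threshold(df, label_col="C/IC", bad_label="incorrect", max_bad=2):
    """Drop all rows of any calendar day with more than `max_bad` bad rows.

    Hypothetical helper; assumes `df` has a DatetimeIndex.
    """
    dates = df.index.date                         # array of datetime.date objects
    is_bad = (df[label_col] == bad_label).to_numpy()
    # Count bad rows per date.
    bad_counts = pd.Series(dates[is_bad]).value_counts()
    dates_to_remove = set(bad_counts[bad_counts > max_bad].index)
    # Keep rows whose date is not flagged for removal.
    keep = ~pd.Series(dates).isin(dates_to_remove).to_numpy()
    return df[keep]


# Sample data rebuilt from the question's input table.
rows = [
    ("2019-05-17 00:00:00", 0, "incorrect"),
    ("2019-05-17 01:00:00", 0, "incorrect"),
    ("2019-05-17 02:00:00", 0, "incorrect"),
    ("2019-05-17 03:00:00", 4, "correct"),
    ("2019-05-17 04:00:00", 5, "correct"),
    ("2019-05-18 01:00:00", 0, "incorrect"),
    ("2019-05-18 02:00:00", 6, "correct"),
    ("2019-05-18 03:00:00", 7, "correct"),
    ("2019-05-19 04:00:00", 0, "incorrect"),
    ("2019-05-19 09:00:00", 0, "incorrect"),
    ("2019-05-19 11:00:00", 8, "correct"),
    ("2019-05-20 07:00:00", 2, "correct"),
    ("2019-05-20 08:00:00", 0, "incorrect"),
    ("2019-05-20 09:00:00", 0, "incorrect"),
    ("2019-05-20 07:00:00", 0, "incorrect"),
]
df = pd.DataFrame(rows, columns=["Date", "t_value", "C/IC"])
df["Date"] = pd.to_datetime(df["Date"])
df = df.set_index("Date")

result = drop_days_over_threshold(df)
print(result)
```

Working on plain arrays instead of aligned Series sidesteps any index-alignment concerns caused by the duplicated `2019-05-20 07:00:00` timestamp in the sample data.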
Source: https://stackoverflow.com/questions/61862447/drop-dates-based-on-condition-in-python