Drop dates based on condition in python

↘锁芯ラ submitted on 2020-06-01 06:57:05

Question


I'm trying to implement a condition: if the number of incorrect values on a date is greater than 2 (2019-05-17 and 2019-05-20 in the example below), then the complete date (all of its time blocks) is removed.

Input

                    t_value C/IC
2019-05-17 00:00:00   0     incorrect
2019-05-17 01:00:00   0     incorrect 
2019-05-17 02:00:00   0     incorrect 
2019-05-17 03:00:00   4     correct
2019-05-17 04:00:00   5     correct 
2019-05-18 01:00:00   0     incorrect   
2019-05-18 02:00:00   6     correct  
2019-05-18 03:00:00   7     correct 
2019-05-19 04:00:00   0     incorrect
2019-05-19 09:00:00   0    incorrect 
2019-05-19 11:00:00   8    correct
2019-05-20 07:00:00   2    correct
2019-05-20 08:00:00   0    incorrect
2019-05-20 09:00:00   0    incorrect
2019-05-20 07:00:00   0    incorrect 

Desired Output

                    t_value C/IC 
2019-05-18 01:00:00   0     incorrect   
2019-05-18 02:00:00   6     correct  
2019-05-18 03:00:00   7     correct 
2019-05-19 04:00:00   0     incorrect
2019-05-19 09:00:00   0    incorrect 
2019-05-19 11:00:00   8    correct

I'm not sure which time-based operation to perform to get the desired result. Thanks.


Answer 1:


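from io import StringIO

import numpy as np
import pandas as pd

#read in data (`data` is assumed to hold the input table above as a string)
df = pd.read_csv(StringIO(data),sep=r'\s{2,}', engine='python')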

#give index a name 
df.index.name = 'Date'
#convert to datetime 
#and sort index
#usually safer to sort datetime index in Pandas
df.index = pd.to_datetime(df.index)
df = df.sort_index()

res = (df
       #group by date and c/ic
       .groupby([pd.Grouper(freq='1D',level='Date'),"C/IC"])
       .size()
       #get rows greater than 2 and incorrect
       .loc[lambda x: x>2,"incorrect"]
       #keep only the date index
       .droplevel(-1)
       .index
       #datetime information trapped here
       #and due to grouping, it is different from initial datetime
       #as such, we convert to string 
       #and build another batch of dates
       .astype(str)
       .tolist()
      )

res
['2019-05-17', '2019-05-20']

#build a numpy array of dates
idx = np.array(res, dtype='datetime64')

#exclude dates in idx and get final value
#aim is to get dates, irrespective of time

df.loc[~np.isin(df.index.date,idx)]

                     t_value    C/IC
Date        
2019-05-18 01:00:00     0   incorrect
2019-05-18 02:00:00     6   correct
2019-05-18 03:00:00     7   correct
2019-05-19 04:00:00     0   incorrect
2019-05-19 09:00:00     0   incorrect
2019-05-19 11:00:00     8   correct
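
As a more compact variation on the same idea (group rows by calendar day, count the incorrect ones, drop the offending days), the sketch below uses groupby(...).transform("sum") to broadcast each day's incorrect count back onto its rows. It is not the answer author's code: the `rows` list simply re-enters the question's input table, and the variable names are illustrative.

```python
import pandas as pd

# Rows re-entered from the question's input table.
rows = [
    ("2019-05-17 00:00:00", 0, "incorrect"),
    ("2019-05-17 01:00:00", 0, "incorrect"),
    ("2019-05-17 02:00:00", 0, "incorrect"),
    ("2019-05-17 03:00:00", 4, "correct"),
    ("2019-05-17 04:00:00", 5, "correct"),
    ("2019-05-18 01:00:00", 0, "incorrect"),
    ("2019-05-18 02:00:00", 6, "correct"),
    ("2019-05-18 03:00:00", 7, "correct"),
    ("2019-05-19 04:00:00", 0, "incorrect"),
    ("2019-05-19 09:00:00", 0, "incorrect"),
    ("2019-05-19 11:00:00", 8, "correct"),
    ("2019-05-20 07:00:00", 2, "correct"),
    ("2019-05-20 08:00:00", 0, "incorrect"),
    ("2019-05-20 09:00:00", 0, "incorrect"),
    ("2019-05-20 07:00:00", 0, "incorrect"),
]
df = pd.DataFrame(rows, columns=["Date", "t_value", "C/IC"])
df["Date"] = pd.to_datetime(df["Date"])
df = df.set_index("Date")

# True for every row labelled "incorrect".
is_incorrect = df["C/IC"].eq("incorrect")

# Sum the flags per calendar day (index with the time set to midnight)
# and broadcast that per-day count back to each row of the day.
incorrect_per_day = is_incorrect.groupby(df.index.normalize()).transform("sum")

# Keep only rows whose day has at most 2 incorrect values.
# A plain NumPy mask avoids label alignment on the duplicated timestamp.
keep = (incorrect_per_day <= 2).to_numpy()
result = df[keep]

print(result)  # 6 rows remain: all of 2019-05-18 and 2019-05-19
```

Because the count is attached to every row, no intermediate list of dates has to be built or converted back to datetimes.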



Answer 2:


Misunderstood the question, sorry.

Updated answer: you can find the dates to be removed by the following:

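# assumes df has a DatetimeIndex (a DatetimeIndex exposes .date directly, without .dt)
df['_date'] = df.index.date
incorrect_df = df[df['C/IC'] == 'incorrect']
# count the incorrect rows per date; the result is a Series indexed by date
incorrect_count = incorrect_df.groupby('_date')['C/IC'].count()
# using set to make the later step more efficient if the df is long
dates_to_remove = set(incorrect_count[incorrect_count > 2].index)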

Then mask the dataframe accordingly:

mask = [x not in dates_to_remove for x in df['_date']]
res = df[mask]
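
The steps above can be sketched as a small reusable function. This is a hedged illustration, not the answer author's code: the helper name `drop_days_with_many_incorrect` and its `threshold` and `label_col` parameters are hypothetical, and the sample frame reuses only the 2019-05-17 and 2019-05-18 rows from the question.

```python
import pandas as pd


def drop_days_with_many_incorrect(df, threshold=2, label_col="C/IC"):
    """Return df without the calendar days that have more than
    `threshold` rows labelled "incorrect".

    Hypothetical helper for illustration; assumes df has a DatetimeIndex.
    """
    # Calendar date of every row, kept separate so df is not modified.
    dates = pd.Series(df.index.date, index=df.index)
    # Dates of the incorrect rows only, then how often each date occurs.
    incorrect_dates = dates[(df[label_col] == "incorrect").to_numpy()]
    counts = incorrect_dates.value_counts()
    # A set gives fast membership tests, as in the answer above.
    dates_to_remove = set(counts[counts > threshold].index)
    keep = ~dates.isin(dates_to_remove).to_numpy()
    return df[keep]


# Sample data: the 2019-05-17 and 2019-05-18 rows from the question.
index = pd.to_datetime([
    "2019-05-17 00:00:00", "2019-05-17 01:00:00", "2019-05-17 02:00:00",
    "2019-05-17 03:00:00", "2019-05-17 04:00:00",
    "2019-05-18 01:00:00", "2019-05-18 02:00:00", "2019-05-18 03:00:00",
])
df = pd.DataFrame(
    {
        "t_value": [0, 0, 0, 4, 5, 0, 6, 7],
        "C/IC": ["incorrect", "incorrect", "incorrect", "correct", "correct",
                 "incorrect", "correct", "correct"],
    },
    index=index,
)

result = drop_days_with_many_incorrect(df)
print(result)  # only the three 2019-05-18 rows remain
```

Keeping the dates in a separate Series rather than a `_date` column leaves the caller's frame unchanged, which is a design choice of this sketch rather than something the answer requires.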


Source: https://stackoverflow.com/questions/61862447/drop-dates-based-on-condition-in-python
