Question
import pandas as pd

df2 = pd.DataFrame({'person_id': [11, 11, 11, 11, 11, 12, 12, 13, 13, 14, 14, 14, 14],
                    'admit_date': ['01/01/2011', '01/01/2009', '12/31/2013', '12/31/2017', '04/03/2014', '08/04/2016',
                                   '03/05/2014', '02/07/2011', '08/08/2016', '12/31/2017', '05/01/2011', '05/21/2014', '07/12/2016']})
# Reshape to long format: one row per (person_id, variable, dates)
df2 = df2.melt('person_id', value_name='dates')
df2['dates'] = pd.to_datetime(df2['dates'])
What I would like to do is:
a) Exclude/filter out records from the data frame if a subject has both Dec 31st and Jan 1st in their records. Please note that the year doesn't matter.
If a subject has either Dec 31st or Jan 1st, we leave them as is.
But if they have both Dec 31st and Jan 1st, we remove one of them (either Dec 31st or Jan 1st). Note that they could also have multiple entries with the same date, like person_id = 11.
So far I could only do the following:
df2_new = df2['dates'] != '2017-12-31'  # but this also excludes a subject who has only Dec 31st, 2017
df2[df2_new]
How can I compare the dates while ignoring the year?
My expected output is as follows:
For person_id = 11, we drop 12-31 because they have both 12-31 and 01-01 in their records, whereas for person_id = 14, we don't drop 12-31 because they have only 12-31 in their records.
We drop 12-31 only when both 12-31 and 01-01 appear in a person's records.
Answer 1:
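Use:
import numpy as np

# Month-day string for every row, so the year is ignored
s = df2['dates'].dt.strftime('%m-%d')
# Per-person flags: does this person have any Jan 1st / any Dec 31st record?
m1 = s.eq('01-01').groupby(df2['person_id']).transform('any')
m2 = s.eq('12-31').groupby(df2['person_id']).transform('any')
# If a person has both, keep only the rows that are not Dec 31st; otherwise keep every row
m3 = np.select([m1 & m2, m1 | m2], [s.ne('12-31'), True], default=True)
df3 = df2[m3]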
Result:
# print(df3)
    person_id    variable      dates
0          11  admit_date 2011-01-01
1          11  admit_date 2009-01-01
4          11  admit_date 2014-04-03
5          12  admit_date 2016-08-04
6          12  admit_date 2014-03-05
7          13  admit_date 2011-02-07
8          13  admit_date 2016-08-08
9          14  admit_date 2017-12-31
10         14  admit_date 2011-05-01
11         14  admit_date 2014-05-21
12         14  admit_date 2016-07-12
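The same rule can also be written as a single boolean expression. The following is a minimal sketch, not the answerer's code: it assumes the sample data from the question, compares the `dt.month` and `dt.day` components instead of formatted strings, and uses the illustrative names `is_jan1`, `is_dec31`, and `has_jan1`, which do not appear in the original thread.

```python
import pandas as pd

# Same sample data as in the question.
df2 = pd.DataFrame({
    'person_id': [11, 11, 11, 11, 11, 12, 12, 13, 13, 14, 14, 14, 14],
    'admit_date': ['01/01/2011', '01/01/2009', '12/31/2013', '12/31/2017',
                   '04/03/2014', '08/04/2016', '03/05/2014', '02/07/2011',
                   '08/08/2016', '12/31/2017', '05/01/2011', '05/21/2014',
                   '07/12/2016'],
})
df2 = df2.melt('person_id', value_name='dates')
df2['dates'] = pd.to_datetime(df2['dates'], format='%m/%d/%Y')

# Year-independent flags built from the month and day components.
is_jan1 = (df2['dates'].dt.month == 1) & (df2['dates'].dt.day == 1)
is_dec31 = (df2['dates'].dt.month == 12) & (df2['dates'].dt.day == 31)

# True on every row of a person who has at least one Jan 1st record.
has_jan1 = is_jan1.groupby(df2['person_id']).transform('any')

# Drop a Dec 31st row only when the same person also has a Jan 1st row.
df3 = df2[~(is_dec31 & has_jan1)]
print(df3)
```

On the sample data this should keep the same rows as the result shown above: person 11 loses both Dec 31st rows, while person 14 keeps theirs.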
Answer 2:
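Another way:
Coerce the date to day and month.
Create a temp column where 31st Dec is converted to 1st Jan.
Drop duplicates by person_id, variable, and the temp column, keeping the first.
Note that this replaces the dates with day-month strings and also drops any other rows that repeat the same day and month for the same person.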
import numpy as np
df2['dates'] = df2['dates'].dt.strftime('%d %b')
df2 = (df2.assign(check=np.where(df2.dates == '31 Dec', '01 Jan', df2.dates))
          .drop_duplicates(['person_id', 'variable', 'check'], keep='first')
          .drop(columns=['check']))
Result (shown with the temporary check column, that is, before the final .drop(columns=['check'])):
    person_id    variable   dates   check
0          11  admit_date  01 Jan  01 Jan
4          11  admit_date  03 Apr  03 Apr
5          12  admit_date  04 Aug  04 Aug
6          12  admit_date  05 Mar  05 Mar
7          13  admit_date  07 Feb  07 Feb
8          13  admit_date  08 Aug  08 Aug
9          14  admit_date  31 Dec  01 Jan
10         14  admit_date  01 May  01 May
11         14  admit_date  21 May  21 May
12         14  admit_date  12 Jul  12 Jul
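If the original datetime values should survive, the same idea can be applied through a mask instead of overwriting the column. This is a hedged sketch under stated assumptions rather than the answerer's code: it uses a `'%m-%d'` key instead of `'%d %b'` (to avoid depending on locale-specific month names), and the names `day_month`, `keep`, and `df4` are illustrative only.

```python
import numpy as np
import pandas as pd

# Same sample data as in the question.
df2 = pd.DataFrame({
    'person_id': [11, 11, 11, 11, 11, 12, 12, 13, 13, 14, 14, 14, 14],
    'admit_date': ['01/01/2011', '01/01/2009', '12/31/2013', '12/31/2017',
                   '04/03/2014', '08/04/2016', '03/05/2014', '02/07/2011',
                   '08/08/2016', '12/31/2017', '05/01/2011', '05/21/2014',
                   '07/12/2016'],
})
df2 = df2.melt('person_id', value_name='dates')
df2['dates'] = pd.to_datetime(df2['dates'], format='%m/%d/%Y')

# Year-independent key; Dec 31st is mapped onto Jan 1st so the two collide.
day_month = df2['dates'].dt.strftime('%m-%d')
check = np.where(day_month == '12-31', '01-01', day_month)

# Mark the first row per (person_id, variable, check) and keep only those rows.
keep = ~df2.assign(check=check).duplicated(['person_id', 'variable', 'check'],
                                           keep='first')
df4 = df2[keep]  # the dates column is still datetime64
print(df4)
```

As with the code above, this keeps whichever of Jan 1st or Dec 31st comes first for a person and also removes repeated day-month values, so person 11's second Jan 1st row (index 1) is dropped as well.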
Source: https://stackoverflow.com/questions/62635778/exclude-a-specific-date-based-on-a-condition-using-pandas