Check whether all dates are present in a year in pandas python

Submitted by 无人久伴 on 2021-01-29 08:24:37

Question


I have a data column like the one below, in which some dates are missing.

obstime
2012-01-01
2012-01-02
2012-01-03
2012-01-04
....
2016-12-28
2016-12-29
2016-12-30
2016-12-31

I want to check, for each month of every available year, whether all dates are present, as in the following image.


Answer 1:


Use:

import numpy as np
import pandas as pd

#sample data
df = pd.DataFrame({'obstime':pd.date_range('2012-01-01', '2016-12-31')})
removed = ['2013-09-01', '2013-09-02', '2013-09-03','2014-10-09','2016-12-30']
removed1 = pd.date_range('2016-12-16', '2016-12-22')
removed2 = pd.date_range('2016-10-10', '2016-12-03')

df = df[~df['obstime'].isin(pd.to_datetime(removed).append(removed1).append(removed2))]
#print (df)

#reindex to the full date range so that missing dates become NaT rows
df1 = df.set_index('obstime', drop=False).reindex(pd.date_range('2012-01-01', '2016-12-31'))

#mask the first and last day of each missing run, plus missing month starts and month ends
m = df1['obstime'].isnull()
start_NaT = m.ne(m.shift())
end_NaT = m.ne(m.shift(-1))
start_months = df1.index.day == 1
end_months = df1.index.isin(df1.index + pd.offsets.MonthEnd(0))
mask = (start_NaT | end_NaT | start_months | end_months) & m

#mask for runs of length 1 (isolated missing days once combined with mask)
s = start_NaT.cumsum()
m1 = s.map(s.value_counts()) == 1

#pair up the start and end day of each longer run and join them with -
df2 = df1[mask & ~m1].reset_index().rename(columns={'index':'date'})
df2['day'] = df2['date'].dt.day.astype(str)
df2 = df2.groupby(np.arange(len(df2.index)) // 2).agg({'date':'first', 'day':'-'.join})

#separate days
df3 = df1[mask & m1].copy()
df3['day'] = df3.index.day.astype(str)

#join together
df3 = pd.concat([df2.set_index('date'), df3])

#join the entries of each month with a comma; fill absent months and years with 'yes'
df4 = (df3.groupby([df3.index.month, df3.index.year])['day']
          .agg(','.join)
          .unstack(fill_value='yes')
          .reindex(index=range(1, 13), columns=range(2008, 2017),fill_value='yes'))

print (df4)
   2008 2009 2010 2011 2012 2013 2014 2015          2016
1   yes  yes  yes  yes  yes  yes  yes  yes           yes
2   yes  yes  yes  yes  yes  yes  yes  yes           yes
3   yes  yes  yes  yes  yes  yes  yes  yes           yes
4   yes  yes  yes  yes  yes  yes  yes  yes           yes
5   yes  yes  yes  yes  yes  yes  yes  yes           yes
6   yes  yes  yes  yes  yes  yes  yes  yes           yes
7   yes  yes  yes  yes  yes  yes  yes  yes           yes
8   yes  yes  yes  yes  yes  yes  yes  yes           yes
9   yes  yes  yes  yes  yes  1-3  yes  yes           yes
10  yes  yes  yes  yes  yes  yes    9  yes         10-31
11  yes  yes  yes  yes  yes  yes  yes  yes          1-30
12  yes  yes  yes  yes  yes  yes  yes  yes  1-3,16-22,30
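
The run detection behind `start_NaT`, `s`, and `m1` can be looked at in isolation on a small boolean mask. This is only a sketch: the mask values and the names `run_id`, `run_len`, and `isolated` are made up for illustration and are not part of the answer.

```python
import pandas as pd

# Hypothetical mask: True marks a missing day
m = pd.Series([False, True, True, True, False, True, False])

# True wherever the value differs from the previous one, i.e. a run starts
start = m.ne(m.shift())
# A running count of run starts gives every run its own label
run_id = start.cumsum()
# Length of the run each element belongs to
run_len = run_id.map(run_id.value_counts())

# Missing runs of length 1 are isolated days; longer runs become "first-last"
isolated = m & (run_len == 1)

print(run_id.tolist())    # [1, 2, 2, 2, 3, 4, 5]
print(isolated.tolist())  # [False, False, False, False, False, True, False]
```

Here the three consecutive missing days share the label 2 and would be rendered as a range, while the single missing day with label 4 is reported on its own.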



Answer 2:


My solution uses only pandas, with no database involved.

The idea is to reindex the source DataFrame using a "full" index (all dates in the year range). For this test I used dates from the years 2016 and 2017.

Then we keep only the rows just added by the reindex, that is, the dates with no measurement.
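
A minimal sketch of these two steps on a tiny range may help. The five-day range, the `result` values, and the names `present`, `expanded`, and `missing` are illustrative assumptions, not the data used in the script further down.

```python
import pandas as pd

# Hypothetical sample: 2016-01-03 and 2016-01-04 have no measurement
present = pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-05'])
df = pd.DataFrame({'result': [11, 12, 15]}, index=present)

# Reindex against the full range; dates that were absent get NaN
full = pd.date_range('2016-01-01', '2016-01-05')
expanded = df.reindex(full)

# Keep only the rows added by the reindex, i.e. the missing dates
missing = expanded[expanded['result'].isna()].index

print(missing.strftime('%Y-%m-%d').tolist())  # ['2016-01-03', '2016-01-04']
```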

The remaining operations are:

  • Group by month and apply a function that generates day ranges.
  • Convert to a DataFrame with the year and month extracted from the date.
  • Pivot the DataFrame (month as index, year as columns).
  • Add month names and set them as the index.

So the whole script can be as follows:

import pandas as pd
import numpy as np
import calendar

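# Function to be applied to date groups for each month
def fun(x):
    dt = x.result  # missing dates of this month
    day = pd.Timedelta('1d')
    # A run starts where the gap to the previous missing date is not one day
    startDates = dt[dt.diff() != day]
    if startDates.size > 0:
        # A run ends where the gap to the next missing date is not one day
        endDates = dt[(dt - dt.shift(-1)).abs() != day]
        return '&'.join(f'{s.day}-{e.day}' for s, e in zip(startDates, endDates))
    else:
        return 'OK'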

# Source dates
dates = pd.date_range('2016-01-01', '2016-01-13')\
    .append(pd.date_range('2016-01-20', '2016-01-29'))\
    .append(pd.date_range('2016-02-10', '2016-02-20'))\
    .append(pd.date_range('2016-03-11', '2017-11-20'))\
    .append(pd.date_range('2017-11-25', '2017-12-31'))
# Source DataFrame with random results for dates given
df = pd.DataFrame(data={ 'result': np.random.randint(10, 30, len(dates))},
    index=dates)
# Index for full range of dates
idxFull = pd.date_range('2016-01-01', '2017-12-31')
# "Expand" to all dates
df2 = df.reindex(idxFull)
# Leave only "empty" rows
df2.drop(df2[df2.result.notna()].index, inplace=True)
# Copy index to result
df2.result = df2.index
# Group by months ('M' is deprecated since pandas 2.2; use 'ME' there)
gr = df2.groupby(pd.Grouper(freq='M'))
# Result - Series
res = gr.apply(fun)
# Result - DataFrame with year/month "extracted" from date
res2 = pd.DataFrame(data={'res': res, 'year': res.index.year,
    'month': res.index.month })
# Result - pivot'ed res2
res3 = res2.pivot(index='month', columns='year').fillna('OK')
# Add month names
res3['MonthName'] = list(calendar.month_name)[1:]
# Set month names as index
res3.set_index('MonthName', inplace=True)

When you print(res3), the result is:

                   res       
year              2016   2017
MonthName                    
January    14-19&30-31     OK
February     1-9&21-29     OK
March             1-10     OK
April               OK     OK
May                 OK     OK
June                OK     OK
July                OK     OK
August              OK     OK
September           OK     OK
October             OK     OK
November            OK  21-24
December            OK     OK
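
As a rough cross-check of a table like this one, the number of distinct days present in each month can be compared with the month's calendar length. The sketch below rebuilds the same source dates. The names `present`, `counts`, `lengths`, and `incomplete` are illustrative and not part of the answer, and the check assumes that every month has at least one date, since a month with no observations at all would not show up in `counts`.

```python
import pandas as pd

# Same source dates as in the script above
dates = (pd.date_range('2016-01-01', '2016-01-13')
         .append(pd.date_range('2016-01-20', '2016-01-29'))
         .append(pd.date_range('2016-02-10', '2016-02-20'))
         .append(pd.date_range('2016-03-11', '2017-11-20'))
         .append(pd.date_range('2017-11-25', '2017-12-31')))

keys = [dates.year, dates.month]

# Distinct days observed per (year, month)
present = pd.Series(dates.day, index=dates)
counts = present.groupby(keys).nunique()

# Calendar length of each (year, month)
lengths = pd.Series(dates.days_in_month, index=dates).groupby(keys).first()

# Months in which fewer days were observed than the calendar has
incomplete = counts[counts < lengths]
print(incomplete.index.tolist())
```

The months listed should be the same ones that show day ranges instead of OK in the table.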


Source: https://stackoverflow.com/questions/54061029/check-whether-all-dates-are-present-in-a-year-in-pandas-python
