问题
My data is absence records from a factory. Some days there are no absences so there is no data or date recorded for that day. However, and where this gets hairy with the other examples shown, is on any given day there can be several absences for various reasons. There is not always a 1 to 1 ratio of date-to-record in the data.
The result I'm hoping for is something like this:
(index) Shift Description Instances (SUM)
01-01-14 2nd Baker Discipline 0
01-01-14 2nd Baker Vacation 0
01-01-14 1st Cooks Discipline 0
01-01-14 1st Cooks Vacation 0
01-02-14 2nd Baker Discipline 4
01-02-14 2nd Baker Vacation 3
01-02-14 1st Cooks Discipline 3
01-02-14 1st Cooks Vacation 3
And so on. The idea is all shifts and descriptions will have values for all days in the time period (in this example 1/1/2014 - 12/31/2014)
I've read several examples and the closest I've come to getting this working is here.
ts = pd.read_csv('Absentee_Data_2.csv'
, encoding = 'utf-8'
,parse_dates=[3]
,index_col=3
,dayfirst=True
)
idx = pd.date_range('01.01.2009', '12.31.2017')
ts.index = pd.DatetimeIndex(ts.index)
# ts = ts.reindex(idx, fill_value='NaN')
df = pd.DataFrame(index = idx)
df1 = df.join(ts, how='left')
But, when I uncomment the ts = ts.reindex(idx, fill_value='NaN')
I get error messages. I've tried at least 10 other ways to accomplish what I'm trying to do so I'm not 100% sure this is the right path, but it seems to have gotten me closest to any kind of progress.
Here's some sample data:
Description Unexcused Instances Date Shift
Discipline FALSE 1 Jan 2 2014 2nd Baker
Vacation TRUE 2 Jan 2 2014 1st Cooks
Discipline FALSE 3 Jan 2 2014 2nd Baker
Vacation TRUE 1 Jan 2 2014 1st Cooks
Discipline FALSE 2 Apr 8 2014 2nd Baker
Vacation TRUE 3 Apr 8 2014 1st Cooks
Discipline FALSE 1 Jun 1 2014 2nd Baker
Vacation TRUE 2 Jun 1 2014 1st Cooks
Discipline FALSE 3 Jun 1 2014 2nd Baker
Vacation TRUE 1 Jun 1 2014 1st Cooks
Vacation TRUE 2 Jul 5 2014 1st Cooks
Discipline FALSE 3 Jul 5 2014 2nd Baker
Vacation TRUE 2 Dec 3 2014 1st Cooks
Thank you in advance for you help, I'm a newbie and 2 days into this without much progress. I really appreciate how people here help with answers but most importantly instruction on why the solutions work. Newbies like me are very grateful for the wisdom shared.
回答1:
I think you just have a problem with the use of datetime, this approach worked for me
ts.set_index(['Date'],inplace=True)
ts.index = pd.to_datetime(ts.index,format='%b %d %Y')
d2 = pd.DataFrame(index=pd.date_range('2014-01-01','2014-12-31'))
print ts.join(d2,how='right')
回答2:
Actually you were pretty close of what you wanted (assuming I understood correctly the output you seem to be looking for). See my additions to your code above:
import pandas as pd
ts = pd.read_csv('Absentee_Data_2.csv', encoding = 'utf-8',parse_dates=[3],index_col=3,dayfirst=True, sep=",")
idx = pd.date_range('01.01.2009', '12.31.2017')
ts.index = pd.DatetimeIndex(ts.index)
#ts = ts.reindex(idx, fill_value='NaN')
df = pd.DataFrame(index = idx)
df1 = df.join(ts, how='left')
df2 = df1.copy()
df3 = df1.copy()
df4 = df1.copy()
dict1 = {'Description': 'Discipline', 'Instances': 0, 'Shift': '1st Cooks'}
df1 = df1.fillna(dict1)
dict1["Description"] = "Vacation"
df2 = df2.fillna(dict1)
dict1["Shift"] = "2nd Baker"
df3 = df3.fillna(dict1)
dict1["Description"] = "Discipline"
df4 = df4.fillna(dict1)
df_with_duplicates = pd.concat([df1,df2,df3,df4])
final_res = df_with_duplicates.reset_index().drop_duplicates(subset=["index"] + list(dict1.keys())).set_index("index").drop("Unexcused", axis=1)
Basically what you'd add:
- Copy 4 times the almost empty df created with
ts
(df1
) fillna(dict1)
allows to fill with static values all the NaN in the columns- Concatenate the 4 dfs, we still need to remove some duplicates as the original values from the csv are duplicated 4 times
- Drop the duplicates, we need the index to keep the values added, thus the
reset_index
followed by the `set_index("index") - Finally drop the Unexcused column
Finally a few output:
In [5]: final_res["2013-01-2"]
Out[5]:
Description Instances Shift
index
2013-01-02 Discipline 0.0 1st Cooks
2013-01-02 Vacation 0.0 1st Cooks
2013-01-02 Vacation 0.0 2nd Baker
2013-01-02 Discipline 0.0 2nd Baker
In [6]: final_res["2014-01-2"]
Out[6]:
Description Instances Shift
index
2014-01-02 Discipline 1.0 2nd Baker
2014-01-02 Vacation 2.0 1st Cooks
2014-01-02 Discipline 3.0 2nd Baker
2014-01-02 Vacation 1.0 1st Cooks
1
来源:https://stackoverflow.com/questions/45506038/pandas-reindex-to-fill-missing-dates-or-better-method-to-fill