Pandas: calculating columns conditioned on the values of 2 other columns

Question

I have the following abridged dataframe:

{'end': {0: 1995, 1: 1997, 2: 1999, 3: 2001, 4: 2003, 5: 2005, 6: 2007, 7: 2013, 8: 2014, 9: 1995, 10: 2007, 11: 2013, 12: 2014, 13: 1989,
  14: 1991, 15: 1993, 16: 1995, 17: 1997, 18: 1999, 19: 2001, 20: 2003,
  21: 2005, 22: 2007, 23: 2013, 24: 2014, 25: 1985, 26: 1987, 27: 1989, 28: 1991, 29: 1993}, 'idthomas': {0: 136, 1: 136, 2: 136, 3: 136, 4:136,
  5: 136, 6: 136, 7: 136, 8: 136, 9: 172, 10: 172, 11: 172, 12: 172,  13: 174, 14: 174, 15: 174, 16: 174, 17: 174, 18: 174, 19: 174, 20: 174, 21: 174, 22: 174, 23: 174, 24: 174, 25: 179, 26: 179, 27: 179, 28: 179,
  29: 179}, 'start': {0: 1993, 1: 1995, 2: 1997, 3: 1999, 4: 2001, 5: 2003, 6: 2005, 7: 2007, 8: 2013, 9: 1993, 10: 2001, 11: 2007, 12: 2013, 13: 1987, 14: 1989, 15: 1991, 16: 1993, 17: 1995, 18: 1997, 19: 1999, 20: 2001, 21: 2003, 22: 2005, 23: 2007, 24: 2013, 25: 1983, 26: 1985, 27: 1987, 28: 1989, 29: 1991}}


df_oddyears.head(15)
    end     start   idthomas
0   1995    1993    136
1   1997    1995    136
2   1999    1997    136
3   2001    1999    136
4   2003    2001    136
5   2005    2003    136
6   2007    2005    136
7   2013    2007    136
8   2014    2013    136
9   1995    1993    172
10  2007    2001    172
11  2013    2007    172
12  2014    2013    172
13  1989    1987    174
14  1991    1989    174

It represents U.S. legislators' congressional terms. There are some inconvenient irregularities: start and end indicate the beginning and end of a term and differ by 2 or 6 years depending on whether the legislator is serving in the House or the Senate. Every legislator has a unique idthomas and can switch from House to Senate if they choose. Sometimes a legislator is not re-elected, which causes a gap in their service. Looking at idthomas == 172 you can see a gap between end == 1995 and start == 2001.
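As a minimal sketch of how such gaps could be flagged, one option is to compare each term's start with the end of the same legislator's previous term. The subset below is hand-copied from the data above, and the names terms, prev_end, and gap are illustrative choices, not part of the original dataframe.

```python
import pandas as pd

# Hand-copied subset of the data above (two legislators only).
terms = pd.DataFrame({
    "idthomas": [136, 136, 172, 172, 172, 172],
    "start":    [1993, 1995, 1993, 2001, 2007, 2013],
    "end":      [1995, 1997, 1995, 2007, 2013, 2014],
})

terms = terms.sort_values(["idthomas", "start"]).reset_index(drop=True)

# End year of the same legislator's previous term (NaN for a first term).
prev_end = terms.groupby("idthomas")["end"].shift()

# Years out of office before each term; 0 for first or back-to-back terms.
terms["gap"] = (terms["start"] - prev_end).fillna(0).astype(int)

# Keep only the terms that follow a break in service.
gaps = terms[terms["gap"] > 0]
print(gaps)
```

With this subset, the only flagged term is the one for idthomas == 172 that starts in 2001, six years after the previous term ended.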

I need to calculate the accumulated years of active public service for every year from the beginning of a legislator's service until the end of that service, even years included. In a next step I will merge this df with another df on year, hence I need both even and odd years of active service.

This is what I had developed before seeing deeper into the problem:

# Strip the last six characters of each value, then convert what remains to int
df_oddyears['end'] = df_oddyears['end'].map(lambda x: str(x)[:-6])
df_oddyears['start'] = df_oddyears['start'].map(lambda x: str(x)[:-6])
df_oddyears['end'] = df_oddyears['end'].astype('int')
df_oddyears['start'] = df_oddyears['start'].astype('int')
# Cap end years at 2014 (clip_upper() was removed in pandas 1.0; use clip(upper=...))
df_oddyears['end'] = df_oddyears['end'].clip(upper=2014)
# Term length in years, and its running total per legislator
df_oddyears['term'] = df_oddyears['end'] - df_oddyears['start']
df_oddyears['years_exp'] = df_oddyears.groupby('id.thomas')['term'].cumsum()
df_oddyears.rename(columns={'id.thomas': 'idthomas'}, inplace=True)

df_oddyears.head(13)

    end     start   idthomas    term    years_exp
0   1995    1993    136           2     2
1   1997    1995    136           2     4
2   1999    1997    136           2     6
3   2001    1999    136           2     8
4   2003    2001    136           2     10
5   2005    2003    136           2     12
6   2007    2005    136           2     14
7   2013    2007    136           6     20
8   2014    2013    136           1     21
9   1995    1993    172           2     2
10  2007    2001    172           6     8
11  2013    2007    172           6     14
12  2014    2013    172           1     15

{'end': {0: 1995, 1: 1997, 2: 1999, 3: 2001, 4: 2003, 5: 2005, 6: 2007,
  7: 2013, 8: 2014, 9: 1995, 10: 2007, 11: 2013, 12: 2014, 13: 1989,
  14: 1991, 15: 1993, 16: 1995, 17: 1997, 18: 1999, 19: 2001, 20: 2003,
  21: 2005, 22: 2007, 23: 2013, 24: 2014, 25: 1985, 26: 1987, 27: 1989,
  28: 1991, 29: 1993}, 'idthomas': {0: 136, 1: 136, 2: 136, 3: 136,
  4: 136, 5: 136, 6: 136, 7: 136, 8: 136, 9: 172, 10: 172, 11: 172,  12: 172, 13: 174, 14: 174, 15: 174, 16: 174, 17: 174, 18: 174, 19: 174,
  20: 174, 21: 174, 22: 174, 23: 174, 24: 174, 25: 179, 26: 179, 27: 179, 28: 179, 29: 179},'start': {0: 1993, 1: 1995, 2: 1997, 3: 1999,  4: 2001, 5: 2003, 6: 2005, 7: 2007, 8: 2013, 9: 1993, 10: 2001, 11: 2007, 12: 2013, 13: 1987, 14: 1989, 15: 1991, 16: 1993, 17: 1995, 18: 1997, 19: 1999, 20: 2001, 21: 2003, 22: 2005, 23: 2007, 24: 2013, 25: 1983, 26: 1985, 27: 1987, 28: 1989, 29: 1991},'term': {0: 2, 1: 2, 2: 2,  3: 2, 4: 2, 5: 2, 6: 2, 7: 6, 8: 1, 9: 2, 10: 6, 11: 6, 12: 1, 13: 2, 14: 2, 15: 2, 16: 2, 17: 2, 18: 2, 19: 2, 20: 2, 21: 2, 22: 2, 23: 6,
  24: 1, 25: 2, 26: 2, 27: 2, 28: 2, 29: 2},'years_exp': {0: 2, 1: 4,
  2: 6, 3: 8, 4: 10, 5: 12, 6: 14, 7: 20, 8: 21, 9: 2, 10: 8, 11: 14,
  12: 15, 13: 2, 14: 4, 15: 6, 16: 8, 17: 10, 18: 12, 19: 14, 20: 16,
  21: 18, 22: 20, 23: 26, 24: 27, 25: 2, 26: 4, 27: 6, 28: 8, 29: 10}}

Then I drop the start and term columns with df = df_oddyears.drop(['start', 'term'], axis=1) and apply the following code, taken from a linked post:

import pandas as pd

final_year = 2014
# Take each legislator's first row, then count up one year at a time to final_year
df = pd.DataFrame([(year, id_, n)
                   for id_, end, years_exp in df.groupby('idthomas').first().itertuples()
                   for n, year in enumerate(range(end, final_year + 1), years_exp)],
                  columns=['end', 'idthomas', 'years_exp'])

df.head()

        end     idthomas    years_exp
673     1995    172     2
674     1996    172     3
675     1997    172     4
676     1998    172     5
677     1999    172     6
678     2000    172     7
679     2001    172     8
680     2002    172     9
681     2003    172     10

This is very close to what I want in that it lets me concatenate to another df on end while maintaining the total years_exp. Unfortunately, I failed to recognize the problem of intermittent service when posting my original question, so years_exp does not take gaps in public service into account. In the output above, for example, idthomas == 172 keeps accumulating experience through 1996 to 2000, although that legislator was out of office between end == 1995 and start == 2001. This is the first in a list of projects for today. Questions, suggestions, and critiques are all welcome.

My desired end result would be the following:

    end idthomas    years_exp
0   1994    136      1
1   1995    136      2
2   1996    136      3
3   1997    136      4
4   1998    136      5
5   1999    136      6
6   2000    136      7
7   2001    136      8
8   2002    136      9
9   2003    136      10
10  2004    136      11
11  2005    136      12
12  2006    136      13
13  2007    136      14
14  2008    136      15
15  2009    136      16
16  2010    136      17
17  2011    136      18
18  2012    136      19
19  2013    136      20
20  2014    136      21
21  1994    172      1
22  1995    172      2
23  2001    172      2
24  2002    172      3
25  2003    172      4
26  2004    172      5
27  2005    172      6
28  2006    172      7 
29  2007    172      8
30  2008    172      9
31  2009    172      10
32  2010    172      11
33  2011    172      12
34  2012    172      13
35  2013    172      14
36  2014    172      15
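
One possible way to approach this table, offered as a sketch under stated assumptions rather than a definitive solution: treat each term as covering the years start + 1 through end, emit one row per year served, and take a running count per legislator. The names terms and served are illustrative, and only the rows for idthomas == 172 are copied in to keep the example short. Note that under this convention the restart year 2001 gets no row of its own, unlike row 23 above; service resumes at 2002 with years_exp == 3.

```python
import pandas as pd

# Terms for idthomas == 172, hand-copied from the question's data.
terms = pd.DataFrame({
    "idthomas": [172, 172, 172, 172],
    "start":    [1993, 2001, 2007, 2013],
    "end":      [1995, 2007, 2013, 2014],
})

# Assumption: a term contributes the years start + 1 .. end (inclusive),
# so a 1993-1995 term yields 1994 and 1995.
rows = [
    (row.idthomas, year)
    for row in terms.itertuples(index=False)
    for year in range(row.start + 1, row.end + 1)
]

served = (
    pd.DataFrame(rows, columns=["idthomas", "end"])
    .drop_duplicates()                      # guard against overlapping terms
    .sort_values(["idthomas", "end"])
    .reset_index(drop=True)
)

# Running count of years served per legislator. Gap years have no row,
# so they add nothing to the total.
served["years_exp"] = served.groupby("idthomas").cumcount() + 1
served = served[["end", "idthomas", "years_exp"]]
print(served)
```

Because every served year is its own row, the result can be merged with another dataframe on end without any further reshaping. Whether a restart year such as 2001 should also appear, carrying the previous total, is a choice about the counting convention and would need an extra step.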

Source: https://stackoverflow.com/questions/36661846/pandas-calculating-columns-conditioned-on-the-values-of-2-other-columns

Tags

python

pandas

ipython