My problem is how to calculate frequencies on multiple variables in pandas . I have from this dataframe :
d1 = pd.DataFrame( {\'StudentID\': [\"x1\", \"x10
This:
d1.groupby('ExamenYear').agg({'Participated': len,
'Passed': lambda x: sum(x == 'yes')})
doesn't look way more awkward than the R solution, IMHO.
You may use pandas crosstab function, which by default computes a frequency table of two or more variables. For example,
> import pandas as pd
> pd.crosstab(d1['ExamenYear'], d1['Passed'])
Passed no yes
ExamenYear
2007 1 2
2008 1 3
2009 1 2
Use the margins=True
option if you also want to see the subtotal of each row and column.
> pd.crosstab(d1['ExamenYear'], d1['Participated'], margins=True)
Participated no yes All
ExamenYear
2007 1 2 3
2008 1 3 4
2009 0 3 3
All 2 8 10
There is another approach that I like to use for similar problems, it uses groupby
and unstack
:
d1 = pd.DataFrame({'StudentID': ["x1", "x10", "x2","x3", "x4", "x5", "x6", "x7", "x8", "x9"],
'StudentGender' : ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
'ExamenYear': ['2007','2007','2007','2008','2008','2008','2008','2009','2009','2009'],
'Exam': ['algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
'Participated': ['no','yes','yes','yes','no','yes','yes','yes','yes','yes'],
'Passed': ['no','yes','yes','yes','no','yes','yes','yes','no','yes']},
columns = ['StudentID', 'StudentGender', 'ExamenYear', 'Exam', 'Participated', 'Passed'])
(this is just the raw data from above)
d2 = d1.groupby("ExamenYear").Participated.value_counts().unstack(fill_value=0)['yes']
d3 = d1.groupby("ExamenYear").Passed.value_counts().unstack(fill_value=0)['yes']
d2.name = "Participated"
d3.name = "Passed"
pd.DataFrame(data=[d2,d3]).T
Participated Passed
ExamenYear
2007 2 2
2008 3 3
2009 3 2
This solution is slightly more cumbersome than the one above using apply, but this one is easier to understand and extend, I feel.
I finally decided to use apply.
I am posting what I came up with hoping that it can be useful for others.
From what I understand from Wes' book "Python for Data analysis"
Here is what I came up with
def ZahlOccurence_0(x):
return pd.Series({'All': len(x['StudentID']),
'Part': sum(x['Participated'] == 'yes'),
'Pass' : sum(x['Passed'] == 'yes')})
when I run it :
d1.groupby('ExamenYear').apply(ZahlOccurence_0)
I get the correct results
All Part Pass
ExamenYear
2007 3 2 2
2008 4 3 3
2009 3 3 2
This approach would also allow me to combine frequencies with other stats
import numpy as np
d1['testValue'] = np.random.randn(len(d1))
def ZahlOccurence_1(x):
return pd.Series({'All': len(x['StudentID']),
'Part': sum(x['Participated'] == 'yes'),
'Pass' : sum(x['Passed'] == 'yes'),
'test' : x['testValue'].mean()})
d1.groupby('ExamenYear').apply(ZahlOccurence_1)
All Part Pass test
ExamenYear
2007 3 2 2 0.358702
2008 4 3 3 1.004504
2009 3 3 2 0.521511
I hope someone else will find this useful