I have a DataFrame that originates from a df.groupby().size() operation, and looks like this:
Localization                           RNA level
cytoplasm                              1 Non-expressed     7
                                       2 Very low         13
                                       3 Low               8
                                       4 Medium            6
                                       5 Moderate          8
                                       6 High              2
                                       7 Very high         6
cytoplasm & nucleus                    1 Non-expressed     5
                                       2 Very low          8
                                       3 Low               2
                                       4 Medium           10
                                       5 Moderate         16
                                       6 High              6
                                       7 Very high         5
cytoplasm & nucleus & plasma membrane  1 Non-expressed     6
                                       2 Very low          3
                                       3 Low               3
                                       4 Medium            7
                                       5 Moderate          8
                                       6 High              4
                                       7 Very high         1
What I want to do is express each count (i.e. the last column, which comes from .size()) as a percentage of the total count for its Localization.
For example: there are a total of 50 occurrences in the cytoplasm Localization (7 + 13 + 8 + 6 + 8 + 2 + 6), so the Non-expressed and Very low RNA levels come to 14 % (7 / 50) and 26 % (13 / 50), respectively.
Is there a nice way of doing this? I've been going about it in what I think is a very roundabout way, i.e. making a new DataFrame for every Localization and working on from there, but that takes a lot of lines and leaves me having to merge all the resulting DataFrames at the end. I'm hoping there's a smarter way of doing it, at least!
Here is a complete example based on the pandas groupby and sum functions.
The basic idea is to group the data by 'Localization' and apply a function to each group.
import pandas as pd
from io import StringIO
# For Python 2: from StringIO import StringIO
data = \
"""Localization,RNA level,Size
cytoplasm ,1 Non-expressed, 7
cytoplasm ,2 Very low ,13
cytoplasm ,3 Low , 8
cytoplasm ,4 Medium , 6
cytoplasm ,5 Moderate , 8
cytoplasm ,6 High , 2
cytoplasm ,7 Very high , 6
cytoplasm & nucleus ,1 Non-expressed, 5
cytoplasm & nucleus ,2 Very low , 8
cytoplasm & nucleus ,3 Low , 2
cytoplasm & nucleus ,4 Medium ,10
cytoplasm & nucleus ,5 Moderate ,16
cytoplasm & nucleus ,6 High , 6
cytoplasm & nucleus ,7 Very high , 5
cytoplasm & nucleus & plasma membrane,1 Non-expressed, 6
cytoplasm & nucleus & plasma membrane,2 Very low , 3
cytoplasm & nucleus & plasma membrane,3 Low , 3
cytoplasm & nucleus & plasma membrane,4 Medium , 7
cytoplasm & nucleus & plasma membrane,5 Moderate , 8
cytoplasm & nucleus & plasma membrane,6 High , 4
cytoplasm & nucleus & plasma membrane,7 Very high , 1"""
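# Create the dataframe
df = pd.read_csv(StringIO(data))
# str.strip() and astype() return new objects, so assign the results back
df['Localization'] = df['Localization'].str.strip()
df['RNA level'] = df['RNA level'].str.strip()
df['Size'] = df['Size'].astype(int)
# Each Size as a percentage of the total Size within its Localization
df['Percent'] = df.groupby('Localization')['Size'].transform(lambda x: 100 * x / x.sum())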
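If you would rather not rebuild the data from text, the same idea should also work directly on the result of groupby().size(), which is a Series indexed by both grouping columns. The following is only a minimal sketch: the small `raw` frame and the names `sizes`, `totals` and `result` are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Hypothetical miniature of the question's data: one row per observation.
raw = pd.DataFrame({
    "Localization": ["cytoplasm"] * 4 + ["nucleus"] * 2,
    "RNA level": ["1 Non-expressed", "2 Very low", "2 Very low", "2 Very low",
                  "1 Non-expressed", "2 Very low"],
})

# Same shape as in the question: a Series with a two-level index.
sizes = raw.groupby(["Localization", "RNA level"]).size()

# Total per Localization, broadcast back onto every row of that group.
totals = sizes.groupby(level="Localization").transform("sum")

# Each count as a percentage of its Localization total.
result = pd.DataFrame({"Size": sizes, "Percent": 100 * sizes / totals})
print(result)
```

Grouping on the index level keeps everything in one object, so there is no need to split into one DataFrame per Localization and merge afterwards.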
Source: https://stackoverflow.com/questions/23627782/pandas-groupby-size-and-percentages