Question
I have a question about pandas dataframes in Python: I have a large dataframe df that I split into two subsets, df1 and df2. df1 and df2 together do not make up all of df, they are just two mutually exclusive subsets of it. I want to plot this in ggplot with rpy2 and display the variables in the plot based on whether they come from df1 or df2. ggplot2 requires a melted dataframe so I have to create a new dataframe that has a column saying whether each entry was from df1 or df2, so that this column can be passed to ggplot. I tried doing it like this:
# add labels to df1, df2
df1["label"] = len(df1.index) * ["df1"]
df2["label"] = len(df2.index) * ["df2"]
# combine the dfs together
melted_df = pandas.concat([df1, df2])
Now it can be plotted as in:
# plot parameters from melted_df and colour them by df1 or df2
ggplot2.ggplot(melted_df) + ggplot2.ggplot(aes_string(..., colour="label"))
My question is whether there's an easier, shorthand way of doing this. ggplot requires constantly melting/unmelting dfs, and it seems cumbersome to always manually add the label column to distinct subsets of df. Thanks.
Answer 1:
Certainly you can simplify by using:
df1['label'] = 'df1'
(rather than df1["label"] = len(df1.index) * ["df1"].)
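To see why the shorter form is enough: pandas repeats a scalar for every row when you assign it to a column. A quick check on a made-up frame (the column x and its values are just for illustration, not from the question):

```python
import pandas as pd

# Toy frame standing in for df1 (hypothetical data)
df1 = pd.DataFrame({"x": [1, 2, 3]})

# Scalar assignment: pandas fills the whole column with the value
scalar = df1.copy()
scalar["label"] = "df1"

# The longer form from the question: one list entry per row
listed = df1.copy()
listed["label"] = len(listed.index) * ["df1"]

# Both approaches produce the same column contents
same = list(scalar["label"]) == list(listed["label"])
```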
If you find yourself doing this a lot, why not create your own function? (something like this):
import pandas as pd

def plot_dfs(dfs):
    # label the frames df1, df2, ... in the order given
    for i, df in enumerate(dfs):
        df['label'] = 'df%s' % (i + 1)  # note: this *changes* df
    melted_df = pd.concat(dfs)
    # plot parameters from melted_df and colour them by df1 or df2
    ggplot2.ggplot(melted_df) + ggplot2.ggplot(aes_string(..., colour="label"))
    return melted_df  # or return the ggplot object instead
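If you would rather not modify df1 and df2 (they are slices of df, and assigning into a slice can raise SettingWithCopyWarning in some pandas versions), here is a sketch of a variant that leaves the inputs alone. It uses DataFrame.assign, which returns a new frame. The helper name label_and_concat and the toy data are my own, not from the question, and the plotting step is left out:

```python
import pandas as pd

def label_and_concat(frames):
    """Stack the given frames, adding a 'label' column naming each row's source.

    `frames` maps a label (e.g. "df1") to a DataFrame. The inputs are not
    modified because DataFrame.assign returns a new frame.
    """
    labelled = [df.assign(label=name) for name, df in frames.items()]
    return pd.concat(labelled)

# Hypothetical data: two mutually exclusive subsets that do not cover all of df
df = pd.DataFrame({"x": range(6), "y": [2, 4, 6, 8, 10, 12]})
df1 = df[df["x"] < 2]
df2 = df[df["x"] > 3]

melted_df = label_and_concat({"df1": df1, "df2": df2})
```

The result keeps the original row index from df, and melted_df can then be handed to ggplot with colour="label" as in the question.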
Source: https://stackoverflow.com/questions/15053834/splitting-and-concatenating-dataframes-in-python-pandas-for-plotting-with-rpy2