问题
city temperature windspeed event
day
2017-01-01 new york 32 6 Rain
2017-01-02 new york 36 7 Sunny
2017-01-03 new york 28 12 Snow
2017-01-04 new york 33 7 Sunny
2017-01-05 new york 31 7 Rain
2017-01-06 new york 33 5 Sunny
2017-01-07 new york 27 12 Rain
2017-01-08 new york 23 7 Rain
2017-01-01 mumbai 90 5 Sunny
2017-01-02 mumbai 85 12 Fog
2017-01-03 mumbai 87 15 Fog
2017-01-04 mumbai 92 5 Rain
2017-01-05 mumbai 89 7 Sunny
2017-01-06 mumbai 80 10 Fog
2017-01-07 mumbai 85 9 Sunny
2017-01-08 mumbai 89 8 Rain
2017-01-01 paris 45 20 Sunny
2017-01-02 paris 50 13 Cloudy
2017-01-03 paris 54 8 Cloudy
2017-01-04 paris 42 10 Cloudy
2017-01-05 paris 43 20 Sunny
2017-01-06 paris 48 4 Cloudy
2017-01-07 paris 40 14 Rain
2017-01-08 paris 42 15 Cloudy
2017-01-09 paris 53 8 Sunny
The above shows the original data.
Below shows the result using np.array_split(data, 4).
day city temperature windspeed event
2017-01-01 new york 32 6 Rain
2017-01-02 new york 36 7 Sunny
2017-01-03 new york 28 12 Snow
2017-01-04 new york 33 7 Sunny
2017-01-05 new york 31 7 Rain
2017-01-06 new york 33 5 Sunny
2017-01-07 new york 27 12 Rain
day city temperature windspeed event
2017-01-08 new york 23 7 Rain
2017-01-01 mumbai 90 5 Sunny
2017-01-02 mumbai 85 12 Fog
2017-01-03 mumbai 87 15 Fog
2017-01-04 mumbai 92 5 Rain
2017-01-05 mumbai 89 7 Sunny
day city temperature windspeed event
2017-01-06 mumbai 80 10 Fog
2017-01-07 mumbai 85 9 Sunny
2017-01-08 mumbai 89 8 Rain
2017-01-01 paris 45 20 Sunny
2017-01-02 paris 50 13 Cloudy
2017-01-03 paris 54 8 Cloudy
day city temperature windspeed event
2017-01-04 paris 42 10 Cloudy
2017-01-05 paris 43 20 Sunny
2017-01-06 paris 48 4 Cloudy
2017-01-07 paris 40 14 Rain
2017-01-08 paris 42 15 Cloudy
2017-01-09 paris 53 8 Sunny
As you can see here, I'm trying to create 4 groups from the original data, making sure that each group has all the cities. however, by using array.split(), it split the data into 4 groups but it does not contain all the cities. I want each group to have Mumbai, Paris and New York. How can I do that?
Meaning to say, what I'm trying to achieve is something like below:
Group 1:
day city temperature windspeed event
2017-01-01 new york 32 6 Rain
2017-01-02 paris 50 13 Cloudy
2017-01-02 mumbai 85 12 Fog,
2017-01-05 new york 31 7 Rain
2017-01-06 new york 33 5 Sunny
2017-01-05 mumbai 89 7 Sunny
2017-01-05 paris 43 20 Sunny
Group 2:
day city temperature windspeed event
2017-01-04 new york 33 7 Sunny
2017-01-01 mumbai 90 5 Sunny
2017-01-03 paris 54 8 Cloudy
2017-01-07 new york 27 12 Rain
2017-01-06 mumbai 80 10 Fog
2017-01-09 paris 53 8 Sunny
Group 3:
day city temperature windspeed event
2017-01-02 new york 36 7 Sunny
2017-01-03 mumbai 87 15 Fog
2017-01-01 paris 45 20 Sunny,
2017-01-08 mumbai 89 8 Rain
2017-01-06 paris 48 4 Cloudy
2017-01-07 paris 40 14 Rain
Group 4:
day city temperature windspeed event
2017-01-03 new york 28 12 Snow,
2017-01-04 mumbai 92 5 Rain
2017-01-07 mumbai 85 9 Sunny
2017-01-04 paris 42 10 Cloudy
2017-01-08 paris 42 15 Cloudy
2017-01-08 new york 23 7 Rain
As you can see from the expected result, the main thing is that all the groups contain each topic.
What I have in mind is to group the data by city, then from each city's dataframe, divide the data into 4 groups, then for each group in the city, combine the data to get 4 final group.
回答1:
You can create a helper column via GroupBy
+ cumcount
to count the occurrence of each city.
Then use dict
+ tuple
with another GroupBy
to create a dictionary of dataframes, each one containing exactly one occurence of each city.
# add index column giving count of city occurrence
df['index'] = df.groupby('city').cumcount()
# create dictionary of dataframes
d = dict(tuple(df.groupby('index')))
Result:
print(d)
{0: city temperature windspeed event index
day
2017-01-01 newyork 32 6 Rain 0
2017-01-01 mumbai 90 5 Sunny 0
2017-01-01 paris 45 20 Sunny 0,
1: city temperature windspeed event index
day
2017-01-02 newyork 36 7 Sunny 1
2017-01-02 mumbai 85 12 Fog 1
2017-01-02 paris 50 13 Cloudy 1,
2: city temperature windspeed event index
day
2017-01-03 newyork 28 12 Snow 2
2017-01-03 mumbai 87 15 Fog 2
2017-01-03 paris 54 8 Cloudy 2,
3: city temperature windspeed event index
day
2017-01-04 newyork 33 7 Sunny 3
2017-01-04 mumbai 92 5 Rain 3
2017-01-04 paris 42 10 Cloudy 3}
You can then extract individual "groups" via d[0]
, d[1]
, d[2]
, d[3]
. In this particular case, you may wish to group by dates instead, i.e.
d = {df_.index[0]: df_ for _, df_ in df.groupby('index')}
回答2:
This is my approach to this. First sort your dataframe by day
and city
:
df = df.sort_values(by=['day', 'city'])
Next find an even split of 4 groups for your dataframe - if the split is not even then the last group will get the remaining:
n = int(len(df)/4)
groups_n = np.cumsum([0, n, n, n, len(df)-(3*n)])
print(groups_n)
OUT >> array([ 0, 6, 12, 18, 25], dtype=int32)
groups_n
is the start
and end
for each group. So Group 1
I will take df.iloc[0:6]
and Group 4
I will take df.iloc[18:25]
.
So your final dictionary, d
, of the 4 group split of your dataframe will be:
d = {}
for i in range(4):
d[i+1] = df.iloc[groups_n[i]:groups_n[i+1]]
Example Outputs:Group 1 (
d[1]
)
city temperature windspeed event
day
2017-01-01 mumbai 90 5 Sunny
2017-01-01 new york 32 6 Rain
2017-01-01 paris 45 20 Sunny
2017-01-02 mumbai 85 12 Fog
2017-01-02 new york 36 7 Sunny
2017-01-02 paris 50 13 Cloudy
Group 4: (
d[4]
)
city temperature windspeed event
day
2017-01-07 mumbai 85 9 Sunny
2017-01-07 new york 27 12 Rain
2017-01-07 paris 40 14 Rain
2017-01-08 mumbai 89 8 Rain
2017-01-08 new york 23 7 Rain
2017-01-08 paris 42 15 Cloudy
2017-01-09 paris 53 8 Sunny
来源:https://stackoverflow.com/questions/51723928/splitting-groupby-in-pandas-into-smaller-groups-and-combining-them