In the following, male_trips is a big pandas DataFrame and stations is a small pandas DataFrame. For each station id I'd like to know how many male trips took place.
I'd do it like Vishal, but use size() instead of sum() to get a count of the number of rows allocated to each group of 'start_station_id'. So:
df = male_trips.groupby('start_station_id').size()
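For example, a minimal sketch with made-up data (only the column names start_station_id and id come from the question; the sample values and the final reindex step are my own additions):
import pandas as pd

# toy stand-ins for the real frames; the values are invented
male_trips = pd.DataFrame({'start_station_id': [1, 1, 2, 3, 3, 3]})
stations = pd.DataFrame({'id': [1, 2, 3, 4]})

# number of rows per start_station_id
counts = male_trips.groupby('start_station_id').size()

# align with every station id, so stations with no male trips show up as 0
counts_per_station = counts.reindex(stations['id'], fill_value=0)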
Doesn't male_trips.count() work? http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html
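As far as I can tell, count() on its own won't do it: DataFrame.count() returns the number of non-NA values per column, not a per-station tally, so it still needs a groupby in front of it. A quick illustration (frame contents assumed, as in the toy data above):
# per-column non-NA counts for the whole frame, not a per-station tally
male_trips.count()

# per-station counts; equals groupby(...).size() when there are no NAs
male_trips.groupby('start_station_id')['start_station_id'].count()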
How long would this take:
df = male_trips.groupby('start_station_id').sum()
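One way to get a rough answer is to time it yourself on synthetic data of a similar shape (the sizes and the dummy column below are assumptions, not from the question):
import timeit
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
big = pd.DataFrame({
    'start_station_id': rng.integers(0, 300, size=1_000_000),
    'dummy': 1,  # placeholder column so .sum() has something to add up
})

print(timeit.timeit(lambda: big.groupby('start_station_id')['dummy'].sum(), number=10))
print(timeit.timeit(lambda: big.groupby('start_station_id').size(), number=10))
print(timeit.timeit(lambda: big['start_station_id'].value_counts(), number=10))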
edit: after seeing in the answer above that isin and value_counts exist (and that value_counts even comes with its own entry in pandas.core.algorithm, and also that isin isn't simply np.in1d), I updated the three methods below; a self-contained toy run of all three follows after them.
male_trips.start_station_id[male_trips.start_station_id.isin(stations.id)].value_counts()
You could also do an inner join on stations.id:
pd.merge(male_trips, stations, left_on='start_station_id', right_on='id')
followed by value_counts.
Or:
male_trips.set_index('start_station_id', inplace=True)
stations.set_index('id', inplace=True)
male_trips.ix[male_trips.index.intersection(stations.index)].reset_index().start_station_id.value_counts()
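For reference, here is the promised self-contained toy run of all three methods (the data is invented, and I use .loc where the last method above uses .ix, since .ix has been removed from newer pandas):
import pandas as pd

# invented stand-ins for the real frames
male_trips = pd.DataFrame({
    'start_station_id': [1, 1, 2, 3, 3, 3, 9],  # station 9 is not in stations
    'duration': [5, 7, 3, 9, 2, 4, 6],
})
stations = pd.DataFrame({'id': [1, 2, 3, 4]})

# method 1: filter with isin, then count
m1 = male_trips.start_station_id[male_trips.start_station_id.isin(stations.id)].value_counts()

# method 2: inner join on the station id, then count
m2 = pd.merge(male_trips, stations, left_on='start_station_id', right_on='id').start_station_id.value_counts()

# method 3: index both frames by station id, intersect the indexes, then count
mt = male_trips.set_index('start_station_id')
st = stations.set_index('id')
m3 = mt.loc[mt.index.intersection(st.index)].reset_index().start_station_id.value_counts()

# all three drop station 9 and agree on the counts for stations 1, 2 and 3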
If you have the time, I'd be interested in how differently these perform with a huge DataFrame.
My answer below works in Pandas 0.7.3. Not sure about newer releases.
This is what the pandas.Series.value_counts method is for:
count_series = male_trips.start_station_id.value_counts()
It should be straightforward to then inspect count_series based on the values in stations['id']. However, if you insist on only considering those values, you could do the following:
count_series = (
    male_trips[male_trips.start_station_id.isin(stations.id.values)]
    .start_station_id
    .value_counts()
)
and this will only give counts for station IDs actually found in stations.id.
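As a follow-up to inspecting count_series against stations['id'], one option is a reindex (this line is my addition, not part of the answer above):
# line the counts up with every station id; stations with no male trips get 0
count_series.reindex(stations['id'], fill_value=0)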