Plotting event density in Python with ggplot and pandas

Submitted by 偶尔善良 on 2020-01-14 15:00:11

Question


I am trying to visualize data of this form:

  timestamp               senderId
0     735217  106758968942084595234
1     735217  114647222927547413607
2     735217  106758968942084595234
3     735217  106758968942084595234
4     735217  114647222927547413607
5     etc...

geom_density works if I don't separate the senderIds:

import pandas as pd
from ggplot import *

df = pd.read_pickle('data.pkl')
df.columns = ['timestamp', 'senderId']
plot = ggplot(aes(x='timestamp'), data=df) + geom_density()
print(plot)

The result looks as expected:

However, if I want to show each senderId separately, as is done in the docs, it fails:

> plot = ggplot(aes(x='timestamp', color='senderId'), data=df) + geom_density()
ValueError: `dataset` input should have multiple elements.

Trying out with a larger dataset (~40K events):

> plot = ggplot(aes(x='timestamp', color='senderId'), data=df) + geom_density()
numpy.linalg.linalg.LinAlgError: singular matrix

Any idea? There are some answers on SO for those errors but none seems relevant.

This is the kind of graph I want (from ggplot's doc):


Answer 1:


With the smaller dataset:

> plot = ggplot(aes(x='timestamp', color='senderId'), data=df) + geom_density()
ValueError: `dataset` input should have multiple elements.

This was because some senderIds had only one row.
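
One way to check for and drop such single-row senders before plotting is sketched below. The sample DataFrame is made up for illustration (short string IDs stand in for the real senderIds), and `df_multi` is a hypothetical name; only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample data: sender "b" has only one event
df = pd.DataFrame({
    "timestamp": [735217, 735218, 735219, 735217],
    "senderId": ["a", "a", "a", "b"],
})

# Count the events of each sender and align that count with every row
rows_per_sender = df.groupby("senderId")["timestamp"].transform("count")

# Keep only senders with at least two events, since a density
# estimate needs more than one data point per group
df_multi = df[rows_per_sender >= 2]
print(df_multi)
```

Passing `df_multi` instead of `df` to `ggplot(...)` should then avoid groups with a single element.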

With the bigger dataset:

> plot = ggplot(aes(x='timestamp', color='senderId'), data=df) + geom_density()
numpy.linalg.linalg.LinAlgError: singular matrix

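This was because some senderIds had multiple rows at the exact same timestamp. A group whose timestamps are all identical has zero variance, which is presumably what makes the matrix singular, and ggplot does not handle this case. I solved it by using finer-grained timestamps.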
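
If finer timestamps are not available, an alternative is to leave out the senders whose timestamps are all identical. The following is a minimal sketch, assuming the error is triggered by groups with fewer than two distinct timestamp values. The sample data and the name `df_plottable` are hypothetical.

```python
import pandas as pd

# Hypothetical sample data: every event of sender "a" has the same timestamp
df = pd.DataFrame({
    "timestamp": [735217, 735217, 735217, 735217, 735218],
    "senderId": ["a", "a", "a", "b", "b"],
})

# Number of distinct timestamps per sender, aligned with every row
distinct = df.groupby("senderId")["timestamp"].transform("nunique")

# Keep only senders whose timestamps actually vary
# (assumption: zero-variance groups are what cause the singular matrix)
df_plottable = df[distinct >= 2]
print(df_plottable)
```

This filter also removes senders with a single row, since they have only one distinct timestamp, so it covers the smaller dataset's error as well.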



Source: https://stackoverflow.com/questions/40101519/plotting-event-density-in-python-with-ggplot-and-pandas
