Creating data histograms/visualizations using ipython and filtering out some values

Posted by 房东的猫 on 2019-12-11 17:33:42

Question


I posted a question earlier ( Pandas-ipython, how to create new data frames with drill down capabilities ) and it was pointed out that it was possibly too broad, so here are some more specific questions that may be easier to answer and that would help me get started with graphing data.

I have decided to try creating some visualizations of my data using Pandas (or any package accessible through ipython). The first, obvious problem I run into is how to filter on certain conditions. For example, I type the command:

df.Duration.hist(bins=10)

but I get an error due to unrecognized dtypes (some entries aren't in datetime format). How can I exclude these entries in the original command?

Also, what if I want to create the same histogram but keep only records whose IDs (in an account id field) start with the integer (or string?) '2'?

Ultimately, I want to be able to create histograms, line plots, box plots and so on while filtering for certain months, user IDs, or just bad dtypes.

Can anyone help me modify the above command to add filters to it? (I'm decent with Python but new to data.)

Thanks.

Update: a kind user below has been trying to help me with this problem. I have a few developments to add to the question and a more specific problem.

I have columns in my data frame for Start Time and End Time and created a 'Duration' column for time lapsed.

The Start Time/End Time columns have fields that look like:

2014/03/30 15:45

and when I apply pd.to_datetime() to these columns, the resulting values look like:

2014-03-30 15:45:00

I converted both columns to datetime and created a new 'Duration' column (the time elapsed) in one command:

df['Duration'] = pd.to_datetime(df['End Time'])-pd.to_datetime(df['Start Time'])

The values in the Duration column look like:

01:14:00

that is, hh:mm:ss, indicating the time elapsed (74 minutes in the above example).

The dtype of the Duration column is:

dtype('<m8[ns]')  

The question is, how can I convert these fields to just integers?


Answer 1:


I think you need to convert the duration (timedelta64) to a plain number (assuming you have a duration column). Then the .hist method will work.

from pandas import Series
from numpy.random import rand
from numpy import timedelta64

In [21]:

a = (rand(3) *10).astype(int)
a
Out[21]:
array([3, 3, 8])
In [22]:

b = [timedelta64(x, 'D') for x in a] # a list of durations
b
Out[22]:
[numpy.timedelta64(3,'D'), numpy.timedelta64(3,'D'), numpy.timedelta64(8,'D')]
In [23]:

c = Series(b) # a Series of durations
c
Out[23]:
0   3 days
1   3 days
2   8 days
dtype: timedelta64[ns]
In [27]:

d = c.apply(lambda x: x / timedelta64(1,'D')) # convert each duration to a number of days (float)
d
Out[27]:
0    3
1    3
2    8
dtype: float64
In [28]:

d.hist()

I converted the duration to days ('D'), but you can convert it to any valid unit, such as hours ('h') or minutes ('m').
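
As a rough sketch of how the same idea might be applied to the data described in the question, the example below parses the Start Time and End Time strings, turns unparseable entries into missing values, converts the duration to minutes, and keeps only the rows whose account id starts with '2'. The sample values and the 'Account ID' and 'Minutes' column names are assumptions made up for illustration; only 'Start Time', 'End Time' and 'Duration' come from the question.

```python
import pandas as pd

# Hypothetical sample data: the values (including the bad entry) and the
# "Account ID" column name are invented for illustration.
df = pd.DataFrame({
    "Account ID": [2001, 2002, 3001, 2003],
    "Start Time": ["2014/03/30 15:45", "2014/03/30 16:00",
                   "2014/03/31 09:10", "not a date"],
    "End Time": ["2014/03/30 16:59", "2014/03/30 16:30",
                 "2014/03/31 10:10", "2014/03/31 12:00"],
})

# errors="coerce" turns entries that do not match the format into NaT
# instead of raising an error.
fmt = "%Y/%m/%d %H:%M"
start = pd.to_datetime(df["Start Time"], format=fmt, errors="coerce")
end = pd.to_datetime(df["End Time"], format=fmt, errors="coerce")

# Duration is a timedelta64 column; convert it to minutes as floats.
df["Duration"] = end - start
df["Minutes"] = df["Duration"].dt.total_seconds() / 60

# Drop rows where the duration could not be computed.
valid = df.dropna(subset=["Minutes"])

# Keep only account ids starting with '2'; casting to str handles
# both integer and string ids.
starts_with_2 = valid[valid["Account ID"].astype(str).str.startswith("2")]
minutes = starts_with_2["Minutes"]

print(minutes.tolist())  # [74.0, 30.0]
```

With matplotlib installed, calling minutes.hist(bins=10) on the filtered Series should then plot the histogram. Filtering first and plotting afterwards keeps each condition a separate, readable step, and the same boolean-mask pattern can be reused for months or user ids.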



Source: https://stackoverflow.com/questions/25128537/creating-data-histograms-visualizations-using-ipython-and-filtering-out-some-val
