Question
I am using pandas 0.18.1 with Python 2.7.x. I first read an empty DataFrame, and all of its columns have dtype object, which is OK. When I assign one row of data, the dtype of the numeric columns changes to float64. I was expecting int or int64. Why does this happen?
Is there a global option that tells pandas to treat numeric values as int by default unless the data contains a decimal point ('.')? For example, given [0, 1.0, 2.], the first column would be int and the other two float64.
For example:
>>> df = pd.read_csv('foo.csv', engine='python', keep_default_na=False)
>>> print df.dtypes
bbox_id_seqno object
type object
layer object
ll_x object
ll_y object
ur_x object
ur_y object
polygon_count object
dtype: object
>>> df.loc[0] = ['a', 'b', 'c', 1, 2, 3, 4, 5]
>>> print df.dtypes
bbox_id_seqno object
type object
layer object
ll_x float64
ll_y float64
ur_x float64
ur_y float64
polygon_count float64
dtype: object
Answer 1:
It's not possible for pandas to store NaN values in (NumPy-backed) integer columns.
This makes float the obvious default choice for data storage: with an integer column, pandas would have to change the data type of the entire column as soon as a missing value arises, and missing values arise very often in practice.
As for why this is, it's a restriction inherited from NumPy. Basically, pandas needs to set aside a particular bit pattern to represent NaN. This is straightforward for floating-point numbers, where it is defined in the IEEE 754 standard. It's more awkward and less efficient to do this for a fixed-width integer.
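The upcasting described above can be seen in a small sketch. The Series values and the reindexing step are made up for illustration; they are not taken from the question's data.

```python
import numpy as np
import pandas as pd

# A Series of plain integers gets an integer dtype.
s = pd.Series([1, 2, 3])
print(s.dtype)

# Reindexing onto a label with no value introduces NaN,
# so pandas converts the whole column to float64.
s_missing = s.reindex([0, 1, 2, 3])
print(s_missing.dtype)

# NumPy itself refuses to put NaN into an integer array.
arr = np.array([1, 2, 3], dtype=np.int64)
try:
    arr[0] = np.nan
    nan_rejected = False
except ValueError:
    nan_rejected = True
print(nan_rejected)
```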
Update
Exciting news in pandas 0.24. IntegerArray is an experimental feature but might render my original answer obsolete. So if you're reading this on or after 27 Feb 2019, check out the docs for that feature.
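As a minimal sketch of that feature, assuming pandas 0.24 or later: the nullable extension dtype is requested with the string "Int64" (capital I), and the values below are illustrative only.

```python
import pandas as pd

# "Int64" (capital I) selects the nullable integer extension dtype,
# which can hold a missing value without falling back to float64.
s = pd.Series([1, 2, None], dtype="Int64")
print(s.dtype)
print(s.isna().tolist())
```

Note that this dtype has to be requested explicitly; it does not change what pandas infers by default.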
Answer 2:
The why is almost certainly to do with flexibility and speed. Just because Pandas has only seen an integer in that column so far doesn't mean that you're not going to try to add a float later, which would require Pandas to go back and change the type for all that column. A float is the most robust/flexible numeric type.
There's no global way to override that behaviour (that I'm aware of), but you can use the astype method to modify an individual DataFrame.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.astype.html
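A rough sketch of that approach follows. It assumes a pandas release newer than the 0.18.1 in the question, in which DataFrame.astype accepts a column-to-dtype mapping; the one-row frame here is invented to stand in for the question's data after the row assignment.

```python
import pandas as pd

# Hypothetical frame whose numeric columns have already become float64.
df = pd.DataFrame({"layer": ["c"], "ll_x": [1.0], "ll_y": [2.0]})

# Cast only the named columns back to integers; other columns are untouched.
df = df.astype({"ll_x": "int64", "ll_y": "int64"})
print(df.dtypes)
```

The cast only succeeds when the columns contain no NaN values, which is the limitation discussed in the first answer.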
Answer 3:
If you are reading an empty dataframe, you can explicitly cast the types for each column after reading it.
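import pandas as pd

# Desired dtype for each column of the (empty) CSV
dtypes = {
    'bbox_id_seqno': object,
    'type': object,
    'layer': object,
    'll_x': int,
    'll_y': int,
    'ur_x': int,
    'ur_y': int,
    'polygon_count': int
}
df = pd.read_csv('foo.csv', engine='python', keep_default_na=False)
# Cast each column while the frame is still empty
# (.items() works on both Python 2 and 3; the original used .iteritems())
for col, dtype in dtypes.items():
    df[col] = df[col].astype(dtype)
df.loc[0] = ['a', 'b', 'c', 1, 2, 3, 4, 5]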
>>> df.dtypes
bbox_id_seqno object
type object
layer object
ll_x int64
ll_y int64
ur_x int64
ur_y int64
polygon_count int64
dtype: object
If you don't know the column names in your empty dataframe, you can initially assign everything as an int and then let Pandas sort it out.
for col in df:
    df[col] = df[col].astype(int)
Source: https://stackoverflow.com/questions/38003406/pandas-why-is-default-column-type-for-numeric-float