Pandas DataFrame iloc spoils the data type

Question

I'm using pandas 0.19.2.

Here's an example:

import pandas as pd

testdf = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1.0, 2.0, 3.0, 4.0]})
testdf.dtypes

Output:

A      int64
B    float64
dtype: object

Everything looks fine so far, but the two calls below return different types. Note that the first call is pd.Series.iloc and the second is pd.DataFrame.iloc:

print(type(testdf.A.iloc[0]))
print(type(testdf.iloc[0].A))

Output:

<class 'numpy.int64'>
<class 'numpy.float64'>

I found this while trying to understand why a pd.DataFrame.join() operation returned almost no matches between two int64 columns when there should have been many. My guess is that a type inconsistency related to this behaviour is the cause, but I'm not sure. A short investigation turned up the result above, and now I'm a bit confused.

If someone knows how to solve it, I'd be very grateful for any hints!

Update

Thanks to @EdChum for the comments. Here is an example with my generated data showing the join/merge behaviour:

testdf.join(testdf, on='A', rsuffix='3')

   A    B   A3   B3
0  1  1.0  2.0  2.0
1  2  2.0  3.0  3.0
2  3  3.0  4.0  4.0
3  4  4.0  NaN  NaN

By contrast, pd.merge(left=testdf, right=testdf, on='A'), which I expected to be roughly equivalent, returns:

   A  B_x  B_y
0  1  1.0  1.0
1  2  2.0  2.0
2  3  3.0  3.0
3  4  4.0  4.0

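Update 2: Summarizing @EdChum's comment on join versus merge. The problem is that A.join(B, on='C') matches column A['C'] against the index of B, since join uses the index of the right-hand frame by default. That is why, in the join output above, A = 1 was paired with the row at index 1 (A3 = 2.0) and A = 4 found no match. In my case I just used merge to get the desired result.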
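The difference can be checked with a small sketch. It assumes the same testdf as in the question; the '_right' suffix and the variable names are only illustrative. Moving A into the index of the right-hand frame is one way to make join line up with merge:

```python
import pandas as pd

testdf = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1.0, 2.0, 3.0, 4.0]})

# join(on='A') looks up the values of the left column A in the *index*
# of the right frame, so move A into the index of the right frame first.
joined = testdf.join(testdf.set_index('A'), on='A', rsuffix='_right')

# merge(on='A') matches column A against column A directly.
merged = pd.merge(left=testdf, right=testdf, on='A')

print(joined)
print(merged)
```

With the index set this way, every row finds its match and the B_right column of the join agrees with B_y from the merge.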

Answer 1:

This is expected. pandas tracks dtypes per column. When you call testdf.iloc[0], you are asking pandas for a row, and it has to convert the entire row into a Series, which has a single dtype. That row contains a float, so the int is upcast and the row as a Series must be float.

However, it seems that loc and iloc also make this conversion when you select a single cell with one __getitem__ call, such as testdf.loc[0, 'A'].

Here are some interesting test cases, starting with a testdf that has one int column:
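A minimal sketch of that upcast, using the OP's two-column frame (the name row is just illustrative):

```python
import pandas as pd

testdf = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1.0, 2.0, 3.0, 4.0]})

row = testdf.iloc[0]      # one row across both columns -> a Series
print(row.dtype)          # float64: the int from A was upcast to fit B
print(testdf['A'].dtype)  # int64: the column itself is unchanged
```

Only the temporary row Series is float; the data stored in column A is not modified.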

testdf = pd.DataFrame({'A': [1, 2, 3, 4]})

print(type(testdf.iloc[0].A))
print(type(testdf.A.iloc[0]))

<class 'numpy.int64'>
<class 'numpy.int64'>

Now change it to the OP's test case:

testdf = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1.0, 2.0, 3.0, 4.0]})

print(type(testdf.iloc[0].A))
print(type(testdf.A.iloc[0]))

<class 'numpy.float64'>
<class 'numpy.int64'>

print(type(testdf.loc[0, 'A']))
print(type(testdf.iloc[0, 0]))
print(type(testdf.at[0, 'A']))
print(type(testdf.iat[0, 0]))
print(type(testdf.get_value(0, 'A')))

<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.int64'>
<class 'numpy.int64'>
<class 'numpy.int64'>

So it appears that loc and iloc perform some conversion across the row, which I still don't fully understand. I suspect it comes from the fact that loc and iloc differ in nature from at, iat, and get_value: loc and iloc let you access the DataFrame with index arrays and boolean arrays, while at, iat, and get_value only access a single cell at a time.

Despite that, consider assigning through loc:
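If the goal is simply to read a cell without the row-wise upcast, a few access patterns avoid building a mixed-dtype row Series. The sketch below is one way to do it, not the only one, and the variable names are illustrative. Note that the loc/iloc results above come from pandas 0.19.2 and may differ in later releases, and that get_value is left out here because, as far as I know, it was deprecated in pandas 0.21 and removed in 1.0.

```python
import numpy as np
import pandas as pd

testdf = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1.0, 2.0, 3.0, 4.0]})

# Pick the column first, then the position: the column Series is int64.
via_column = testdf['A'].iloc[0]

# Scalar accessors read a single cell and do not build a row Series.
via_at = testdf.at[0, 'A']
via_iat = testdf.iat[0, 0]

# A list selector returns a one-row DataFrame, which keeps per-column dtypes.
one_row = testdf.iloc[[0]]

print(type(via_column), type(via_at), type(via_iat))
print(one_row.dtypes)
```

Selecting with a list (iloc[[0]]) is handy when the whole row is needed, since each column keeps its own dtype in the resulting one-row DataFrame.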

testdf.loc[0, 'A'] = 10

print(type(testdf.at[0, 'A']))

When we assign to that location via loc, pandas ensures that the dtype stays consistent: assigning the integer 10 leaves column A as int64.

Source: https://stackoverflow.com/questions/41662881/pandas-dataframe-iloc-spoils-the-data-type

Tags

python

python-3.x

pandas