Python pandas: how to obtain the datatypes of objects in a mixed-datatype column?

问题

Given a pandas.DataFrame with a column holding mixed datatypes, like e.g.

df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string']})

I was wondering how to obtain the datatypes of the individual objects in the column (Series)? Suppose I want to modify all entries in the Series that are of a certain type, like multiply all integers by some factor.

I could iteratively derive a mask and use it in loc, like

m = np.array([isinstance(v, int) for v in df['mixed']])

df.loc[m, 'mixed'] *= 10

# df
#                  mixed
# 0  2020-10-04 00:00:00
# 1                 9990
# 2             a string

That does the trick but I was wondering if there was a more pandastic way of doing this?

回答1:

Still need call type

m = df.mixed.map(lambda x : type(x).__name__)=='int'
df.loc[m, 'mixed']*=10
df
                 mixed
0  2020-10-04 00:00:00
1                 9990
2             a string

回答2:

One idea is test if numeric by to_numeric with errors='coerce' and for non missing values:

m = pd.to_numeric(df['mixed'], errors='coerce').notna()
df.loc[m, 'mixed'] *= 10
print (df)
                 mixed
0  2020-10-04 00:00:00
1                 9990
2             a string

Unfortunately is is slow, some another ideas:

N = 1000000
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string'] * N})


In [29]: %timeit df.mixed.map(lambda x : type(x).__name__)=='int'
1.26 s ± 83.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [30]: %timeit np.array([isinstance(v, int) for v in df['mixed']])
1.12 s ± 77.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [31]: %timeit pd.to_numeric(df['mixed'], errors='coerce').notna()
3.07 s ± 55.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [34]: %timeit ([isinstance(v, int) for v in df['mixed']])
909 ms ± 8.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [35]: %timeit df.mixed.map(lambda x : type(x))=='int'
877 ms ± 8.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [36]: %timeit df.mixed.map(lambda x : type(x) =='int')
842 ms ± 6.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [37]: %timeit df.mixed.map(lambda x : isinstance(x, int))
807 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Pandas by default here cannot use vectorization effectivelly, because mixed values - so is necessary elementwise approaches.

回答3:

If you want to multiple all 'numbers' then you can use the following.

Let's use pd.to_numeric with parameter errors = 'coerce' and fillna:

df['mixed'] = (pd.to_numeric(df['mixed'], errors='coerce') * 10).fillna(df['mixed'])
df

Output:

                 mixed
0  2020-10-04 00:00:00
1                 9990
2             a string

Let's add a float to the column

df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string', 100.3]})

Using @BenYo:

m = df.mixed.map(lambda x : type(x).__name__)=='int'
df.loc[m, 'mixed']*=10
df

Output (note only the integer 999 is multiplied by 10):

                 mixed
0  2020-10-04 00:00:00
1                 9990
2             a string
3                100.3

Using @jezrael and similiarly this solution:

m = pd.to_numeric(df['mixed'], errors='coerce').notna()
df.loc[m, 'mixed'] *= 10
print(df)

# Or this solution
# df['mixed'] = (pd.to_numeric(df['mixed'], errors='coerce') * 10).fillna(df['mixed'])

Output (note all numbers are multiplied by 10):

                 mixed
0  2020-10-04 00:00:00
1                 9990
2             a string
3                 1003

回答4:

If you do many calculation and have a littile more memory, I suggest you to add a column to indicate the type of the mixed, for better efficiency. After you construct this column, the calculation is much faster.

here's the code:

N = 1000000
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string'] * N})
df["mixed_type"] = df.mixed.map(lambda x: type(x).__name__).astype('category')
m = df.mixed_type == 'int'
df.loc[m, "mixed"] *= 10
del df["mixed_type"] # after you finish all your calculation

the mixed_type column repr is

0          Timestamp
1                int
2                str
3          Timestamp
4                int
             ...    
2999995          int
2999996          str
2999997    Timestamp
2999998          int
2999999          str
Name: mixed, Length: 3000000, dtype: category
Categories (3, object): [Timestamp, int, str]

and here's the timeit

>>> %timeit df.mixed_type == 'int'
472 µs ± 57.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> %timeit df.mixed.map(lambda x : type(x).__name__)=='int'
1.12 s ± 87.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

回答5:

For not very long data frames I can suggest this way as well:

df = df.assign(mixed = lambda x: x.apply(lambda s: s['mixed']*10 if isinstance(s['mixed'], int) else s['mixed'],axis=1))

来源：https://stackoverflow.com/questions/64195782/python-pandas-how-to-obtain-the-datatypes-of-objects-in-a-mixed-datatype-column

标签

python

pandas

types