Is there a better way to determine whether a variable in Pandas and/or NumPy is numeric or not?
I have a self defined
Based on @jaime's answer in the comments, you need to check .dtype.kind for the column of interest. For example:
>>> import pandas as pd
>>> df = pd.DataFrame({'numeric': [1, 2, 3], 'not_numeric': ['A', 'B', 'C']})
>>> df['numeric'].dtype.kind in 'biufc'
True
>>> df['not_numeric'].dtype.kind in 'biufc'
False
NB: The meaning of biufc: b = bool, i = int (signed), u = unsigned int, f = float, c = complex. See https://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.kind.html#numpy.dtype.kind
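The kind check can be wrapped in a small helper so it reads the same for NumPy arrays and pandas Series. This is only a sketch: the names NUMERIC_KINDS and has_numeric_kind are made up for illustration and are not part of either library.

```python
import numpy as np
import pandas as pd

# Hypothetical constant: the dtype kinds treated as numeric here
# (bool, signed int, unsigned int, float, complex).
NUMERIC_KINDS = frozenset("biufc")

def has_numeric_kind(obj):
    """Return True if obj (ndarray or Series) has a dtype whose kind is numeric."""
    # getattr guards against dtype objects that do not expose `kind`.
    return getattr(obj.dtype, "kind", None) in NUMERIC_KINDS

df = pd.DataFrame({"numeric": [1, 2, 3], "not_numeric": ["A", "B", "C"]})
print(has_numeric_kind(df["numeric"]))
print(has_numeric_kind(df["not_numeric"]))
print(has_numeric_kind(np.array([1.5, 2.5])))
```

Note that 'b' makes boolean columns count as numeric; drop it from the set if that is not what you want.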
Pandas has a select_dtypes function. You can easily filter your columns on int64 and float64 like this:
df.select_dtypes(include=['int64','float64'])
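One thing to watch for: listing int64 and float64 explicitly skips other widths such as int32 or float32. A minimal sketch with a made-up frame (the column names are just for illustration), comparing the explicit list with the broader "number" selector:

```python
import numpy as np
import pandas as pd

# Illustrative frame: one int32 column, one float64 column, one text column.
df = pd.DataFrame({
    "small": np.array([1, 2, 3], dtype="int32"),
    "big": np.array([1.0, 2.0, 3.0], dtype="float64"),
    "text": ["a", "b", "c"],
})

# Only exact matches for the listed dtypes are kept, so int32 is dropped.
narrow_cols = list(df.select_dtypes(include=["int64", "float64"]).columns)

# "number" matches every numeric NumPy dtype regardless of width.
broad_cols = list(df.select_dtypes(include="number").columns)

print(narrow_cols)
print(broad_cols)
```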
How about just checking type for one of the values in the column? We've always had something like this:
isinstance(x, (int, long, float, complex))  # Python 2; on Python 3 there is no long, so drop it
When I try to check the datatypes for the columns in the dataframe below, I get them as 'object' and not the numerical type I'm expecting:
from datetime import datetime, timedelta
import pandas as pd
df = pd.DataFrame(columns=('time', 'test1', 'test2'))
for i in range(20):
    df.loc[i] = [datetime.now() - timedelta(hours=i*1000), i*10, i*100]
df.dtypes
time     datetime64[ns]
test1            object
test2            object
dtype: object
When I do the following, it seems to give me an accurate result:
isinstance(df['test1'][len(df['test1'])-1], (int, long, float, complex))  # Python 2; drop long on Python 3
returns
True
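If a column holds plain numbers but is stored as object, another option is to convert it first and then check the dtype, rather than inspecting a single value. A rough sketch, assuming the object column contains only Python ints (the frame here is built directly with dtype=object to imitate that situation):

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Imitate a numeric column that ended up with object dtype.
df = pd.DataFrame({"test1": pd.Series([0, 10, 20], dtype=object)})
print(df["test1"].dtype)

# infer_objects() attempts a soft conversion of object columns to a better dtype.
converted = df.infer_objects()
print(converted["test1"].dtype)

# pd.to_numeric() converts a single Series and raises if a value is not numeric.
as_numeric = pd.to_numeric(df["test1"])
print(is_numeric_dtype(as_numeric))
```

Checking one value only tells you about that value; converting the column tells you whether every value is numeric.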
You can check whether a given column contains numeric values or not using dtypes:
numerical_features = [feature for feature in train_df.columns if train_df[feature].dtypes != 'O']
Note: the "O" must be uppercase.
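A caveat with the "not object" test: datetime columns are not object either, so they would be counted as numerical features. A small sketch with a made-up train_df (the column names are assumptions for illustration), compared against pandas' is_numeric_dtype:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Illustrative frame: a float column, a datetime column and an object column.
train_df = pd.DataFrame({
    "amount": [1.5, 2.5],
    "when": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "label": pd.Series(["a", "b"], dtype=object),
})

# The "not object" test keeps the datetime column as well.
not_object = [c for c in train_df.columns if train_df[c].dtype != "O"]

# is_numeric_dtype rejects datetimes as well as objects.
numeric = [c for c in train_df.columns if is_numeric_dtype(train_df[c])]

print(not_object)
print(numeric)
```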
Just to add to all the other answers, one can also use df.info() to see the data type of each column.
You can use np.issubdtype to check if the dtype is a sub dtype of np.number. Examples:
np.issubdtype(arr.dtype, np.number) # where arr is a numpy array
np.issubdtype(df['X'].dtype, np.number) # where df['X'] is a pandas Series
This works for numpy's dtypes but fails for pandas-specific types like pd.Categorical, as Thomas noted. If you are using categoricals, the is_numeric_dtype function from pandas is a better alternative than np.issubdtype.
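To illustrate the categorical case, here is a short sketch using pandas.api.types.is_numeric_dtype. The Series are made up for the example, and the np.issubdtype call is wrapped in try/except because, as far as I know, NumPy raises a TypeError when it cannot interpret a pandas extension dtype; treat that behaviour as an assumption that may vary by version.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

s_int = pd.Series([1, 2, 3])
s_cat = pd.Series(["a", "b", "a"], dtype="category")

# is_numeric_dtype accepts both NumPy dtypes and pandas extension dtypes.
print(is_numeric_dtype(s_int))
print(is_numeric_dtype(s_cat))

# np.issubdtype only understands NumPy dtypes, so guard the call.
try:
    np.issubdtype(s_cat.dtype, np.number)
except TypeError as exc:
    print("np.issubdtype could not interpret the dtype:", exc)
```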
df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.0, 2.0, 3.0],
'C': [1j, 2j, 3j], 'D': ['a', 'b', 'c']})
df
Out:
   A    B   C  D
0  1  1.0  1j  a
1  2  2.0  2j  b
2  3  3.0  3j  c
df.dtypes
Out:
A         int64
B       float64
C    complex128
D        object
dtype: object
np.issubdtype(df['A'].dtype, np.number)
Out: True
np.issubdtype(df['B'].dtype, np.number)
Out: True
np.issubdtype(df['C'].dtype, np.number)
Out: True
np.issubdtype(df['D'].dtype, np.number)
Out: False
For multiple columns you can use np.vectorize:
is_number = np.vectorize(lambda x: np.issubdtype(x, np.number))
is_number(df.dtypes)
Out: array([ True, True, True, False], dtype=bool)
And for selection, pandas now has select_dtypes:
df.select_dtypes(include=[np.number])
Out:
   A    B   C
0  1  1.0  1j
1  2  2.0  2j
2  3  3.0  3j