I have a large dataframe with ID numbers:
ID.head()
Out[64]:
0    4806105017087
1    4806105017087
2    4806105017087
3    4901295030089
4    4901295030089
These are all strings at the moment.
I want to convert to int without using loops - for this I use ID.astype(int).
The problem is that some of my rows contain dirty data which cannot be converted to int, e.g.
ID[154382]
Out[58]: 'CN414149'
How can I (without using loops) remove these kinds of values so that I can use astype with peace of mind?
You need to add the parameter errors='coerce' to the function to_numeric:
ID = pd.to_numeric(ID, errors='coerce')
If ID is a column:
df.ID = pd.to_numeric(df.ID, errors='coerce')
Note that non-numeric values are converted to NaN, so the result has dtype float64.
To get int, first replace NaN with some value, e.g. 0, and then cast to int:
df.ID = pd.to_numeric(df.ID, errors='coerce').fillna(0).astype(np.int64)
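Since the question asks to remove the bad rows rather than replace them, here is a minimal sketch of that variant. It assumes the data sits in a DataFrame column named ID, as above; the names numeric_id and clean are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['4806105017087', '4806105017087', 'CN414149']})

# Coerce unparseable strings to NaN (the result is float64).
numeric_id = pd.to_numeric(df['ID'], errors='coerce')

# Keep only the rows whose ID parsed successfully.
valid = numeric_id.notna()
clean = df[valid].copy()

# No NaN is left among the kept rows, so the integer cast cannot fail.
clean['ID'] = numeric_id[valid].astype('int64')

print(clean)
```

This drops the row containing 'CN414149' instead of turning it into 0, which avoids confusing a placeholder with a real ID.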
Sample:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID':['4806105017087','4806105017087','CN414149']})
print(df)
              ID
0  4806105017087
1  4806105017087
2       CN414149

print(pd.to_numeric(df.ID, errors='coerce'))
0    4.806105e+12
1    4.806105e+12
2             NaN
Name: ID, dtype: float64

df.ID = pd.to_numeric(df.ID, errors='coerce').fillna(0).astype(np.int64)
print(df)
              ID
0  4806105017087
1  4806105017087
2              0
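If you would rather keep the bad rows but mark them as missing instead of 0, one possible alternative is the nullable integer dtype. This is a sketch that assumes pandas 0.24 or later, where the 'Int64' dtype (capital I) is available; it is not part of the approach above.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['4806105017087', '4806105017087', 'CN414149']})

# Coerce bad values to NaN, then cast to pandas' nullable integer dtype.
# 'Int64' (capital I) can hold missing values; NumPy's 'int64' cannot.
df['ID'] = pd.to_numeric(df['ID'], errors='coerce').astype('Int64')

# Count how many entries failed to parse.
n_bad = int(df['ID'].isna().sum())
print(df)
print(n_bad)
```

Note that this route, like the fillna one, passes through float64, which represents integers exactly only up to 2**53, so IDs longer than about 15 digits would need a different approach.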