Question
I have records where two fields (called INDATUMA and UTDATUMA) are supposed to contain numbers in the range 20010101 to 20141231 (for the obvious reason). To allow missing values but retain precision up to the nearest date, I would store them as floats (np.float64). I was hoping this would force the occasionally misformatted field (think of 2oo41oo9) to NAs, but instead it breaks the import in both pandas 0.18.0 and IOPro 1.7.2.
Is there an undocumented option I could use? Or some other approach?
The key line for the pandas attempt is
import numpy as np
import pandas as pd
treatments = pd.read_table(filename,usecols=[0,3,4,6], engine='c', dtype={'LopNr':np.uint32,'INDATUMA':np.float64,'UTDATUMA':np.float64,'DIAGNOS':object})
This fails with the error ValueError: invalid literal for float(): 2003o730.
I tried the following in IOPro, just in case:
import iopro
adapter = iopro.text_adapter(filename, parser='csv',delimiter='\t',output='dataframe',infer_types=False)
adapter.set_field_types({0: 'u4',3:'f8', 4:'f8',6:'object'})
all_treatments.append(adapter[[0,3,4,6]][:])
But this also breaks, with iopro.lib.errors.DataTypeError: Could not convert token "2003o730" at record 1 field 3 to float64. Reason: unknown
The datafile starts as
LopNr SJUKHUS MVO INDATUMA UTDATUMA HDIA DIAGNOS OP PVARD EKOD1 EKOD2 EKOD3 EKOD4 EKOD5 ICD
1562 21001 046 20030707 20030711 I489A I489A I509 2 10
1562 21001 046 2003o730 20030801 I501 I501 I489A DG001 2 10
Answer 1:
You can use the converters parameter of read_table:
def converter(num):
    # Parse the raw field as a float; malformed tokens become NaN
    try:
        return float(num)
    except ValueError:
        return np.nan

# Assign the converter to each date column
converters = {'INDATUMA': converter, 'UTDATUMA': converter}
df = pd.read_table(filename, converters=converters)
print(df)
   LopNr  SJUKHUS  MVO  INDATUMA  UTDATUMA   HDIA  DIAGNOS     OP  PVARD  \
0   1562    21001   46  20030707  20030711  I489A    I489A   I509      2
1   1562    21001   46       NaN  20030801   I501     I501  I489A  DG001

   EKOD1  EKOD2  EKOD3  EKOD4  EKOD5  ICD
0     10    NaN    NaN    NaN    NaN  NaN
1      2     10    NaN    NaN    NaN  NaN
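The converter approach can be tried without the original file. The following is a minimal sketch, assuming a small tab-separated sample modelled on the rows in the question. The names sample and to_float_or_nan are made up for illustration, only four of the columns are kept, and read_csv with sep="\t" stands in for read_table:

```python
import io

import numpy as np
import pandas as pd


def to_float_or_nan(token):
    """Parse a raw field as a float; return NaN if it is malformed."""
    try:
        return float(token)
    except ValueError:
        return np.nan


# Hypothetical in-memory stand-in for the real data file
sample = (
    "LopNr\tINDATUMA\tUTDATUMA\tDIAGNOS\n"
    "1562\t20030707\t20030711\tI489A\n"
    "1562\t2003o730\t20030801\tI501\n"
)

# Each converter receives the raw string of one cell in its column
df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    converters={"INDATUMA": to_float_or_nan, "UTDATUMA": to_float_or_nan},
)
print(df)
```

Because converters run as Python functions on every cell, this is likely to be slower on large files than letting the parser handle the dtype directly.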
Or post-process the column with to_numeric, passing errors='coerce':
df['INDATUMA'] = pd.to_numeric(df['INDATUMA'], errors='coerce')
0    20030707
1         NaN
Name: INDATUMA, dtype: float64
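For the post-processing route, the date columns can be read as strings first, so that malformed tokens do not break the import, and coerced afterwards. This is a rough sketch under the same assumptions as before: the sample data and the names sample, date_cols and bad_rows are hypothetical and not part of the original question.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for the real data file
sample = (
    "LopNr\tINDATUMA\tUTDATUMA\n"
    "1562\t20030707\t20030711\n"
    "1562\t2003o730\t20030801\n"
)

date_cols = ["INDATUMA", "UTDATUMA"]

# Read the date columns as plain strings so malformed tokens are kept as-is
df = pd.read_csv(io.StringIO(sample), sep="\t", dtype={c: str for c in date_cols})

# Coerce each column; anything that is not a valid number becomes NaN
df[date_cols] = df[date_cols].apply(pd.to_numeric, errors="coerce")

# Rows where at least one date failed to parse
bad_rows = df[df[date_cols].isna().any(axis=1)]
print(bad_rows)
```

A column without any malformed values may come out as an integer dtype here; if a uniform float dtype is wanted, an explicit astype('float64') on those columns should do it.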
Source: https://stackoverflow.com/questions/36006248/forcing-non-numeric-characters-to-nas-in-numpy-when-reading-a-csv-to-a-pandas-d