forcing non-numeric characters to NAs in numpy (when reading a csv to a pandas dataframe)

Posted by 微笑、不失礼 on 2019-12-11 11:33:05

Question


I have records where two fields (INDATUMA and UTDATUMA) are supposed to contain numbers between 20010101 and 20141231, since they encode dates as YYYYMMDD. To allow missing values while keeping every digit of the date exact, I want to store them as floats (np.float64). I was hoping this would force the occasional misformatted field (think of 2oo41oo9) to NA, but instead it breaks the import in both pandas 0.18.0 and IOPro 1.7.2.

Is there an undocumented option I could use? Or some other approach?

The key lines of the pandas attempt are:

import numpy as np
import pandas as pd
treatments = pd.read_table(filename,usecols=[0,3,4,6], engine='c', dtype={'LopNr':np.uint32,'INDATUMA':np.float64,'UTDATUMA':np.float64,'DIAGNOS':object})

This fails with the error ValueError: invalid literal for float(): 2003o730.

I tried the following in IOPro, just in case:

import iopro
adapter = iopro.text_adapter(filename, parser='csv',delimiter='\t',output='dataframe',infer_types=False)
adapter.set_field_types({0: 'u4',3:'f8', 4:'f8',6:'object'})
all_treatments.append(adapter[[0,3,4,6]][:])

But this also breaks with iopro.lib.errors.DataTypeError: Could not convert token "2003o730" at record 1 field 3 to float64.Reason: unknown

The datafile starts as

LopNr   SJUKHUS MVO INDATUMA    UTDATUMA    HDIA    DIAGNOS OP  PVARD   EKOD1   EKOD2   EKOD3   EKOD4   EKOD5   ICD
1562    21001   046 20030707    20030711    I489A   I489A I509      2                       10
1562    21001   046 2003o730    20030801    I501    I501 I489A  DG001   2                       10

Answer 1:


You can use the converters parameter of read_table, which takes a function per column and applies it to every raw string value in that column:

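def converter(num):
    # num arrives as the raw string from the file
    try:
        return float(num)
    except ValueError:
        # not a valid number (e.g. '2003o730' or an empty field)
        return np.nan

# apply the converter to each date column
converters = {'INDATUMA': converter, 'UTDATUMA': converter}

df = pd.read_table(filename, converters=converters)
print(df)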
   LopNr  SJUKHUS  MVO  INDATUMA  UTDATUMA   HDIA DIAGNOS     OP  PVARD  \
0   1562    21001   46  20030707  20030711  I489A   I489A   I509      2   
1   1562    21001   46       NaN  20030801   I501    I501  I489A  DG001   

   EKOD1  EKOD2  EKOD3  EKOD4  EKOD5  ICD  
0     10    NaN    NaN    NaN    NaN  NaN  
1      2     10    NaN    NaN    NaN  NaN  
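
To try the converter approach without the original file, here is a minimal self-contained sketch. The in-memory sample, the reduced set of columns, and the name to_float_or_nan are assumptions made for illustration; only the converters parameter itself comes from the answer above.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical tab-separated sample with only three of the question's columns.
sample = (
    "LopNr\tINDATUMA\tUTDATUMA\n"
    "1562\t20030707\t20030711\n"
    "1562\t2003o730\t20030801\n"
)


def to_float_or_nan(token):
    """Return the raw string token as a float, or NaN if it is not a number."""
    try:
        return float(token)
    except ValueError:
        return np.nan


df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    converters={"INDATUMA": to_float_or_nan, "UTDATUMA": to_float_or_nan},
)

# The malformed value in the second row becomes NaN instead of raising.
print(df["INDATUMA"].isna().tolist())
```

Catching only ValueError keeps unrelated failures visible, which a bare except would hide.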

Alternatively, read the file first and post-process the column with to_numeric, passing errors='coerce' so that unparseable values become NaN:

df['INDATUMA'] = pd.to_numeric(df['INDATUMA'], errors='coerce')
print(df['INDATUMA'])
0    20030707
1         NaN
Name: INDATUMA, dtype: float64
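
The post-processing route can be sketched end to end as follows. Reading the date columns as strings first (dtype=str) is an assumption added here so that the import itself cannot fail on a malformed value; the sample data and the names date_cols and bad_rows are likewise hypothetical.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample with only three of the question's columns.
sample = (
    "LopNr\tINDATUMA\tUTDATUMA\n"
    "1562\t20030707\t20030711\n"
    "1562\t2003o730\t20030801\n"
)

date_cols = ["INDATUMA", "UTDATUMA"]

# Read the date columns as plain strings so no parsing happens at import time.
df = pd.read_csv(io.StringIO(sample), sep="\t", dtype={c: str for c in date_cols})

# Coerce afterwards: anything that is not a valid number becomes NaN.
for col in date_cols:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Rows with at least one unparseable date, e.g. for later inspection.
bad_rows = df[df[date_cols].isna().any(axis=1)]
print(bad_rows)
```

Compared with converters, this keeps the parsing step simple and makes it easy to list the offending rows before deciding how to handle them.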


Source: https://stackoverflow.com/questions/36006248/forcing-non-numeric-characters-to-nas-in-numpy-when-reading-a-csv-to-a-pandas-d
