In pandas, when using read_csv(), how do you assign NaN to a value that doesn't match the intended dtype?

抹茶落季 2020-12-15 11:56

Note: Please excuse my poor English; feel free to edit the question's title, or the following text, to make it more understandable.

I have t

2 Answers
  • 2020-12-15 12:10

    A great answer, wordsmith! Just to add a couple of minor things:

    • there is a typo in the answer, data.test_column should probably be moto.test_column
    • convert_objects is now deprecated, in favor of type-specific methods applied to columns one at a time

    A full working example, including the dropping of the lines containing read errors (not column-count errors, which are covered by read_csv(..., error_bad_lines=False)), would be:

    moto = pd.read_csv('reporte.csv')
    moto.test_column = pd.to_numeric(moto.test_column, errors='coerce')
    moto.dropna(axis='index',how='any',inplace=True)
    

    I would also like to offer an alternative:

    from pandas import read_csv
    import numpy as np
    
    # if the data is not a valid "number", return a NaN
    # note that it must be a float, as NaN is a float:  print(type(np.nan))
    def valid_float(y):
      try:
        return float(y)
      except ValueError:
        return np.nan
    
    # assuming the first row of the file contains the column names 'A','B','C'...
    data = read_csv('test.csv',header=0,usecols=['A','B','D'],
       converters={'A': valid_float, 'B': valid_float, 'D': valid_float} )
    
    # delete all rows ('index') with an invalid numerical entry
    data.dropna(axis='index',how='any',inplace=True)
    

    This is fairly compact yet still readable. For a true one-liner, it would be great to (1) rewrite the validation function as a lambda, and (2) drop the defective rows directly in the call to read_csv, but I could not figure out how to do either of these.
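    One way to approximate (1) is to lean on pd.to_numeric itself, since it already does the try/except internally when errors='coerce' is set; a lambda cannot contain a try/except block, but it can wrap that call. A self-contained sketch (the column names and sample data here are hypothetical, standing in for 'test.csv'):

    ```python
    import io
    import pandas as pd

    # a lambda cannot contain try/except, but pd.to_numeric applied to a
    # single value already returns NaN on failure when errors='coerce'
    to_float = lambda y: pd.to_numeric(y, errors='coerce')

    # inline stand-in for the 'test.csv' file used above
    csv = io.StringIO("A,B,C,D\n1,2,x,4\n5,bad,y,8\n9,10,z,12\n")

    data = pd.read_csv(csv, header=0, usecols=['A', 'B', 'D'],
                       converters={c: to_float for c in ['A', 'B', 'D']})

    # drop the row whose 'B' entry ('bad') was coerced to NaN
    data.dropna(axis='index', how='any', inplace=True)
    print(data)
    ```

    This still isn't a one-liner, but it removes the need for a separate named validation function.
    
    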

  • 2020-12-15 12:14

    I tried creating a csv to reproduce this warning but couldn't on pandas 0.18, so I can only recommend two methods for handling it:

    First

    If you know that your missing values are all marked by a string 'none', then do this:

    moto = pd.read_csv("test.csv", na_values=['none'])
    

    You can also add other markers to the na_values list that should be converted to NaNs.
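    For example (the marker strings and sample data below are made up for illustration):

    ```python
    import io
    import pandas as pd

    # treat 'none', 'N/A' and '?' all as missing values at read time
    csv = io.StringIO("A,B\n1,none\n2,N/A\n3,?\n4,5\n")
    moto = pd.read_csv(csv, na_values=['none', 'N/A', '?'])

    print(moto['B'].isna().sum())  # 3 entries were read as NaN
    ```

    Note that pandas already recognizes a default set of markers (including 'N/A' and the empty string); na_values extends that set rather than replacing it.
    
    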

    Second

    Try your first line again, but without the dtype option.

    moto = pd.read_csv('reporte.csv')
    

    The read succeeds because you are only getting a warning. Now run moto.dtypes to see which columns are objects. For the ones you want to change, do the following:

    moto.test_column = pd.to_numeric(moto.test_column, errors='coerce')
    

    The 'coerce' option will convert any problematic entries, like 'none', to NaNs.
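    A quick self-contained illustration of that coercion (with made-up data):

    ```python
    import pandas as pd

    # a column read as object dtype because of the 'none' entry
    s = pd.Series(['1.5', 'none', '3'])

    # 'coerce' turns unparseable entries into NaN instead of raising
    out = pd.to_numeric(s, errors='coerce')
    print(out.isna().tolist())  # only the 'none' entry becomes NaN
    ```
    
    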

    To convert the entire dataframe at once, you can use convert_objects. You could also use it on a single column, but that usage is deprecated in favor of to_numeric. Its convert_numeric option does the coercion to NaNs:

    moto = moto.convert_objects(convert_numeric=True)
    

    After any of these methods, proceed with fillna to do what you need to.
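    For instance, replacing the coerced NaNs with a placeholder (0 here is an arbitrary choice, and the column name is hypothetical):

    ```python
    import pandas as pd

    moto = pd.DataFrame({'test_column': ['1', 'none', '3']})

    # coerce the bad entry to NaN, then fill it with a placeholder value
    moto['test_column'] = pd.to_numeric(moto['test_column'], errors='coerce')
    moto['test_column'] = moto['test_column'].fillna(0)
    print(moto['test_column'].tolist())  # [1.0, 0.0, 3.0]
    ```
    
    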
