python: convert numerical data in pandas dataframe to floats in the presence of strings

难免孤独 2020-12-18 21:58

I've got a pandas dataframe with a column 'cap'. This column mostly consists of floats but has a few strings in it, for instance at index 2.

df =
    cap
         


        
4 answers
  • 2020-12-18 22:37

    First of all, the way you import your CSV is redundant. Instead of doing:

    df = DataFrame(pd.read_csv(myfile.file))
    

    You can do directly:

    df = pd.read_csv(myfile.file)
    

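    Then convert the column to float, turning anything that is not a number into NaN. Note that pd.to_numeric accepts a Series or other 1-d input, not a whole DataFrame, so apply it to the column:

    df['cap'] = pd.to_numeric(df['cap'], errors='coerce')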
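
As a minimal sketch of the conversion above: the sample values below are hypothetical stand-ins for the question's 'cap' column (numeric strings plus one 'na'), not the asker's actual file.

```python
import pandas as pd

# Hypothetical sample mirroring the question: numeric strings plus one 'na'
df = pd.DataFrame({"cap": ["5.2", "na", "2.2", "7.6", "7.5", "3.0"]})

# Convert a single column; values that cannot be parsed become NaN
df["cap"] = pd.to_numeric(df["cap"], errors="coerce")

print(df["cap"].dtype)         # dtype of the converted column
print(df["cap"].isna().sum())  # number of values that were coerced to NaN
```

To convert every column of a frame this way, `df.apply(pd.to_numeric, errors="coerce")` applies the same call column by column.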
    
  • 2020-12-18 22:46

    I tried an alternative on the above:

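    import numpy as np

    for num, item in enumerate(data['col']):
        try:
            float(item)
        except (TypeError, ValueError):
            # Not a number: overwrite by position (avoids chained assignment)
            data.iloc[num, data.columns.get_loc('col')] = np.nan
    # The remaining values are numeric, so the column can now become float64
    data['col'] = data['col'].astype(float)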
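
The same check can be done without a Python-level loop. The sketch below assumes a hypothetical `data` frame with a 'col' column and uses `pd.to_numeric` to report which entries could not be parsed, which may help when deciding whether coercing them to NaN is acceptable.

```python
import pandas as pd

# Hypothetical sample: numeric strings, one bad string, one missing value
data = pd.DataFrame({"col": ["5.2", "na", "2.2", None]})

# Unparseable entries become NaN instead of raising
converted = pd.to_numeric(data["col"], errors="coerce")

# True where a value was present but could not be parsed as a number
bad = converted.isna() & data["col"].notna()

# The original strings that failed to convert
offending = data.loc[bad, "col"].tolist()
print(offending)
```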
    
  • 2020-12-18 22:57

    Calculations on columns of float64 dtype (rather than object) are much more efficient, so float64 is usually preferred. It also lets you do numeric calculations that fail on object columns. For this reason it is recommended to use NaN for missing data (rather than your own placeholder, or None).

    Is this really the answer you want?

    In [11]: df.sum()  # all strings
    Out[11]: 
    cap    5.2na2.27.67.53.0
    dtype: object
    
    In [12]: df.apply(lambda f: to_number(f[0]), axis=1).sum()  # floats and 'na' strings
    TypeError: unsupported operand type(s) for +: 'float' and 'str'
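
To illustrate why the dtype matters, here is a small sketch with hypothetical values: summing an object column of strings concatenates them, while the float64 version adds them.

```python
import pandas as pd

# Hypothetical values; dtype=object keeps them as plain Python strings
strings = pd.Series(["5.2", "2.2", "7.6"], dtype=object)
floats = pd.to_numeric(strings)  # float64 dtype

concatenated = strings.sum()  # string concatenation, not arithmetic
total = floats.sum()          # arithmetic sum

print(repr(concatenated))
print(total)
```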
    

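    You should coerce to floats. This answer originally used df.convert_objects(convert_numeric=True), which has since been removed from pandas; pd.to_numeric with errors='coerce' is the current equivalent:

    In [21]: df.apply(pd.to_numeric, errors='coerce')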
    Out[21]: 
       cap
    0  5.2
    1  NaN
    2  2.2
    3  7.6
    4  7.5
    5  3.0
    

    Or read it in directly as a csv, by appending 'na' to the list of values to be considered NaN:

    In [22]: pd.read_csv(myfile.file, na_values=['na'])
    Out[22]: 
       cap
    0  5.2
    1  NaN
    2  2.2
    3  7.6
    4  7.5
    5  3.0
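
A self-contained sketch of the `na_values` approach: the CSV text below is a hypothetical stand-in for the question's file, read through `io.StringIO` so the example runs without a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for the question's file
csv_text = "cap\n5.2\nna\n2.2\n7.6\n7.5\n3.0\n"

# Treat the literal string 'na' as missing while parsing
df = pd.read_csv(io.StringIO(csv_text), na_values=["na"])

print(df["cap"].dtype)
print(df["cap"].sum())
```

Handling the placeholder at read time means the column arrives as float64 and no later conversion step is needed.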
    

    In either case, sum (and many other pandas functions) will now work:

    In [23]: df.sum()
    Out[23]:
    cap    25.5
    dtype: float64
    

    As Jeff advises:

    repeat 3 times fast: object==bad, float==good

  • 2020-12-18 22:58

    Here is a possible workaround.

    First, define a function that converts a value to float only when it parses as a number, and returns it unchanged otherwise:

    def to_number(s):
        try:
            s1 = float(s)
            return s1
        except ValueError:
            return s
    

    and then you apply it row by row.


    Example:

    given

     df 
         0
      0  a
      1  2
    

    where both a and 2 are strings, we do the conversion via

    converted = df.apply(lambda f: to_number(f[0]), axis=1)

    converted
    0      a
    1    2.0
    dtype: object


    A direct check on the types:

    type(converted.iloc[0])
    str

    type(converted.iloc[1])
    float
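
For a single column, the same idea can be sketched with `Series.map` instead of a row-wise `apply`. The frame below is a hypothetical reconstruction of the two-row example; note that the result keeps object dtype because it still mixes strings and floats.

```python
import pandas as pd


def to_number(s):
    """Return float(s) when s parses as a number, otherwise s unchanged."""
    try:
        return float(s)
    except ValueError:
        return s


# Hypothetical frame matching the example: both entries are strings
df = pd.DataFrame({0: ["a", "2"]})

# Apply the converter element by element to the one column
converted = df[0].map(to_number)

print(converted.tolist())
print(converted.dtype)
```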
    