I've got a pandas dataframe with a column 'cap'. This column mostly consists of floats but has a few strings in it, for instance at index 2.
df =
   cap
0  5.2
1   na
2  2.2
3  7.6
4  7.5
5  3.0
First of all, the way you import your CSV is redundant. Instead of doing:
df = DataFrame(pd.read_csv(myfile.file))
you can simply do:
df = pd.read_csv(myfile.file)
Then, to convert the column to float and turn whatever is not a number into NaN (note that pd.to_numeric operates on a Series, so apply it to the column rather than the whole DataFrame):
df['cap'] = pd.to_numeric(df['cap'], errors='coerce')
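As a self-contained sketch, using sample values taken from the outputs shown further down (so treat the exact data as an assumption):

import pandas as pd

# sample data mirroring the question's 'cap' column
df = pd.DataFrame({'cap': [5.2, 'na', 2.2, 7.6, 7.5, 3.0]})

df['cap'] = pd.to_numeric(df['cap'], errors='coerce')  # the 'na' entry becomes NaN

print(df['cap'].dtype)  # float64
print(df['cap'])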
I tried an alternative on the above:
for num, item in enumerate(data['col']):
try:
float(item)
except:
data['col'][num] = nan
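Note that even after this loop the column still holds Python objects, so an explicit cast is needed before the dtype becomes float64. A minimal sketch of the full round trip (the DataFrame here is made up for illustration and assumes the default RangeIndex):

import numpy as np
import pandas as pd

data = pd.DataFrame({'col': [5.2, 'na', 2.2]})

for num, item in enumerate(data['col']):
    try:
        float(item)
    except ValueError:
        data.loc[num, 'col'] = np.nan

print(data['col'].dtype)                 # still object
data['col'] = data['col'].astype(float)  # cast now that only floats/NaN remain
print(data['col'].dtype)                 # float64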
Calculations on columns of float64 dtype (rather than object) are much more efficient, so converting is usually preferred, and it also lets you do other numeric operations. Because of this, it is recommended to use NaN for missing data (rather than your own placeholder, or None).
In [11]: df.sum() # all strings
Out[11]:
cap    5.2na2.27.67.53.0
dtype: object
In [12]: df.apply(lambda f: to_number(f[0]), axis=1).sum() # floats and 'na' strings
TypeError: unsupported operand type(s) for +: 'float' and 'str'
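The root cause is visible in the dtype: because of the stray strings, pandas stores the whole column as object. A quick check, assuming df is the frame from the question:

print(df['cap'].dtype)  # object -- the column holds Python objects, not float64
print(df.dtypes)        # cap    object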
You should use convert_objects with convert_numeric=True to coerce to floats:
In [21]: df.convert_objects(convert_numeric=True)
Out[21]:
cap
0 5.2
1 NaN
2 2.2
3 7.6
4 7.5
5 3.0
Or read it in directly from the CSV, adding 'na' to the list of values to be treated as NaN:
In [22]: pd.read_csv(myfile.file, na_values=['na'])
Out[22]:
cap
0 5.2
1 NaN
2 2.2
3 7.6
4 7.5
5 3.0
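Since myfile.file is specific to the question, here is a self-contained sketch of the same idea using an in-memory CSV (io.StringIO stands in for the real file, and the values are assumed from the outputs above):

import io
import pandas as pd

csv_data = io.StringIO("cap\n5.2\nna\n2.2\n7.6\n7.5\n3.0\n")
df = pd.read_csv(csv_data, na_values=['na'])  # lowercase 'na' is not in the default NaN list

print(df['cap'].dtype)  # float64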
In either case, sum (and many other pandas functions) will now work:
In [23]: df.sum()
Out[23]:
cap 25.5
dtype: float64
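Other reductions behave the same way once the column is float64, and NaN values are simply skipped (skipna defaults to True). A small self-contained sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'cap': [5.2, np.nan, 2.2, 7.6, 7.5, 3.0]})

print(df['cap'].sum())   # 25.5 -- the NaN row is ignored
print(df['cap'].mean())  # 5.1  -- the mean over the five non-missing values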
As Jeff advises:
repeat 3 times fast: object==bad, float==good
Here is a possible workaround.
First, define a function that converts a value to float only when possible:
def to_number(s):
    try:
        s1 = float(s)
        return s1
    except ValueError:
        return s
and then you apply it row by row.
Example:
given
df
   0
0  a
1  2
where both a and 2 are strings, we do the conversion via
converted = df.apply(lambda f: to_number(f[0]), axis=1)
converted
0    a
1    2
A direct check on the types:
type(converted.iloc[0])
str
type(converted.iloc[1])
float
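One caveat with this approach: because the result still mixes str and float, the Series keeps the object dtype, which is exactly what made sum() fail with the TypeError shown above. A short sketch reusing to_number to illustrate:

import pandas as pd

def to_number(s):
    try:
        return float(s)
    except ValueError:
        return s

df = pd.DataFrame(['a', '2'])
converted = df.apply(lambda f: to_number(f[0]), axis=1)

print(converted.dtype)  # object -- still a mix of str and float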