pandas: to_numeric for multiple columns

刺人心 2020-11-27 03:47

I'm working with the following df:

c.sort_values('2005', ascending=False).head(3)
      GeoName ComponentName     IndustryId IndustryClassification Descri         


        
5 Answers
  • 2020-11-27 04:27

    You can use:

    print(df.columns[5:])
    Index(['2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011',
           '2012', '2013', '2014'],
          dtype='object')
    
    for col in df.columns[5:]:
        df[col] = pd.to_numeric(df[col], errors='coerce')
    
    print(df)
           GeoName      ComponentName  IndustryId  IndustryClassification  \
    37926  Alabama  Real GDP by state           9                     213   
    37951  Alabama  Real GDP by state          34                      42   
    37932  Alabama  Real GDP by state          15                     327   
    
                                          Description  2004   2005   2006   2007  \
    37926               Support activities for mining    99     98    117    117   
    37951                            Wholesale  trade  9898  10613  10952  11034   
    37932  Nonmetallic mineral products manufacturing   980    968    940   1084   
    
            2008  2009  2010  2011  2012  2013     2014  
    37926    115    87    96    95   103   102      NaN  
    37951  11075  9722  9765  9703  9600  9884  10199.0  
    37932    861   724   714   701   589   641      NaN  
    

    Another solution with filter:

    print(df.filter(like='20'))
           2004   2005   2006   2007   2008  2009  2010  2011  2012  2013   2014
    37926    99     98    117    117    115    87    96    95   103   102   (NA)
    37951  9898  10613  10952  11034  11075  9722  9765  9703  9600  9884  10199
    37932   980    968    940   1084    861   724   714   701   589   641   (NA)
    
    for col in df.filter(like='20').columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')
    print(df)
           GeoName      ComponentName  IndustryId  IndustryClassification  \
    37926  Alabama  Real GDP by state           9                     213   
    37951  Alabama  Real GDP by state          34                      42   
    37932  Alabama  Real GDP by state          15                     327   
    
                                          Description  2004   2005   2006   2007  \
    37926               Support activities for mining    99     98    117    117   
    37951                            Wholesale  trade  9898  10613  10952  11034   
    37932  Nonmetallic mineral products manufacturing   980    968    940   1084   
    
            2008  2009  2010  2011  2012  2013     2014  
    37926    115    87    96    95   103   102      NaN  
    37951  11075  9722  9765  9703  9600  9884  10199.0  
    37932    861   724   714   701   589   641      NaN  
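
A minimal, self-contained sketch of the loop-plus-filter approach above. The three-row frame is a hypothetical miniature of the GDP table, not the real data, and `dtype=object` is only there to mimic text columns as they arrive from a raw CSV. Note that `like='20'` matches any column label containing "20", so it relies on no other column name containing that substring.

```python
import pandas as pd

# Hypothetical miniature of the GDP table; "(NA)" marks a missing value.
df = pd.DataFrame(
    {
        "GeoName": ["Alabama", "Alabama", "Alabama"],
        "2013": ["102", "9884", "641"],
        "2014": ["(NA)", "10199", "(NA)"],
    },
    dtype=object,  # keep everything as text, like an unparsed CSV
)

# Select the year columns by label; like="20" is a substring match.
year_cols = df.filter(like="20").columns

for col in year_cols:
    # errors="coerce" replaces unparseable entries such as "(NA)" with NaN
    df[col] = pd.to_numeric(df[col], errors="coerce")

print(df.dtypes)
```

Columns without missing values come out as integers, while a column that received NaN becomes float64, which matches the `10199.0` shown in the output above.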
    
  • 2020-11-27 04:30

    UPDATE: you don't need to convert your values afterwards; you can do it on the fly when reading your CSV:

    In [165]: df=pd.read_csv(url, index_col=0, na_values=['(NA)']).fillna(0)
    
    In [166]: df.dtypes
    Out[166]:
    GeoName                    object
    ComponentName              object
    IndustryId                  int64
    IndustryClassification     object
    Description                object
    2004                        int64
    2005                        int64
    2006                        int64
    2007                        int64
    2008                        int64
    2009                        int64
    2010                        int64
    2011                        int64
    2012                        int64
    2013                        int64
    2014                      float64
    dtype: object
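
To try the `na_values` idea without the original file, here is a small sketch that reads a hypothetical two-row CSV from an in-memory string (`csv_text` and its contents are made up for illustration; the real data lives behind `url`).

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV; "(NA)" marks a missing value.
csv_text = """idx,GeoName,2013,2014
37926,Alabama,102,(NA)
37951,Alabama,9884,10199
"""

# Without na_values, the 2014 column is read as text because of "(NA)".
raw = pd.read_csv(io.StringIO(csv_text), index_col=0)

# With na_values, "(NA)" is parsed as NaN and the column becomes numeric.
parsed = pd.read_csv(io.StringIO(csv_text), index_col=0, na_values=["(NA)"])

# Filling the gaps keeps the float64 dtype; cast explicitly if ints are needed.
filled_2014 = parsed["2014"].fillna(0)

print(raw["2014"].tolist())
print(parsed.dtypes)
```

This is also why 2014 shows up as float64 in the dtypes listing above: the column held NaN before `fillna(0)` ran, and filling does not cast it back to an integer dtype.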
    

    If you need to convert multiple columns to numeric dtypes, use the following technique:

    Sample source DF:

    In [271]: df
    Out[271]:
         id    a  b  c  d  e    f
    0  id_3  AAA  6  3  5  8    1
    1  id_9    3  7  5  7  3  BBB
    2  id_7    4  2  3  5  4    2
    3  id_0    7  3  5  7  9    4
    4  id_0    2  4  6  4  0    2
    
    In [272]: df.dtypes
    Out[272]:
    id    object
    a     object
    b      int64
    c      int64
    d      int64
    e      int64
    f     object
    dtype: object
    

    Converting selected columns to numeric dtypes:

    In [273]: cols = df.columns.drop('id')
    
    In [274]: df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
    
    In [275]: df
    Out[275]:
         id    a  b  c  d  e    f
    0  id_3  NaN  6  3  5  8  1.0
    1  id_9  3.0  7  5  7  3  NaN
    2  id_7  4.0  2  3  5  4  2.0
    3  id_0  7.0  3  5  7  9  4.0
    4  id_0  2.0  4  6  4  0  2.0
    
    In [276]: df.dtypes
    Out[276]:
    id     object
    a     float64
    b       int64
    c       int64
    d       int64
    e       int64
    f     float64
    dtype: object
    

    P.S. If you want to select all string (object) columns, use the following simple trick:

    cols = df.columns[df.dtypes.eq('object')]
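
One caveat with selecting every object column: identifier columns such as `id` are object too, and coercing them would turn every value into NaN. A small sketch, using a hypothetical three-row frame, that selects the object columns and then drops the ones that should stay text:

```python
import pandas as pd

# Hypothetical sample; dtype=object mimics text columns read from a file.
df = pd.DataFrame(
    {
        "id": pd.Series(["id_3", "id_9", "id_7"], dtype=object),
        "a": pd.Series(["AAA", "3", "4"], dtype=object),
        "b": [6, 7, 2],
    }
)

# Every object-dtype column, including the identifier column
obj_cols = df.columns[df.dtypes.eq("object")]

# Leave out the columns that must remain text before coercing
cols = obj_cols.drop("id")
df[cols] = df[cols].apply(pd.to_numeric, errors="coerce")

print(df)
```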
    
  • 2020-11-27 04:37

    If you are looking for a range of columns, you can try this:

    df[df.columns[7:]] = df.iloc[:, 7:].astype(float)
    

    The example above converts every column from position 7 (zero-based, so the eighth column) through the last one to float. You can of course use a different dtype or a different range.

    I think this is useful when you have a wide range of columns to convert and a lot of rows. You don't have to loop over the values yourself, and I believe NumPy does the conversion more efficiently.

    This only works if you know that all the selected columns contain nothing but numbers. It will not turn "bad values" (such as non-numeric strings) into NaN for you; astype(float) raises a ValueError instead.
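
To make the difference on dirty data concrete, here is a small sketch with a hypothetical frame (the column names `name`, `c1` and `c2` are made up). It tries `astype(float)` on a positional column range first and falls back to `pd.to_numeric` with `errors='coerce'` when a value cannot be parsed.

```python
import pandas as pd

# Hypothetical sample: "oops" is a bad value hiding in a numeric column.
df = pd.DataFrame(
    {
        "name": ["x", "y", "z"],
        "c1": pd.Series(["1", "2", "3"], dtype=object),
        "c2": pd.Series(["4.5", "oops", "6"], dtype=object),
    }
)

tail = df.columns[1:]  # labels of every column from position 1 onward

try:
    # Positional column slice: all rows, columns 1 to the end
    df[tail] = df.iloc[:, 1:].astype(float)
    failed = False
except ValueError:
    # astype(float) rejects the whole conversion if any value is not numeric
    failed = True

if failed:
    # Fallback: convert column by column, turning bad values into NaN
    df[tail] = df[tail].apply(pd.to_numeric, errors="coerce")

print(df)
```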

  • 2020-11-27 04:40

    Another way is a one-liner using apply:

    cols = ['col1', 'col2', 'col3']
    data[cols] = data[cols].apply(pd.to_numeric, errors='coerce')
    
  • 2020-11-27 04:45
    df[cols] = pd.to_numeric(df[cols].stack(), errors='coerce').unstack()
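
A short sketch of what the stack/unstack line does, on a hypothetical frame: `stack()` reshapes the selected columns into one long Series, `pd.to_numeric` converts it in a single call, and `unstack()` restores the original shape. One side effect worth knowing, as far as I can tell: because the long Series has a single dtype, a NaN anywhere makes every selected column float64, even columns that contain only whole numbers.

```python
import pandas as pd

# Hypothetical sample; dtype=object mimics text columns read from a file.
df = pd.DataFrame(
    {
        "a": ["1", "x", "3"],
        "b": ["4", "5", "6"],
    },
    dtype=object,
)
cols = ["a", "b"]

# Long format: one value per (row, column) pair
stacked = df[cols].stack()

# Convert once, then pivot the column level back out
converted = pd.to_numeric(stacked, errors="coerce").unstack()

df[cols] = converted
print(df.dtypes)
```

With the per-column `apply(pd.to_numeric, ...)` variant, column `b` would keep an integer dtype instead.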
    