pandas: to_numeric for multiple columns

Submitted anonymously (unverified) on 2019-12-03 02:20:02

Question:

I'm working with the following df:

c.sort_values('2005', ascending=False).head(3)

       GeoName      ComponentName  IndustryId  IndustryClassification                                  Description  2004   2005   2006   2007   2008  2009  2010  2011  2012  2013   2014
37926  Alabama  Real GDP by state           9                     213                Support activities for mining    99     98    117    117    115    87    96    95   103   102   (NA)
37951  Alabama  Real GDP by state          34                      42                              Wholesale trade  9898  10613  10952  11034  11075  9722  9765  9703  9600  9884  10199
37932  Alabama  Real GDP by state          15                     327   Nonmetallic mineral products manufacturing   980    968    940   1084    861   724   714   701   589   641   (NA)

I want to force a numeric type on all of the year columns:

c['2014'] = pd.to_numeric(c['2014'], errors='coerce') 

Is there an easy way to do this, or do I have to type them all out?

Answer 1:

UPDATE: you don't need to convert your values afterwards; you can do it on the fly when reading your CSV:

In [165]: df = pd.read_csv(url, index_col=0, na_values=['(NA)']).fillna(0)

In [166]: df.dtypes
Out[166]:
GeoName                    object
ComponentName              object
IndustryId                  int64
IndustryClassification     object
Description                object
2004                        int64
2005                        int64
2006                        int64
2007                        int64
2008                        int64
2009                        int64
2010                        int64
2011                        int64
2012                        int64
2013                        int64
2014                      float64
dtype: object
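
For reference, here is a minimal, self-contained sketch of the same on-the-fly conversion; the CSV snippet below is made up to mimic the question's data and is not part of the original post:

import io
import pandas as pd

# made-up CSV snippet with '(NA)' used as the missing-value marker
csv_data = io.StringIO(
    "GeoName,Description,2013,2014\n"
    "Alabama,Support activities for mining,102,(NA)\n"
    "Alabama,Wholesale trade,9884,10199\n"
)

# na_values=['(NA)'] turns '(NA)' into NaN while parsing, so the year columns
# come out numeric (float64 where NaN is present) with no extra conversion step
df = pd.read_csv(csv_data, na_values=['(NA)'])
print(df.dtypes)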

If you need to convert multiple columns to numeric dtypes, use the following technique:

Sample source DF:

In [271]: df
Out[271]:
     id    a  b  c  d  e    f
0  id_3  AAA  6  3  5  8    1
1  id_9    3  7  5  7  3  BBB
2  id_7    4  2  3  5  4    2
3  id_0    7  3  5  7  9    4
4  id_0    2  4  6  4  0    2

In [272]: df.dtypes
Out[272]:
id    object
a     object
b      int64
c      int64
d      int64
e      int64
f     object
dtype: object

Converting selected columns to numeric dtypes:

In [273]: cols = df.columns.drop('id')

In [274]: df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')

In [275]: df
Out[275]:
     id    a  b  c  d  e    f
0  id_3  NaN  6  3  5  8  1.0
1  id_9  3.0  7  5  7  3  NaN
2  id_7  4.0  2  3  5  4  2.0
3  id_0  7.0  3  5  7  9  4.0
4  id_0  2.0  4  6  4  0  2.0

In [276]: df.dtypes
Out[276]:
id     object
a     float64
b       int64
c       int64
d       int64
e       int64
f     float64
dtype: object

PS: if you want to select all string (object) columns, use the following simple trick:

cols = df.columns[df.dtypes.eq('object')] 
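
As a quick, self-contained sketch of how that trick combines with the conversion above (the frame is rebuilt from the In [271] sample; dropping 'id' is just an assumption that it should stay a string):

import pandas as pd

# rebuild the sample frame from In [271]
df = pd.DataFrame({
    'id': ['id_3', 'id_9', 'id_7', 'id_0', 'id_0'],
    'a':  ['AAA', '3', '4', '7', '2'],
    'b':  [6, 7, 2, 3, 4],
    'c':  [3, 5, 3, 5, 6],
    'd':  [5, 7, 5, 7, 4],
    'e':  [8, 3, 4, 9, 0],
    'f':  ['1', 'BBB', '2', '4', '2'],
})

# pick the object (string) columns, then drop 'id' so it stays a string
cols = df.columns[df.dtypes.eq('object')].drop('id')
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
print(df.dtypes)   # a and f become float64, id remains object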


Answer 2:

Another way is to use apply as a one-liner:

cols = ['col1', 'col2', 'col3']
data[cols] = data[cols].apply(pd.to_numeric, errors='coerce', axis=1)
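
Applied to the question's frame c, the column list does not have to be typed out by hand; a short sketch, assuming the year columns run from 2004 through 2014 as shown in the question:

# build the list of year column names programmatically
cols = [str(year) for year in range(2004, 2015)]          # '2004' ... '2014'
c[cols] = c[cols].apply(pd.to_numeric, errors='coerce')   # '(NA)' becomes NaN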


Answer 3:

You can use:

print df.columns[5:]
Index([u'2004', u'2005', u'2006', u'2007', u'2008', u'2009', u'2010', u'2011',
       u'2012', u'2013', u'2014'],
      dtype='object')

for col in df.columns[5:]:
    df[col] = pd.to_numeric(df[col], errors='coerce')

print df
       GeoName      ComponentName  IndustryId  IndustryClassification  \
37926  Alabama  Real GDP by state           9                     213
37951  Alabama  Real GDP by state          34                      42
37932  Alabama  Real GDP by state          15                     327

                                       Description  2004   2005   2006   2007  \
37926                Support activities for mining    99     98    117    117
37951                              Wholesale trade  9898  10613  10952  11034
37932   Nonmetallic mineral products manufacturing   980    968    940   1084

        2008  2009  2010  2011  2012  2013     2014
37926    115    87    96    95   103   102      NaN
37951  11075  9722  9765  9703  9600  9884  10199.0
37932    861   724   714   701   589   641      NaN
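
A possible variant (not from the original answer) is to slice the year columns by label instead of by position, which avoids counting leading columns; this assumes the columns are literally named '2004' through '2014':

# label-based slice over the columns, from '2004' up to and including '2014'
for col in df.loc[:, '2004':'2014'].columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')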

Another solution with filter:

print df.filter(like='20')
       2004   2005   2006   2007   2008  2009  2010  2011  2012  2013   2014
37926    99     98    117    117    115    87    96    95   103   102   (NA)
37951  9898  10613  10952  11034  11075  9722  9765  9703  9600  9884  10199
37932   980    968    940   1084    861   724   714   701   589   641   (NA)

for col in df.filter(like='20').columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')

print df
       GeoName      ComponentName  IndustryId  IndustryClassification  \
37926  Alabama  Real GDP by state           9                     213
37951  Alabama  Real GDP by state          34                      42
37932  Alabama  Real GDP by state          15                     327

                                       Description  2004   2005   2006   2007  \
37926                Support activities for mining    99     98    117    117
37951                              Wholesale trade  9898  10613  10952  11034
37932   Nonmetallic mineral products manufacturing   980    968    940   1084

        2008  2009  2010  2011  2012  2013     2014
37926    115    87    96    95   103   102      NaN
37951  11075  9722  9765  9703  9600  9884  10199.0
37932    861   724   714   701   589   641      NaN
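
If like='20' is too loose a match (it would also pick up any other column whose name happens to contain "20"), a regex filter restricted to four-digit year names is a possible variant; this sketch is not from the original answer:

# match only columns whose names look like a four-digit year starting with 20
for col in df.filter(regex=r'^20\d{2}$').columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')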


Answer 4:

If you are looking for a range of columns, you can try this:

df.iloc[:, 7:] = df.iloc[:, 7:].astype(float)

This converts every column from the 7th onward to float. You can of course use a different type or a different column range.

I think this is useful when you have a large range of columns to convert and a lot of rows, because it doesn't make you loop over the data yourself; NumPy handles the cast more efficiently.

This is useful only if you know that all of the selected columns contain numbers: unlike to_numeric with errors='coerce', astype will not turn "bad" values (like strings) into NaN for you; it raises an error instead.
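
To illustrate that difference, here is a small self-contained sketch (the data is made up): astype raises on a value it cannot parse, while to_numeric with errors='coerce' turns it into NaN:

import pandas as pd

df = pd.DataFrame({'2013': ['102', '9884'], '2014': ['(NA)', '10199']})

# astype fails as soon as it meets a value it cannot parse
try:
    df['2014'].astype(float)
except ValueError as exc:
    print('astype raised:', exc)

# to_numeric with errors='coerce' converts the bad value to NaN instead
print(pd.to_numeric(df['2014'], errors='coerce'))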


