I'm reading a CSV file into a DataFrame. I need to strip whitespace from all the string-like cells, leaving the other cells unchanged, in Python 2.7.
When you call pandas.read_csv, you can use a regular expression that matches zero or more spaces, followed by a comma, followed by zero or more spaces as the delimiter.
For example, here's "data.csv":
In [19]: !cat data.csv
1.5, aaa, bbb , ffffd , 10 , XXX
2.5, eee, fff , ggg, 20 , YYY
(The first line ends with three spaces after XXX, while the second line ends at the last Y.)
The following uses pandas.read_csv() to read the file, with the regular expression ' *, *' as the delimiter. (Using a regular expression as the delimiter is only available with the "python" engine of read_csv().)
In [20]: import pandas as pd
In [21]: df = pd.read_csv('data.csv', header=None, delimiter=' *, *', engine='python')
In [22]: df
Out[22]:
     0    1    2      3   4    5
0  1.5  aaa  bbb  ffffd  10  XXX
1  2.5  eee  fff    ggg  20  YYY
Here is a column-wise solution with pandas apply:
import numpy as np

def strip_obj(col):
    # Only object-dtype columns can hold string values
    if col.dtype == object:
        return (col.astype(str)
                   .str.strip()
                   .replace({'nan': np.nan}))
    return col

df = df.apply(strip_obj, axis=0)
Note that this converts the values in object-dtype columns to strings, so take caution with mixed-type columns. For example, if a zip-code column contains the int 20001 and the string ' 21110 ', you will end up with the strings '20001' and '21110'.
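A minimal sketch of that caveat, reusing the strip_obj function above (the zip_code DataFrame is made up purely for illustration):

import pandas as pd

# Hypothetical mixed-type column: one int and one padded string
df = pd.DataFrame({'zip_code': [20001, ' 21110 ']})
print(df['zip_code'].tolist())    # [20001, ' 21110 ']

df = df.apply(strip_obj, axis=0)  # strip_obj as defined above
print(df['zip_code'].tolist())    # ['20001', '21110'] -- the int has become a string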
The "data['values'].str.strip()" answer above did not work for me, but I found a simple work around. I am sure there is a better way to do this. The str.strip() function works on Series. Thus, I converted the dataframe column into a Series, stripped the whitespace, replaced the converted column back into the dataframe. Below is the example code.
import pandas as pd

data = pd.DataFrame({'values': [' ABC ', ' DEF', ' GHI ']})
print('-----')
print(data)

# This returns a stripped copy but does not modify data in place
data['values'].str.strip()
print('-----')
print(data)

# Assign the stripped Series back to the column
new = data['values'].str.strip()
data['values'] = new
print('-----')
print(new)
You could use pandas' Series.str.strip() method to do this quickly for each string-like column:
>>> data = pd.DataFrame({'values': [' ABC ', ' DEF', ' GHI ']})
>>> data
   values
0   ABC  
1    DEF
2   GHI  
>>> data['values'].str.strip()
0    ABC
1    DEF
2    GHI
Name: values, dtype: object
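If you want to do this for every string-like column at once rather than one column at a time, one possible sketch (assuming the object-dtype columns are exactly the ones you want stripped) is:

import pandas as pd

data = pd.DataFrame({'values': [' ABC ', ' DEF', ' GHI '],
                     'n': [1, 2, 3]})

# Strip only the object-dtype (string-like) columns; numeric columns are left alone
obj_cols = data.select_dtypes(include=['object']).columns
data[obj_cols] = data[obj_cols].apply(lambda s: s.str.strip())
print(data)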
I found the following code useful and something that would likely help others. This snippet will allow you to delete spaces in a column as well as in the entire DataFrame, depending on your use case.
import pandas as pd

def remove_whitespace(x):
    try:
        # Remove spaces inside and outside of the string
        x = "".join(x.split())
    except AttributeError:
        # Non-string values have no .split(), so leave them unchanged
        pass
    return x

# Apply remove_whitespace to one column only
df.orderId = df.orderId.apply(remove_whitespace)
print(df)

# Apply remove_whitespace to the entire DataFrame
df = df.applymap(remove_whitespace)
print(df)
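Note that, unlike str.strip(), this removes interior spaces as well. A quick sketch with made-up data (the orderId column is just an illustration):

import pandas as pd

df = pd.DataFrame({'orderId': [' A 1 ', 'B 2', ' C3 '], 'qty': [1, 2, 3]})

df.orderId = df.orderId.apply(remove_whitespace)  # remove_whitespace as defined above
print(df.orderId.tolist())                        # ['A1', 'B2', 'C3']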
We want to:

Apply our function to each element in our DataFrame - use applymap.

Use type(x) == str (versus x.dtype == 'object'), because Pandas will label a column as object when it holds mixed datatypes (an object column may contain int and/or str).

Only strip the elements that are actually a str (and leave everything else unchanged).

Therefore, I've found the following to be the easiest:

df.applymap(lambda x: x.strip() if type(x) == str else x)
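For example, on a small mixed-type frame (the data here is made up for illustration), only the string cells get stripped:

import pandas as pd

df = pd.DataFrame({'mixed': [' ABC ', 10, ' 20 '], 'num': [1.5, 2.5, 3.5]})
df = df.applymap(lambda x: x.strip() if type(x) == str else x)
print(df['mixed'].tolist())   # ['ABC', 10, '20'] -- the int 10 is left unchanged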