Pythonic/efficient way to strip whitespace from every Pandas DataFrame cell that has a string-like object in it

悲&欢浪女 2020-12-04 12:05

I'm reading a CSV file into a DataFrame. I need to strip whitespace from all the string-like cells, leaving the other cells unchanged, in Python 2.7.

Here is what I\

8 answers
  • 2020-12-04 12:31

    When you call pandas.read_csv, you can use a regular expression that matches zero or more spaces followed by a comma followed by zero or more spaces as the delimiter.

    For example, here's "data.csv":

    In [19]: !cat data.csv
    1.5, aaa,  bbb ,  ffffd     , 10 ,  XXX   
    2.5, eee, fff  ,       ggg, 20 ,     YYY
    

    (The first line ends with three spaces after XXX, while the second line ends at the last Y.)

    The following uses pandas.read_csv() to read the file, with the regular expression ' *, *' as the delimiter. (A regular expression can be used as the delimiter only with the "python" engine of read_csv().)

    In [20]: import pandas as pd
    
    In [21]: df = pd.read_csv('data.csv', header=None, delimiter=' *, *', engine='python')
    
    In [22]: df
    Out[22]: 
         0    1    2      3   4    5
    0  1.5  aaa  bbb  ffffd  10  XXX
    1  2.5  eee  fff    ggg  20  YYY
    
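To try the regex-delimiter approach without a file on disk, here is a minimal sketch that feeds the same two rows through io.StringIO. The pattern r"\s*,\s*" is an assumed variant of ' *, *' that also absorbs tabs, and the final str.strip() on the last column is only a defensive step in case padding at the end of a line survives the split. Regex separators tend to ignore quoting, so this is probably a poor fit for files whose quoted fields contain commas.

```python
import io

import pandas as pd

# Same two rows as data.csv above, held in memory for the demo.
raw = (
    "1.5, aaa,  bbb ,  ffffd     , 10 ,  XXX   \n"
    "2.5, eee, fff  ,       ggg, 20 ,     YYY\n"
)

# A regex separator requires the slower "python" parsing engine.
df = pd.read_csv(io.StringIO(raw), header=None, sep=r"\s*,\s*", engine="python")

# Defensive: strip the last column explicitly in case end-of-line
# padding was not consumed by the separator pattern.
last = df.columns[-1]
df[last] = df[last].str.strip()

print(df)
```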
  • 2020-12-04 12:31

    Here is a column-wise solution with pandas apply:

    import numpy as np
    
    def strip_obj(col):
        if col.dtypes == object:
            return (col.astype(str)
                       .str.strip()
                       .replace({'nan': np.nan}))
        return col
    
    df = df.apply(strip_obj, axis=0)
    

    Note that this converts every value in an object-dtype column to a string, so take care with mixed-type columns. For example, if a zip code column holds the integer 20001 and the string ' 21110 ', you will end up with the strings '20001' and '21110'.

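The mixed-type caveat can be seen in a small run. This is only a sketch: the column names zip and qty are made up for illustration, and it assumes a pandas version that stores a mixed int/str column with object dtype.

```python
import numpy as np
import pandas as pd


def strip_obj(col):
    # Only object-dtype columns are touched; numeric columns pass through.
    if col.dtype == object:
        return col.astype(str).str.strip().replace({"nan": np.nan})
    return col


# "zip" mixes an int, a padded string and a missing value (hypothetical data).
df = pd.DataFrame({"zip": [20001, " 21110 ", np.nan], "qty": [1, 2, 3]})
out = df.apply(strip_obj, axis=0)

# The integer 20001 has become the string '20001'.
print(out["zip"].tolist())
print(out["qty"].tolist())
```

If the original element types matter, an element-wise approach that only touches str values is likely the safer choice.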
  • 2020-12-04 12:32

    The "data['values'].str.strip()" answer did not work for me as written, but I found a simple workaround. I am sure there is a better way to do this. The str.strip() function works on a Series and returns a new Series rather than changing the DataFrame in place. So I took the DataFrame column as a Series, stripped the whitespace, and assigned the result back to the column. Example code is below.

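    import pandas as pd
    data = pd.DataFrame({'values': ['   ABC   ', '   DEF', '  GHI  ']})
    print('-----')
    print(data)
    
    # str.strip() returns a new Series; the result is discarded here,
    # so data still contains the padded strings
    data['values'].str.strip()
    print('-----')
    print(data)
    
    # Assign the stripped Series back to the column to keep the change
    new = data['values'].str.strip()
    data['values'] = new
    print('-----')
    print(new)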
    
  • 2020-12-04 12:36

    You could use pandas' Series.str.strip() method to do this quickly for each string-like column:

    >>> data = pd.DataFrame({'values': ['   ABC   ', '   DEF', '  GHI  ']})
    >>> data
          values
    0     ABC   
    1        DEF
    2      GHI  
    
    >>> data['values'].str.strip()
    0    ABC
    1    DEF
    2    GHI
    Name: values, dtype: object
    
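To apply this to every string-like column and keep the result, something along these lines should work. It is a sketch rather than a definitive recipe: the frame and its column names are invented for the example, and is_string_dtype from pandas.api.types is used to decide which columns to touch. Keep in mind that the .str accessor returns NaN for non-string elements, so columns that mix strings with other types need separate handling.

```python
import pandas as pd
from pandas.api.types import is_string_dtype

# Hypothetical frame: two text columns and one numeric column.
df = pd.DataFrame(
    {
        "name": ["   ABC   ", "   DEF", "  GHI  "],
        "city": [" Oslo", "Rome ", " Lima "],
        "n": [1, 2, 3],
    }
)

for column in df.columns:
    # Skip numeric and other non-string columns.
    if is_string_dtype(df[column]):
        # str.strip() returns a new Series, so assign it back.
        df[column] = df[column].str.strip()

print(df)
```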
  • 2020-12-04 12:39

    I found the following code useful and it will likely help others. This snippet lets you delete whitespace in a single column or in the entire DataFrame, depending on your use case. Note that it removes whitespace inside each string as well as at the ends.

    import pandas as pd
    
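    def remove_whitespace(x):
        try:
            # remove whitespace inside and outside of the string
            x = "".join(x.split())
        except (AttributeError, TypeError):
            # not a string (e.g. a number or NaN): leave it unchanged
            pass
        return x
    
    # df is assumed to be an existing DataFrame with an orderId column
    # Apply remove_whitespace to one column only
    df.orderId = df.orderId.apply(remove_whitespace)
    print(df)
    
    
    # Apply remove_whitespace to the entire DataFrame
    # (DataFrame.applymap was renamed to DataFrame.map in pandas 2.1)
    df = df.applymap(remove_whitespace)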
    print(df)
    
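Since the question asks only for stripping, it may help to see how "".join(x.split()) differs from x.strip(). The sample string below is made up for the comparison.

```python
text = "  order  42  "

# Removes every run of whitespace, including the ones between words.
collapsed = "".join(text.split())

# Collapses interior runs to a single space and trims the ends.
normalized = " ".join(text.split())

# Trims only leading and trailing whitespace.
trimmed = text.strip()

print(repr(collapsed))   # 'order42'
print(repr(normalized))  # 'order 42'
print(repr(trimmed))     # 'order  42'
```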
  • 2020-12-04 12:50

    We want to:

    1. Apply our function to each element in our dataframe - use applymap.

    2. Use type(x)==str (versus x.dtype == 'object') because Pandas will label columns as object for columns of mixed datatypes (an object column may contain int and/or str).

    3. Maintain the datatype of each element (we don't want to convert everything to a str and then strip whitespace).

    Therefore, I've found the following to be the easiest:

    df.applymap(lambda x: x.strip() if type(x)==str else x)

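A slightly more defensive variant of the same idea is sketched below. The helper name strip_strings is hypothetical. It uses isinstance so that str subclasses are caught too, and it prefers DataFrame.map where that method exists (pandas 2.1 and later), falling back to applymap on older versions. On Python 2.7, as in the question, the check would presumably need basestring instead of str so that unicode values are stripped as well.

```python
import pandas as pd


def strip_strings(df):
    """Return a copy of df with whitespace stripped from str cells only."""

    def strip_if_str(value):
        return value.strip() if isinstance(value, str) else value

    # DataFrame.map is the newer name for element-wise application;
    # older pandas versions only provide DataFrame.applymap.
    if hasattr(pd.DataFrame, "map"):
        return df.map(strip_if_str)
    return df.applymap(strip_if_str)


# Hypothetical frame with a mixed-type column and a float column.
df = pd.DataFrame({"a": ["  x  ", 1, None], "b": [1.5, 2.5, 3.5]})
out = strip_strings(df)
print(out)
```

Non-string cells keep their original type, so the integer in column a stays an integer.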