Replacing newlines with spaces for str columns through pandas dataframe

Submitted by 痴心易碎 on 2019-12-10 11:17:38

Question


Given an example dataframe whose columns 2 and 3 contain free text, e.g.

>>> import pandas as pd
>>> lol = [[1,2,'abc','foo\nbar'], [3,1, 'def\nhaha', 'love it\n']]
>>> pd.DataFrame(lol)
   0  1          2          3
0  1  2        abc   foo\nbar
1  3  1  def\nhaha  love it\n

The goal is to replace each \n with a space and strip the strings in columns 2 and 3 to achieve:

>>> pd.DataFrame(lol)
   0  1         2        3
0  1  2       abc  foo bar
1  3  1  def haha  love it

How can I replace newlines with spaces in specific columns of a pandas dataframe?

I have tried this:

>>> import pandas as pd
>>> lol = [[1,2,'abc','foo\nbar'], [3,1, 'def\nhaha', 'love it\n']]

>>> replace_and_strip = lambda x: x.replace('\n', ' ').strip()

>>> lol2 = [[replace_and_strip(col) if type(col) == str else col for col in list(row)] for idx, row in pd.DataFrame(lol).iterrows()]

>>> pd.DataFrame(lol2)
   0  1         2        3
0  1  2       abc  foo bar
1  3  1  def haha  love it

But there must be a better/simpler way.


Answer 1:


Use replace: first strip the leading and trailing whitespace, then replace the remaining \n with a space:

df = df.replace({r'\s+$': '', r'^\s+': ''}, regex=True).replace(r'\n',  ' ', regex=True)
print (df)
   0  1         2        3
0  1  2       abc  foo bar
1  3  1  def haha  love it
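The snippet above assumes that df already exists and it touches every column. A minimal self-contained sketch follows, assuming df is built from the question's lol list and that only columns 2 and 3 should be cleaned. The name text_cols is made up for illustration and is not part of the original answer.

```python
import pandas as pd

lol = [[1, 2, 'abc', 'foo\nbar'], [3, 1, 'def\nhaha', 'love it\n']]
df = pd.DataFrame(lol)

# Hypothetical name: the labels of the columns that hold free text.
text_cols = [2, 3]

# Strip leading/trailing whitespace first, then turn the remaining
# newlines into spaces. Only the selected columns are modified.
df[text_cols] = (
    df[text_cols]
    .replace({r'^\s+': '', r'\s+$': ''}, regex=True)
    .replace(r'\n', ' ', regex=True)
)
print(df)
```

Restricting the replacement to text_cols leaves the numeric columns out of the regex pass entirely, which is closer to the "specific columns" wording of the question.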



Answer 2:


You may use the following approach, which combines two regex replacements:

>>> df.replace({ r'\A\s+|\s+\Z': '', '\n' : ' '}, regex=True, inplace=True)
>>> df
   0  1         2        3
0  1  2       abc  foo bar
1  3  1  def haha  love it

Details

  • '\A\s+|\s+\Z' -> '' will act like strip() removing all leading and trailing whitespace:
    • \A\s+ - matches 1 or more whitespace symbols at the start of the string
    • | - or
    • \s+\Z - matches 1 or more whitespace symbols at the end of the string
  • '\n' -> ' ' will replace any newline with a space.
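The two patterns described above can be checked outside pandas with the standard re module. This is only a sketch on a few sample strings (the second sample has leading spaces added for illustration), and the variable names are made up.

```python
import re

# \A and \Z anchor to the very start and very end of the string.
strip_pattern = re.compile(r'\A\s+|\s+\Z')

samples = ['foo\nbar', '  def\nhaha', 'love it\n']
cleaned_samples = []
for s in samples:
    stripped = strip_pattern.sub('', s)    # behaves like s.strip() here
    cleaned = stripped.replace('\n', ' ')  # inner newlines become spaces
    cleaned_samples.append(cleaned)
    print(repr(s), '->', repr(cleaned))
```

For these samples the result matches s.strip().replace('\n', ' '), which is the behaviour the question's lambda was after.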



Answer 3:


You can use select_dtypes to select the columns of type object and then call applymap on those columns.

Because these functions have no inplace argument, a workaround is to assign the result back to the dataframe:

# df is the dataframe built from the question's list: df = pd.DataFrame(lol)
strs = df.select_dtypes(include=['object']).applymap(lambda x: x.replace('\n', ' ').strip())
df[strs.columns] = strs
df
#   0  1         2        3
#0  1  2       abc  foo bar
#1  3  1  def haha  love it
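One caveat with the lambda above is that it raises AttributeError when an object column contains a non-string value such as None. Also, as far as I know, pandas 2.1 and later deprecate applymap in favour of DataFrame.map. The sketch below avoids both issues by applying Series.map column by column with an isinstance guard. The sample data, the explicit dtype=object, and the clean helper are assumptions made for illustration.

```python
import pandas as pd

# Illustrative data: one text column containing a missing value.
# dtype=object is forced so the column is selected by include=['object'].
df = pd.DataFrame({
    'id': [1, 2, 3],
    'note': pd.Series(['foo\nbar', None, ' love it\n'], dtype=object),
})

def clean(value):
    """Replace newlines with spaces and strip; leave non-strings untouched."""
    if isinstance(value, str):
        return value.replace('\n', ' ').strip()
    return value

# Apply the helper element-wise to every object column, then write back.
strs = df.select_dtypes(include=['object']).apply(lambda col: col.map(clean))
df[strs.columns] = strs
print(df)
```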



Answer 4:


Adding to the other nice answers, this is a vectorized version of your initial idea:

columns = [2, 3]
# Assign one column at a time so each value stays in its own row
for col in columns:
    df.iloc[:, col] = df.iloc[:, col].str.strip().str.replace('\n', ' ')

Details:

In [49]: for col in columns:
    ...:     df.iloc[:, col] = df.iloc[:, col].str.strip().str.replace('\n', ' ')

In [50]: df
Out[50]: 
   0  1         2        3
0  1  2       abc  foo bar
1  3  1  def haha  love it
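If the same cleanup is needed in several places, the .str-based idea can be wrapped in a small helper. This is only a sketch: clean_text_columns is a made-up name, it takes column labels rather than positions, and it returns a copy instead of modifying the input.

```python
import pandas as pd

def clean_text_columns(frame, columns):
    """Return a copy of frame with newlines replaced by spaces and
    surrounding whitespace stripped in the given columns."""
    out = frame.copy()
    for col in columns:
        # regex=False treats '\n' as a literal newline, not a pattern.
        out[col] = out[col].str.replace('\n', ' ', regex=False).str.strip()
    return out

lol = [[1, 2, 'abc', 'foo\nbar'], [3, 1, 'def\nhaha', 'love it\n']]
df = pd.DataFrame(lol)
cleaned = clean_text_columns(df, [2, 3])
print(cleaned)
```

Here the newline replacement runs before the strip, so a trailing '\n' first becomes a trailing space and is then removed.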


Source: https://stackoverflow.com/questions/46522652/replacing-newlines-with-spaces-for-str-columns-through-pandas-dataframe
