Normalize whitespace with Python

Submitted by 不问归期 on 2020-01-03 17:07:41

Question


I'm building a data extract using scrapy and want to normalize a raw string pulled out of an HTML document. Here's an example string:

  Sapphire RX460 OC  2/4GB

Notice the two runs of two spaces: one preceding the string and one between OC and 2.

Python provides trimming via str.strip(), as described in How do I trim whitespace with Python? But that only removes leading and trailing whitespace; it won't handle the two spaces between OC and 2, which I need collapsed into a single space.

I've tried using normalize-space() from XPath while extracting data with my scrapy Selector, and that works, but the assignment is verbose, with strong rightward drift:

product_title = product.css('h3').xpath('normalize-space((text()))').extract_first()

Is there an elegant way to normalize whitespace using Python? If not a one-liner, is there a way I can break the above line into something easier to read without throwing an indentation error, e.g.

product_title = product.css('h3')
    .xpath('normalize-space((text()))')
    .extract_first()

Answer 1:


You can use:

" ".join(s.split())

where s is your string.
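
As a minimal sketch of why this works: str.split() with no arguments splits on any run of whitespace (spaces, tabs, newlines) and discards leading and trailing whitespace, so joining the pieces with a single space yields the normalized string. The helper name normalize_whitespace is hypothetical, chosen here only for illustration.

```python
def normalize_whitespace(s):
    # split() with no arguments splits on runs of any whitespace and
    # drops empty leading/trailing pieces; join puts single spaces back.
    return " ".join(s.split())


raw = "  Sapphire RX460 OC  2/4GB"
print(normalize_whitespace(raw))  # Sapphire RX460 OC 2/4GB
```

Note that this also collapses tabs and newlines into single spaces, which may or may not be what you want for multi-line text.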




Answer 2:


Instead of using a regex for this, the join/split approach is more efficient. Compare the timings below (timeit runs each statement 1,000,000 times by default):

>>> import re, timeit
>>> timeit.Timer((lambda:' '.join(' Sapphire RX460 OC  2/4GB'.split()))).timeit()
0.7263979911804199

>>> def f():
...     # Note: the trailing .split() returns a list of words, not a string
...     return re.sub(" +", ' ', "  Sapphire RX460 OC  2/4GB").split()
...
>>> timeit.Timer(f).timeit()
4.163465976715088
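
The regex variant timed above ends in .split(), so it returns a list rather than a string. A like-for-like comparison could be sketched as follows; the function names and the reduced run count are assumptions for illustration, and the timings will differ by machine and Python version.

```python
import re
import timeit

RAW = "  Sapphire RX460 OC  2/4GB"


def with_split_join(s=RAW):
    return " ".join(s.split())


def with_regex(s=RAW):
    # strip() removes the single space left at each end after substitution
    return re.sub(r" +", " ", s).strip()


# Both variants should produce the same normalized string.
assert with_split_join() == with_regex() == "Sapphire RX460 OC 2/4GB"

# number is kept small here; timings vary by machine and Python version.
split_join_time = timeit.timeit(with_split_join, number=100_000)
regex_time = timeit.timeit(with_regex, number=100_000)
print(f"split/join: {split_join_time:.3f}s  regex: {regex_time:.3f}s")
```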



Answer 3:


You can use a function like the one below, which uses a regular expression to find runs of consecutive spaces and replace each with a single space:

import re

def clean_data(data):
    return re.sub(" {2,}", " ", data.strip())

product_title = clean_data(product.css('h3::text').extract_first())

You can then extend the clean_data function however you like.
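
One possible extension, as a sketch only: handle the case where extract_first() finds nothing and returns None, and collapse tabs and newlines as well as spaces. The \s+ pattern and the None guard are additions for illustration, not part of the answer above, and the raw string stands in for a real selector result.

```python
import re


def clean_data(data):
    # extract_first() returns None when nothing matches, so guard against it
    if data is None:
        return None
    # \s+ matches runs of any whitespace (spaces, tabs, newlines)
    return re.sub(r"\s+", " ", data).strip()


# Stand-in for product.css('h3::text').extract_first()
raw_title = "  Sapphire RX460 OC  2/4GB"
product_title = clean_data(raw_title)
print(product_title)  # Sapphire RX460 OC 2/4GB

# With a real selector, parentheses let the call span several lines:
# product_title = clean_data(
#     product.css('h3::text').extract_first()
# )
```

Python continues an expression across lines inside parentheses, which addresses the readability concern in the question without a backslash or an indentation error.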



Source: https://stackoverflow.com/questions/46501292/normalize-whitespace-with-python
