Returning a lower case ASCII string from a (possibly encoded) string fetched using urllib2 or BeautifulSoup

甜味超标 2020-12-06 23:24

I am fetching data from a web page using urllib2. The content of all the pages is in the English language so there is no issue of dealing with non-English text. The pages ar

3 answers

广开言路 (original poster)

2020-12-06 23:59
Case-insensitive string search is more complicated than simply searching in the lower-cased variant. For example, a German user would expect the search term Straße to match both STRASSE and Straße, but 'STRASSE'.lower() == 'strasse', which does not contain ß (and you can't simply replace a double s with ß - there's no ß in Trasse). Other languages (in particular Turkish) have similar complications.

If you're looking to support languages other than English, you should therefore use a library that can handle proper case folding (such as Matthew Barnett's regex module).
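To illustrate the STRASSE/Straße point, here is a minimal sketch. It assumes Python 3, where the built-in `str.casefold()` is available (the code in this answer targets Python 2, which lacks it). The helper names `caseless_equal` and `caseless_contains` are made up for this example.

```python
# Python 3 only: str.casefold() applies Unicode case folding,
# which maps the German sharp s to "ss".

def caseless_equal(a, b):
    """Compare two strings after Unicode case folding."""
    return a.casefold() == b.casefold()

def caseless_contains(needle, haystack):
    """Substring search after Unicode case folding."""
    return needle.casefold() in haystack.casefold()

# lower() keeps the sharp s, so the two spellings do not compare equal
print('STRASSE'.lower() == 'Straße'.lower())
# casefold() turns both spellings into 'strasse'
print(caseless_equal('STRASSE', 'Straße'))
```

Note that `casefold()` is locale-independent, so it probably won't cover the Turkish dotted/dotless i rules on its own.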

That being said, the way to extract the page's content is:
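```
import contextlib
import urllib2  # Python 2 only

def get_page_content(url):
  # closing() makes sure the connection is closed even if read() or decode() raises
  with contextlib.closing(urllib2.urlopen(url)) as uh:
    # Assumes the page is encoded as UTF-8
    content = uh.read().decode('utf-8')
  # You can call .lower() on the result, but that won't work in general
  return content
```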
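For the English-only pages described in the question, one possible way to get from the fetched bytes to a lower-case ASCII string is sketched below. This is an illustration under assumptions, not part of the original answer: it targets Python 3, the function name `to_lower_ascii` is made up, and UTF-8 is only a default guess for the page encoding.

```python
import unicodedata

def to_lower_ascii(raw, encoding='utf-8'):
    """Decode raw bytes and reduce them to a lower-case ASCII string.

    `encoding` is an assumption; use the page's declared charset if known.
    """
    # Undecodable bytes become U+FFFD instead of raising an error
    text = raw.decode(encoding, errors='replace')
    # NFKD splits accented letters into a base letter plus combining marks
    decomposed = unicodedata.normalize('NFKD', text)
    # Drop everything that has no ASCII representation (marks, U+FFFD, ...)
    ascii_bytes = decomposed.encode('ascii', errors='ignore')
    return ascii_bytes.decode('ascii').lower()

print(to_lower_ascii(b'Hello World'))
print(to_lower_ascii('Café'.encode('utf-8')))
```

Characters with no ASCII decomposition (ß, for example) are silently dropped, so this only suits text that is essentially ASCII to begin with.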