Returning a lower case ASCII string from a (possibly encoded) string fetched using urllib2 or BeautifulSoup

甜味超标 2020-12-06 23:24

I am fetching data from a web page using urllib2. The content of all the pages is in the English language so there is no issue of dealing with non-English text. The pages ar

3 Answers
  •  广开言路
    2020-12-06 23:59

    Case-insensitive string search is more complicated than simply searching in the lower-cased variant. For example, a German user would expect the search term Straße to match both STRASSE and Straße, but 'STRASSE'.lower() == 'strasse' (and you can't simply replace every double s with ß: there's no ß in Trasse). Other languages (Turkish in particular) have similar complications.

    If you're looking to support languages other than English, you should therefore use a library that can handle proper case folding (such as Matthew Barnett's regex module).
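
    To illustrate the difference between lower-casing and case folding, here is a minimal sketch. It assumes Python 3.3 or later, where str.casefold() exists (it is not available in the Python 2 used elsewhere in this answer), and the matches helper is a hypothetical name chosen for illustration. Note that casefold() is locale-independent, so it does not address the Turkish dotted/dotless i problem.

```python
# Python 3.3+ sketch: str.casefold() is not available in Python 2.
words = ["STRASSE", "Straße", "strasse"]


def matches(term, candidate):
    """Hypothetical helper: caseless comparison via full case folding."""
    return term.casefold() == candidate.casefold()


# lower() keeps the ß, so the spellings do not all compare equal
lowered = [w.lower() for w in words]

# casefold() expands ß to "ss", so all three spellings agree
folded = [w.casefold() for w in words]
```

    With these definitions, matches("Straße", "STRASSE") is true, while comparing the .lower() results of the same two strings is not.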

    That being said, the way to extract the page's content is:

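    import contextlib
    import urllib2  # Python 2 only

    def get_page_content(url):
        # closing() guarantees the response is closed, since urllib2
        # responses are not context managers themselves
        with contextlib.closing(urllib2.urlopen(url)) as uh:
            # Assumes the page is UTF-8 encoded
            content = uh.read().decode('utf-8')
        # You can call .lower() on the result, but that won't work in general
        return content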
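
    If the goal really is a lower-case, ASCII-only string, one possible approach is sketched below. This is Python 3 rather than the Python 2 above; the function name to_lower_ascii and the default encoding of UTF-8 are assumptions made for illustration. It strips accents by Unicode decomposition and silently drops any character that has no ASCII decomposition (ß, for example), so it is only reasonable for text that is essentially English.

```python
import unicodedata


def to_lower_ascii(raw, encoding="utf-8"):
    """Hypothetical helper: decode bytes, return a lower-cased ASCII-only str."""
    # Undecodable bytes become U+FFFD instead of raising an exception
    text = raw.decode(encoding, errors="replace")
    # NFKD splits accented letters into base letter + combining mark
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping non-ASCII code points discards the combining marks
    ascii_text = decomposed.encode("ascii", errors="ignore").decode("ascii")
    return ascii_text.lower()
```

    The bytes would come from the response's read() call, before any decoding; for instance, to_lower_ascii("Café MENU".encode("utf-8")) gives "cafe menu".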
    
