How to download any(!) webpage with the correct charset in Python?

醉酒成梦 2020-11-30 20:16

Problem

When screen-scraping a webpage using Python, one has to know the character encoding of the page. If you get the character encoding wrong th

7 Answers
  •  失恋的感觉
    2020-11-30 20:47

    When you download a file with urllib or urllib2, you can find out whether a charset header was transmitted:

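    import urllib2  # Python 2 only

    fp = urllib2.urlopen(request)  # request: a URL string or a urllib2.Request
    charset = fp.headers.getparam('charset')  # None if no charset was sent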
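
    On Python 3, urllib2 no longer exists; the headers object returned by urllib.request.urlopen exposes get_content_charset() instead. The following is a minimal sketch of that equivalent, not part of the original answer. The function names charset_from_content_type and fetch_with_declared_charset are made up for illustration.

```python
from email.message import Message
from urllib.request import urlopen


def charset_from_content_type(content_type):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset()


def fetch_with_declared_charset(url):
    """Download url and return (raw bytes, charset from the HTTP header or None)."""
    with urlopen(url) as response:
        data = response.read()
        # response.headers is an email.message.Message subclass
        charset = response.headers.get_content_charset()
    return data, charset
```

    For example, charset_from_content_type('text/html; charset=ISO-8859-1') gives 'iso-8859-1', and a header without a charset parameter gives None.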
    

    You can use BeautifulSoup to locate a meta element in the HTML:

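    import BeautifulSoup  # BeautifulSoup 3; the lambda skips meta tags without http-equiv

    soup = BeautifulSoup.BeautifulSoup(data)
    meta = soup.findAll('meta', {'http-equiv': lambda v: v is not None and v.lower() == 'content-type'})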
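
    If BeautifulSoup is not installed, the same lookup can be sketched with the standard library. This swaps in html.parser for BeautifulSoup and is only an illustration: MetaCharsetParser, charset_from_meta and the 4096-byte scan window are assumptions, and the sketch assumes the page uses an ASCII-compatible encoding so the meta tag is readable before the real charset is known.

```python
from email.message import Message
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Remember the first charset declared in a <meta> element."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != 'meta' or self.charset is not None:
            return
        attrs = dict(attrs)  # attribute names arrive lowercased
        if attrs.get('charset'):
            # HTML5 form: <meta charset="...">
            self.charset = attrs['charset'].strip().lower()
        elif (attrs.get('http-equiv') or '').lower() == 'content-type':
            # Older form: <meta http-equiv="Content-Type" content="...; charset=...">
            content = attrs.get('content')
            if content:
                msg = Message()
                msg['Content-Type'] = content
                self.charset = msg.get_content_charset()


def charset_from_meta(data, scan_bytes=4096):
    """Scan the first scan_bytes of raw page bytes for a declared charset."""
    parser = MetaCharsetParser()
    # latin-1 maps every byte to a character, so this decode cannot fail
    parser.feed(data[:scan_bytes].decode('latin-1'))
    return parser.charset
```

    The function returns None when no meta element declares a charset, so the caller can move on to another strategy.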
    

    If neither is available, browsers typically fall back to user configuration, combined with auto-detection. As rajax proposes, you could use the chardet module. If you have user configuration available telling you that the page should be Chinese (say), you may be able to do better.
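
    The fallback order described above (declared charset first, then auto-detection, then user configuration) might be sketched as follows in Python 3. This is an illustration under assumptions, not the answer's own code: pick_encoding, decode_page, the declared argument and the user_default argument are hypothetical names, and chardet is treated as an optional third-party dependency.

```python
import codecs

try:
    import chardet  # third-party; optional in this sketch
except ImportError:
    chardet = None


def pick_encoding(data, declared=(), user_default='utf-8'):
    """Choose an encoding: declared names first, then chardet, then the default."""
    for name in declared:
        if not name:
            continue
        try:
            return codecs.lookup(name).name  # skip names Python does not know
        except LookupError:
            continue
    if chardet is not None:
        guess = chardet.detect(data)['encoding']  # may be None
        if guess:
            try:
                return codecs.lookup(guess).name
            except LookupError:
                pass
    # Assumption: user_default stands in for the "user configuration" fallback
    return codecs.lookup(user_default).name


def decode_page(data, declared=(), user_default='utf-8'):
    """Return (text, encoding used); undecodable bytes are replaced, not fatal."""
    encoding = pick_encoding(data, declared, user_default)
    return data.decode(encoding, errors='replace'), encoding
```

    A caller would pass the HTTP header charset and the meta charset, in that order, as declared. If the user configuration says the page should be Chinese, user_default could be set to something like 'gb18030' instead of 'utf-8'.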
