How to download any(!) webpage with correct charset in python?

醉酒成梦 2020-11-30 20:16

Problem

When screen-scraping a webpage with Python, you have to know the character encoding of the page. If you get the character encoding wrong, your output will be garbled.

7 Answers
  •  时光取名叫无心
    2020-11-30 20:55

    Scrapy downloads a page and detects the correct encoding for it, unlike requests.get(url).text or urlopen. To do so, it tries to follow browser-like rules; this is the best one can do, because website owners have an incentive to make their websites work in a browser. Scrapy has to take HTTP headers, <meta> tags, BOM marks, and differences in encoding names into account.

    Content-based guessing (chardet, UnicodeDammit) on its own is not a correct solution, as it may fail; it should only be used as a last resort, when headers or BOM marks are not available or provide no information.
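    For illustration, a minimal sketch of such content-based guessing with chardet (the sample string is made up, and the printed result is indicative only; detection can and does get this wrong):

    import chardet

    # a GB18030-encoded byte string (hypothetical sample text)
    data = '这是一段用来测试字符编码自动检测的较长中文句子。'.encode('gb18030')

    # chardet sniffs byte statistics and reports a confidence score;
    # a confident-looking wrong guess is possible, hence "last resort"
    print(chardet.detect(data))
    # e.g. {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}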

    You don't have to use Scrapy to get its encoding-detection functions; they are released (along with some other utilities) in a separate library called w3lib: https://github.com/scrapy/w3lib.
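    Those building blocks are also exposed individually in w3lib.encoding. A quick sketch of them (the outputs in the comments reflect w3lib's browser-style normalization and may vary slightly between versions):

    from w3lib.encoding import (
        http_content_type_encoding,
        html_body_declared_encoding,
        read_bom,
        resolve_encoding,
    )

    # encoding declared in the HTTP Content-Type header
    print(http_content_type_encoding('text/html; charset=ISO-8859-1'))  # cp1252

    # encoding declared inside the document (<meta> tag or XML declaration)
    print(html_body_declared_encoding('<meta charset="utf-8"><p>hi</p>'))  # utf-8

    # byte-order mark at the start of the raw bytes, if any
    print(read_bom(b'\xff\xfeh\x00i\x00'))  # ('utf-16-le', b'\xff\xfe')

    # browser-style resolution of encoding aliases
    print(resolve_encoding('latin1'))  # cp1252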

    To get the page encoding and a unicode body, use the w3lib.encoding.html_to_unicode function with a content-based guessing fallback:

    import chardet
    from urllib.request import urlopen
    from w3lib.encoding import html_to_unicode

    def _guess_encoding(data):
        # content-based fallback: let chardet sniff the raw bytes
        return chardet.detect(data).get('encoding')

    # fetch the raw body bytes and the Content-Type header (example URL)
    response = urlopen('http://example.com')
    content_type_header = response.headers.get('Content-Type')
    html_content_bytes = response.read()

    detected_encoding, html_content_unicode = html_to_unicode(
        content_type_header,
        html_content_bytes,
        default_encoding='utf8',
        auto_detect_fun=_guess_encoding,
    )
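
    Note that html_to_unicode broadly applies the priority order described above: an encoding from the HTTP header or a BOM takes precedence, then an in-document declaration; auto_detect_fun is consulted only when none of those yield anything, and default_encoding is the final fallback.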
    
