When screen-scraping a webpage using Python you have to know the character encoding of the page. If you get the character encoding wrong, the text you extract comes out garbled.
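For example, decoding UTF-8 bytes with a wrong codec produces the familiar mojibake (the snippet below is just an illustration, not part of any scraping library):

    # Bytes of a UTF-8 encoded page fragment
    data = "Привет".encode("utf-8")

    # Decoding with the right codec recovers the text
    print(data.decode("utf-8"))          # Привет

    # Decoding with a wrong (but common fallback) codec garbles it
    print(data.decode("windows-1252"))   # ÐŸÑ€Ð¸Ð²ÐµÑ‚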
Scrapy downloads a page and detects the correct encoding for it, unlike requests.get(url).text or urlopen. To do so it tries to follow browser-like rules; this is the best one can do, because website owners have an incentive to make their websites work in a browser. Scrapy has to take HTTP headers, &lt;meta&gt; tags, BOM marks and differences in encoding names into account.
Content-based guessing (chardet, UnicodeDammit) on its own is not a correct solution, as it may fail; it should only be used as a last resort, when headers and BOM marks are not available or provide no information.
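To see why it is only a fallback: chardet returns a guess together with a confidence value, and on short or ambiguous inputs the guess can be wrong (a small sketch; the exact output depends on the chardet version):

    import chardet

    sample = "Привет, мир".encode("windows-1251")
    result = chardet.detect(sample)
    # A dict like {'encoding': ..., 'confidence': ..., 'language': ...};
    # on short snippets the guessed encoding and confidence are unreliable.
    print(result)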
You don't have to use Scrapy to get its encoding detection functions; they are released (along with some other utilities) in a separate library called w3lib: https://github.com/scrapy/w3lib.
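w3lib.encoding also exposes helpers for the individual hint sources mentioned above. A rough sketch, assuming the helper names as I recall them (check the w3lib docs for the exact signatures):

    from w3lib.encoding import (
        http_content_type_encoding,
        html_body_declared_encoding,
        read_bom,
        resolve_encoding,
    )

    # Encoding declared in the Content-Type HTTP header
    print(http_content_type_encoding("text/html; charset=windows-1251"))

    # Encoding declared in <meta> tags inside the HTML body
    print(html_body_declared_encoding(b'<meta charset="utf-8"><p>...</p>'))

    # Byte order mark at the start of the body, if any
    print(read_bom(b"\xef\xbb\xbf<html>..."))

    # Browser-like normalization of encoding names
    print(resolve_encoding("latin-1"))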
To get the page encoding and the unicode body, use the w3lib.encoding.html_to_unicode function with a content-based guessing fallback:
    import chardet
    from w3lib.encoding import html_to_unicode

    def _guess_encoding(data):
        # Last-resort content-based guess, used only when headers,
        # <meta> tags and BOM marks give no answer
        return chardet.detect(data).get('encoding')

    # content_type_header is the value of the Content-Type HTTP header,
    # html_content_bytes is the raw (undecoded) response body
    detected_encoding, html_content_unicode = html_to_unicode(
        content_type_header,
        html_content_bytes,
        default_encoding='utf8',
        auto_detect_fun=_guess_encoding,
    )
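For completeness, here is a minimal sketch of feeding it data fetched with requests; the URL is just a placeholder:

    import requests

    response = requests.get("http://example.com")   # placeholder URL
    detected_encoding, html_content_unicode = html_to_unicode(
        response.headers.get("Content-Type"),
        response.content,                 # raw bytes, not response.text
        default_encoding="utf8",
        auto_detect_fun=_guess_encoding,
    )
    print(detected_encoding)

Note that response.content (the raw bytes) is passed in, not response.text, so that the decoding decision is left to html_to_unicode rather than to requests.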