When screen-scraping a webpage using Python you have to know the character encoding of the page. If you get the character encoding wrong, the text you extract comes out garbled.
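For example, decoding UTF-8 bytes with a wrong codec produces the familiar mojibake (the snippet below is just an illustration, not part of any scraping library):

    # Bytes of a UTF-8 encoded page fragment
    data = "Привет".encode("utf-8")

    # Decoding with the right codec recovers the text
    print(data.decode("utf-8"))          # Привет

    # Decoding with a wrong (but common fallback) codec garbles it
    print(data.decode("windows-1252"))   # ÐŸÑ€Ð¸Ð²ÐµÑ‚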
Scrapy downloads a page and detects the correct encoding for it, unlike requests.get(url).text or urlopen. To do so it tries to follow browser-like rules; this is the best one can do, because website owners have an incentive to make their websites work in a browser. Scrapy has to take HTTP headers, &lt;meta&gt; tags, BOM marks and differences in encoding names into account.
Content-based guessing (chardet, UnicodeDammit) on its own is not a correct solution, as it may fail; it should only be used as a last resort, when headers and BOM marks are not available or provide no information.
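To see why it is only a fallback: chardet returns a guess together with a confidence value, and on short or ambiguous inputs the guess can be wrong (a small sketch; the exact output depends on the chardet version):

    import chardet

    sample = "Привет, мир".encode("windows-1251")
    result = chardet.detect(sample)
    # A dict like {'encoding': ..., 'confidence': ..., 'language': ...};
    # on short snippets the guessed encoding and confidence are unreliable.
    print(result)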
You don't have to use Scrapy to get its encoding detection functions; they are released (along with some other utilities) in a separate library called w3lib: https://github.com/scrapy/w3lib.
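w3lib.encoding also exposes helpers for the individual hint sources mentioned above. A rough sketch, assuming the helper names as I recall them (check the w3lib docs for the exact signatures):

    from w3lib.encoding import (
        http_content_type_encoding,
        html_body_declared_encoding,
        read_bom,
        resolve_encoding,
    )

    # Encoding declared in the Content-Type HTTP header
    print(http_content_type_encoding("text/html; charset=windows-1251"))

    # Encoding declared in <meta> tags inside the HTML body
    print(html_body_declared_encoding(b'<meta charset="utf-8"><p>...</p>'))

    # Byte order mark at the start of the body, if any
    print(read_bom(b"\xef\xbb\xbf<html>..."))

    # Browser-like normalization of encoding names
    print(resolve_encoding("latin-1"))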
To get the page encoding and the unicode body, use the w3lib.encoding.html_to_unicode function with a content-based guessing fallback:
    import chardet
    from w3lib.encoding import html_to_unicode

    def _guess_encoding(data):
        # Last-resort content-based guess, used only when headers,
        # <meta> tags and BOM marks give no answer
        return chardet.detect(data).get('encoding')

    # content_type_header is the value of the Content-Type HTTP header,
    # html_content_bytes is the raw (undecoded) response body
    detected_encoding, html_content_unicode = html_to_unicode(
        content_type_header,
        html_content_bytes,
        default_encoding='utf8',
        auto_detect_fun=_guess_encoding,
    )
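For completeness, here is a minimal sketch of feeding it data fetched with requests; the URL is just a placeholder:

    import requests

    response = requests.get("http://example.com")   # placeholder URL
    detected_encoding, html_content_unicode = html_to_unicode(
        response.headers.get("Content-Type"),
        response.content,                 # raw bytes, not response.text
        default_encoding="utf8",
        auto_detect_fun=_guess_encoding,
    )
    print(detected_encoding)

Note that response.content (the raw bytes) is passed in, not response.text, so that the decoding decision is left to html_to_unicode rather than to requests.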