How to download any(!) webpage with the correct charset in Python?

醉酒成梦 2020-11-30 20:16

Problem

When screen-scraping a webpage using Python, one has to know the character encoding of the page. If you get the character encoding wrong th

7 Answers
  •  悲&欢浪女
    2020-11-30 20:49

    Use the Universal Encoding Detector:

    >>> import chardet
    >>> from urllib.request import urlopen
    >>> raw = urlopen("http://google.cn/").read()  # raw bytes, not yet decoded
    >>> chardet.detect(raw)
    {'encoding': 'GB2312', 'confidence': 0.99}
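
Detection only gives a guess, so it helps to wrap it in a fallback chain. The following is a minimal sketch, not a definitive recipe: `decode_page` and `fetch_text` are hypothetical helper names, and the order of preference (charset declared in the HTTP header first, then chardet's guess, then UTF-8 with replacement characters) is an assumption about what usually works, not something the answer above prescribes.

```python
from typing import Optional
from urllib.request import urlopen

try:
    import chardet  # third-party package: pip install chardet
except ImportError:  # the sketch still runs without it, minus detection
    chardet = None


def decode_page(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode raw page bytes (hypothetical helper, see assumptions above)."""
    # 1. Try the charset the server declared, if any.
    if declared:
        try:
            return raw.decode(declared)
        except (LookupError, UnicodeDecodeError):
            pass  # unknown codec name or wrong declaration: keep going
    # 2. Fall back to chardet's statistical guess, if chardet is installed.
    if chardet is not None:
        guess = chardet.detect(raw)["encoding"]  # may be None
        if guess:
            try:
                return raw.decode(guess)
            except (LookupError, UnicodeDecodeError):
                pass
    # 3. Last resort: never raise, mark undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")


def fetch_text(url: str) -> str:
    """Download url and decode it (hypothetical helper, needs network)."""
    with urlopen(url) as resp:
        declared = resp.headers.get_content_charset()  # None if not sent
        return decode_page(resp.read(), declared)
```

Calling `fetch_text("http://google.cn/")` would then return a `str` rather than bytes. Because the last step decodes with `errors="replace"`, a wrong guess shows up as replacement characters in the text instead of an exception, which may or may not be what a scraper wants.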
    

    The other option would be to just use wget:

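      import subprocess
      # Let wget fetch the page; -q silences output, -O names the output file.
      subprocess.run(['wget', '-q', '-O', 'foo1.txt', 'http://foo.html'], check=True)
      # Read raw bytes: the encoding is still unknown at this point.
      with open('foo1.txt', 'rb') as f:
          s = f.read()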
    
