How to decode and encode a web page with Python?

慢半拍i 2021-01-07 06:19

I use BeautifulSoup and urllib2 to download web pages, but different pages use different encodings, such as UTF-8, GB2312, and GBK. I used urllib2 to get Sohu's home page, w

3 answers
  •  滥情空心
    2021-01-07 06:33

    With BeautifulSoup you can parse the HTML and read the detected encoding from the original_encoding attribute:

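    import urllib2
    from bs4 import BeautifulSoup
    
    # Fetch the raw, still-encoded bytes of the page (Python 2 urllib2).
    html = urllib2.urlopen('http://www.sohu.com').read()
    # Name the parser explicitly; BeautifulSoup detects the encoding while parsing.
    soup = BeautifulSoup(html, 'html.parser')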
    
    >>> soup.original_encoding
    u'gbk'
    

    This agrees with the encoding declared in the <meta> tag in the HTML's <head>:

    
    
    >>> soup.meta['content']
    u'text/html; charset=GBK'
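    If you want the charset name on its own rather than the whole content string, one option is to let the standard library parse it. This is only a sketch, written for Python 3 (the rest of this answer is Python 2), and charset_from_content_type is a hypothetical helper name, not part of BeautifulSoup:

```python
from email.message import Message


def charset_from_content_type(value):
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    msg = Message()
    # Reuse the stdlib header parser instead of a hand-written regex.
    msg['Content-Type'] = value
    return msg.get_content_charset()


if __name__ == '__main__':
    print(charset_from_content_type('text/html; charset=GBK'))
```

    The same helper should also work on the Content-Type header of the HTTP response, which may or may not agree with the <meta> tag.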
    

    Now you can decode the HTML:

    decoded_html = html.decode(soup.original_encoding)
    

    But there is not much point, since BeautifulSoup has already decoded the HTML and gives you Unicode strings:

    >>> soup.a['title']
    u'\u641c\u72d0-\u4e2d\u56fd\u6700\u5927\u7684\u95e8\u6237\u7f51\u7ad9'
    >>> print soup.a['title']
    搜狐-中国最大的门户网站
    >>> soup.a.text
    u'\u641c\u72d0'
    >>> print soup.a.text
    搜狐
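    If what you need is the page re-encoded (for example, to save everything as UTF-8), decoding and encoding are two separate steps. A minimal sketch, assuming Python 3 and that you already know the source encoding; transcode is a made-up helper name:

```python
def transcode(raw, source_encoding, target_encoding='utf-8'):
    """Decode raw bytes from source_encoding, then encode them as target_encoding."""
    text = raw.decode(source_encoding)    # bytes -> str (Unicode)
    return text.encode(target_encoding)   # str -> bytes in the new encoding


if __name__ == '__main__':
    gbk_bytes = '搜狐'.encode('gbk')
    utf8_bytes = transcode(gbk_bytes, 'gbk')
    print(utf8_bytes.decode('utf-8'))
```

    Keep in mind that the re-encoded HTML still carries its old charset declaration, so you would probably want to update the <meta> tag before saving it.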
    

    You can also try to detect the encoding with the chardet module (although it is a bit slow). Here it reports GB2312, which GBK extends, so the guess is consistent with the declared charset:

    >>> import chardet
    >>> chardet.detect(html)
    {'confidence': 0.99, 'encoding': 'GB2312'}
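    One caveat with a detected name like GB2312: a page may contain characters that are only in GBK, and decoding strictly as GB2312 then fails. A possible workaround, sketched for Python 3, is to decode with a superset codec. The SUPERSETS table and the decode_page helper are my own assumptions, not something chardet or BeautifulSoup provides:

```python
import codecs

# Assumption: treat the legacy Chinese encodings as GB18030, which covers both.
SUPERSETS = {'gb2312': 'gb18030', 'gbk': 'gb18030'}


def decode_page(raw, detected, fallback='utf-8'):
    """Decode raw bytes using a detected encoding name, widened to a superset."""
    name = (detected or fallback).lower()
    name = SUPERSETS.get(name, name)
    try:
        codecs.lookup(name)          # make sure Python knows this codec
    except LookupError:
        name = fallback
    # Undecodable bytes become U+FFFD instead of raising an exception.
    return raw.decode(name, errors='replace')


if __name__ == '__main__':
    raw = '網'.encode('gbk')         # traditional character, not in GB2312
    print(decode_page(raw, 'GB2312'))
```

    You would pass chardet.detect(html)['encoding'] (which can be None) as the detected argument.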
    
