How to decode and encode a web page with Python?

慢半拍i 2021-01-07 06:19

I use BeautifulSoup and urllib2 to download web pages, but different web pages use different encodings, such as utf-8, gb2312, and gbk. I used urllib2 to get sohu's home page, w

3 answers
  •  粉色の甜心
    2021-01-07 06:45

    I know this is an old question, but I spent a while today puzzling over a particularly problematic website so I thought I'd share the solution that worked for me, which I got from here: http://shunchiubc.blogspot.com/2016/08/python-to-scrape-chinese-websites.html

    Requests can detect a page's likely encoding from its content, so you don't have to wrestle with encoding and decoding by hand (before I found this, I was getting all sorts of errors trying to encode/decode strings and bytes, and never got any readable output). The detected value is exposed as the apparent_encoding attribute of the response; it is a guess based on the bytes of the body, not the charset declared in the HTTP headers. Here's how it worked for me:

    from bs4 import BeautifulSoup
    import requests

    url = 'http://url_youre_using_here.html'
    response = requests.get(url)
    # Use the encoding detected from the body before reading .text,
    # so BeautifulSoup receives correctly decoded text
    response.encoding = response.apparent_encoding
    soup = BeautifulSoup(response.text, "lxml")  # the "lxml" parser requires the lxml package
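
If Requests is not an option (the question uses urllib2, which returns raw bytes), a similar effect can be approximated by hand. The following is only a minimal sketch: decode_page is a hypothetical helper, not part of Requests or BeautifulSoup, and the candidate list ("utf-8", then "gb18030", which covers gb2312 and gbk) is an assumption suited to Chinese pages rather than a general detector.

```python
from typing import Optional, Sequence


# Hypothetical helper for illustration; not an API of requests or bs4.
def decode_page(raw: bytes, declared: Optional[str] = None,
                candidates: Sequence[str] = ("utf-8", "gb18030")) -> str:
    """Decode raw page bytes, trying the declared charset first."""
    tried = ([declared] if declared else []) + list(candidates)
    for name in tried:
        try:
            # Strict decoding raises on bytes that are invalid for this codec.
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess, or a codec name Python does not know; try the next.
            continue
    # Last resort: decode as UTF-8 and mark undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")


sample = "搜狐首页"
gbk_bytes = sample.encode("gbk")            # stand-in for a downloaded gbk page
print(decode_page(gbk_bytes))               # falls through utf-8 to gb18030
print(decode_page(gbk_bytes, "gb2312"))     # declared charset is tried first
```

The returned string can be handed to BeautifulSoup the same way as response.text above. Trying utf-8 before gb18030 matters: strict UTF-8 decoding rejects most gbk byte sequences, whereas gb18030 accepts many byte sequences that are not really Chinese text, so it works better as the later fallback.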
    
