Trying to get encoding from a webpage Python and BeautifulSoup

谁都会走 提交于 2019-12-22 00:46:44

问题


Im trying to retrieve the charset from a webpage(this will change all the time). At the moment Im using beautifulSoup to parse the page and then extract the charset from the header. This was working fine until I ran into a site that had.....

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

My code up until now and which was working with other pages is:

    def get_encoding(soup):
        encod = soup.meta.get('charset')
        if encod == None:
            encod = soup.meta.get('content-type')
            if encod == None:
                encod = soup.meta.get('content')
    return encod

Would anyone have a good idea about how to add to this code to retrieve the charset from the above example. Would tokenizing it and trying to retrieve the charset that way be an idea? and how would you go about it without having to change the whole function? Right now the above code is returning "text/html; charset=utf-8" which is causing a LookupError because this is an unknown encoding.

Thanks

The final code that I ended up using:

    def get_encoding(soup):
        encod = soup.meta.get('charset')
        if encod == None:
            encod = soup.meta.get('content-type')
            if encod == None:
                content = soup.meta.get('content')
                match = re.search('charset=(.*)', content)
                if match:
                    encod = match.group(1)
                else:
                    dic_of_possible_encodings = chardet.detect(unicode(soup))
                    encod = dic_of_possible_encodings['encoding'] 
    return encod

回答1:


import re
def get_encoding(soup):
    encod = soup.meta.get('charset')
    if encod == None:
        encod = soup.meta.get('content-type')
        if encod == None:
            content = soup.meta.get('content')
            match = re.search('charset=(.*)', content)
            if match:
                encod = match.group(1)
            else:
                raise ValueError('unable to find encoding')
    return encod


来源:https://stackoverflow.com/questions/18359007/trying-to-get-encoding-from-a-webpage-python-and-beautifulsoup

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!