How to download any(!) webpage with the correct charset in Python?

醉酒成梦 2020-11-30 20:16

Problem

When screen-scraping a webpage with Python, one has to know the character encoding of the page. If you get the character encoding wrong th

7 Answers
  •  抹茶落季
    2020-11-30 20:50

    Instead of fetching a page and then trying to work out which charset the browser would use, why not just use a browser to fetch the page and check which charset it reports? (This relies on pywin32 and Internet Explorer, so it only works on Windows.)

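    import threading

    import pythoncom  # was missing: needed for PumpWaitingMessages()
    from win32com.client import DispatchWithEvents


    stopEvent = threading.Event()

    class EventHandler(object):
        # DispatchWithEvents requires a handler class; the event itself is ignored.
        def OnDownloadBegin(self):
            pass

    def waitUntilReady(ie):
        """
        copypasted from
        http://mail.python.org/pipermail/python-win32/2004-June/002040.html
        """
        # ReadyState 4 is READYSTATE_COMPLETE: the document has finished loading.
        if ie.ReadyState != 4:
            while True:
                print("waiting")
                pythoncom.PumpWaitingMessages()  # let COM deliver pending events
                stopEvent.wait(.2)
                if stopEvent.is_set() or ie.ReadyState == 4:
                    stopEvent.clear()
                    break

    ie = DispatchWithEvents("InternetExplorer.Application", EventHandler)
    ie.Visible = 0  # keep the browser window hidden
    ie.Navigate('http://kskky.info')
    waitUntilReady(ie)
    d = ie.Document
    print(d.CharSet)  # the charset the browser settled on
    ie.Quit()  # close the hidden browser instance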
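
Where Internet Explorer or pywin32 is not available, a rough approximation of the browser's decision can be sketched with the standard library alone. This is only a sketch under stated assumptions, not what a browser actually does: it checks the `charset` parameter of the HTTP `Content-Type` header first, then a `<meta>` declaration near the top of the body, and otherwise falls back to a default. It skips byte-order marks and content sniffing. The function names, the 4096-byte scan window and the `utf-8` fallback are all hypothetical choices made for illustration.

```python
import re
from email.message import Message

# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE
)


def charset_from_content_type(header_value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    if not header_value:
        return None
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def charset_from_meta(body, scan_bytes=4096):
    """Return the charset declared in a <meta> tag near the start of body, or None."""
    match = META_CHARSET_RE.search(body[:scan_bytes])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


def guess_charset(header_value, body, default="utf-8"):
    """Prefer the HTTP header, then the <meta> tag, then an assumed default."""
    return (
        charset_from_content_type(header_value)
        or charset_from_meta(body)
        or default
    )
```

With a response from `urllib.request.urlopen`, the header value would come from `response.headers.get("Content-Type")` and the body from `response.read()`. The text can then be decoded with `body.decode(guess_charset(header, body), errors="replace")`. Pages that declare the wrong charset, or none at all, will still be decoded incorrectly, which is the case the browser-based approach above is meant to handle.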
    
