Decoding HTML entities with Python

野趣味 2020-12-04 16:46

I'm trying to decode HTML entities from NYTimes.com and I cannot figure out what I am doing wrong.

Take for example:

"U.S. Adviser&#8217;

4 answers
  •  青春惊慌失措
    2020-12-04 17:36

    This does work:

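    # BeautifulSoup 3 API (the legacy `BeautifulSoup` package on Python 2)
    from BeautifulSoup import BeautifulStoneSoup
    s = "U.S. Adviser&#8217;s Blunt Memo on Iraq: Time &#8216;to Go Home&#8217;"
    # convertEntities tells the parser to replace HTML entities with Unicode characters
    decoded = BeautifulStoneSoup(s, convertEntities=BeautifulStoneSoup.HTML_ENTITIES)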
    

    If you want a byte string instead of a Unicode object, you'll need to encode it with an encoding that supports the characters being used; ISO-8859-1 doesn't:

    result = decoded.encode("UTF-8")
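
To see why ISO-8859-1 fails here, the following is a small sketch written for Python 3 (an assumption; the answer's own code targets Python 2). The variable `latin1_ok` is a name made up for the illustration. The curly quotes U+2018 and U+2019 have no ISO-8859-1 code point, so encoding raises `UnicodeEncodeError`, while UTF-8 can represent them.

```python
# Sketch, assuming Python 3: the decoded headline contains curly quotes
# (U+2018, U+2019), which ISO-8859-1 cannot represent.
decoded = "U.S. Adviser\u2019s Blunt Memo on Iraq: Time \u2018to Go Home\u2019"

try:
    decoded.encode("ISO-8859-1")
    latin1_ok = True   # hypothetical flag, only for this illustration
except UnicodeEncodeError:
    latin1_ok = False  # reached here: no ISO-8859-1 byte exists for U+2019

# UTF-8 covers every Unicode code point, so this succeeds.
result = decoded.encode("UTF-8")
```

Decoding `result` with UTF-8 again gives back the original text, so nothing is lost in the round trip.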
    

    It's unfortunate that you need an external module for something like this; simple HTML/XML entity decoding should be in the standard library, and not require me to use a library with meaningless class names like "BeautifulStoneSoup". (Class and function names should not be "creative", they should be meaningful.)
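
For readers on Python 3.4 or later, the standard library does cover this case through `html.unescape`. The snippet below is a minimal sketch under that assumption, not part of the answer's BeautifulSoup approach; the input string assumes the headline was written with numeric character references.

```python
import html

# Assumed input: the headline with numeric character references.
s = "U.S. Adviser&#8217;s Blunt Memo on Iraq: Time &#8216;to Go Home&#8217;"

# html.unescape replaces named and numeric character references
# with the corresponding Unicode characters.
decoded = html.unescape(s)
print(decoded)
```

No third-party module is needed for this route, and the result is an ordinary `str` that can be encoded to UTF-8 as usual.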
