Decoding HTML entities with Python

野趣味 2020-12-04 16:46

I'm trying to decode HTML entities from NYTimes.com, and I cannot figure out what I am doing wrong.

Take for example:

"U.S. Adviser’


        
4 Answers
  •  挽巷 (original poster)
    2020-12-04 17:34

    Actually, what you have are not HTML entities. There are THREE varieties of those &.....; thingies -- for example, &#160; &#xa0; &nbsp; all mean U+00A0 NO-BREAK SPACE.

    &#160; (the type you have) is a "numeric character reference" (decimal).
    &#xa0; is a "numeric character reference" (hexadecimal).
    &nbsp; is an entity.
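
    As a quick check in Python 3 (not the Python 2.x code this answer links to), the standard library's html.unescape handles all three varieties. The sample string "Adviser&#8217;" is an assumption about what the raw headline markup looked like, since the question's example is cut off.

```python
import html

# Three spellings of the same character, U+00A0 NO-BREAK SPACE:
# decimal reference, hexadecimal reference, and named entity.
forms = ["&#160;", "&#xa0;", "&nbsp;"]
decoded = [html.unescape(s) for s in forms]

# Every form decodes to the single character U+00A0.
all_nbsp = all(ch == "\u00a0" for ch in decoded)

# Hypothetical fragment of the headline: &#8217; is a decimal reference
# to U+2019 RIGHT SINGLE QUOTATION MARK.
sample = html.unescape("Adviser&#8217;")
```

    html.unescape is available from Python 3.4 onward; on Python 2.x a hand-written replacement is needed.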

    Further reading: http://htmlhelp.com/reference/html40/entities/

    Here you will find code for Python 2.x that handles all three in one scan through the input: http://effbot.org/zone/re-sub.htm#unescape-html
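
    The one-scan idea can be sketched in Python 3 with re.sub and a callback. This is not the linked code, only a minimal illustration of the same technique; the names unescape, fixup and _REF are chosen here for the example.

```python
import re
from html.entities import name2codepoint

# One pattern covers all three varieties:
#   &#xHHHH;  hexadecimal numeric character reference
#   &#DDDD;   decimal numeric character reference
#   &name;    named entity
_REF = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|\w+);")


def unescape(text):
    """Replace character references and entities in a single pass."""

    def fixup(match):
        ref = match.group(1)
        try:
            if ref[:2].lower() == "#x":
                return chr(int(ref[2:], 16))
            if ref[0] == "#":
                return chr(int(ref[1:]))
            return chr(name2codepoint[ref])
        except (KeyError, ValueError, OverflowError):
            # Unknown name or out-of-range code point: leave the text as-is.
            return match.group(0)

    return _REF.sub(fixup, text)
```

    Because the callback returns the original match on failure, unrecognized sequences such as &bogus; pass through unchanged rather than raising an error.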
