I'm trying to decode HTML entities from NYTimes.com, and I cannot figure out what I am doing wrong.
Take for example:
"U.S. Adviser&#8217;
Actually, what you have are not HTML entities. There are THREE varieties of those &...; thingies. For example, these three:
&#160; &#xa0; &nbsp;
all mean U+00A0 NO-BREAK SPACE.
&#160; (the type you have) is a "numeric character reference" (decimal).
&#xa0; is a "numeric character reference" (hexadecimal).
&nbsp; is an entity.
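To see the three spellings side by side, here is a small check, assuming Python 3.4 or later, where the standard library provides html.unescape (this is not the Python 2.x code linked in this answer). The names forms, decoded, and headline are just illustrative.

```python
import html

# Three spellings of the same character, U+00A0 NO-BREAK SPACE:
# decimal numeric reference, hexadecimal numeric reference, named entity.
forms = ["&#160;", "&#xa0;", "&nbsp;"]

# html.unescape handles all three varieties.
decoded = [html.unescape(s) for s in forms]

for source, char in zip(forms, decoded):
    # Show each source form next to the code point it decodes to.
    print(source, "->", "U+%04X" % ord(char))

# The decimal reference from the question decodes the same way.
headline = html.unescape("U.S. Adviser&#8217;")
```

All three entries in decoded compare equal, which is the point: the spelling differs, the character does not.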
Further reading: http://htmlhelp.com/reference/html40/entities/
Here you will find code for Python 2.x that does all three in one scan through the input: http://effbot.org/zone/re-sub.htm#unescape-html
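For readers on Python 3, the same one-scan idea can be sketched roughly as below. This is my own minimal sketch, not the code from the link: the function name unescape_refs and the pattern name _REF are made up for illustration, and it only knows the named entities listed in html.entities.name2codepoint. Anything it cannot decode is left untouched.

```python
import re
from html.entities import name2codepoint

# One pattern covering all three varieties:
# hexadecimal reference, decimal reference, named entity.
_REF = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|\w+);")


def unescape_refs(text):
    """Decode character references and named entities in a single pass."""

    def replace(match):
        body = match.group(1)
        try:
            if body[:2] in ("#x", "#X"):
                # Hexadecimal numeric character reference, e.g. &#xa0;
                return chr(int(body[2:], 16))
            if body.startswith("#"):
                # Decimal numeric character reference, e.g. &#160;
                return chr(int(body[1:]))
            # Named entity, e.g. &nbsp;
            return chr(name2codepoint[body])
        except (KeyError, ValueError, OverflowError):
            # Unknown name or out-of-range code point: leave as is.
            return match.group(0)

    return _REF.sub(replace, text)
```

Because re.sub never rescans its own replacements, a string like &amp;lt; comes out as &lt; rather than being decoded twice. In practice html.unescape from the standard library is the simpler choice; the sketch is only meant to show how the three cases fit into one regular expression.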