Unwanted replacement of html entities by BeautifulSoup

Submitted by 末鹿安然 on 2019-12-13 17:15:50

Question


I have some HTML containing MathML that I am generating from Word documents using MathType. I have a Python script that uses BeautifulSoup to prettify it, but the problem is that it takes an entity such as &ang; and turns it into the actual byte sequence 0xE2 0x88 0xA0, which is the UTF-8 encoding of the ∠ symbol. This is a problem because 0xE2 0x88 0xA0 doesn't display as ∠ in the browser. Instead, the browser interprets it as a series of Latin characters. The same thing happens with all the math entities, such as &Delta;, &ang;, &minus;, &plus;, etc.
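To make the symptom concrete, here is a small standard-library sketch of what happens to the angle entity `&ang;`. It assumes the browser ends up reading the page as Windows-1252; the question does not say which encoding the browser actually picked, so that part is a guess.

```python
import html

# Decoding the entity yields the single character U+2220 (the angle sign).
angle = html.unescape("&ang;")

# UTF-8 represents that one character as three bytes.
utf8_bytes = angle.encode("utf-8")
print(utf8_bytes.hex(" "))  # e2 88 a0

# Assumption: the page is interpreted as Windows-1252 instead of UTF-8,
# so each byte is shown as a separate Latin character.
mojibake = utf8_bytes.decode("cp1252")
print(repr(mojibake))
```

If this is what is going on, declaring the page's encoding as UTF-8 would likely also make the raw characters display correctly, independently of how BeautifulSoup writes them.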

I looked through the BeautifulSoup documentation and I can see how to turn entities into byte sequences, but I'm not using that command; all I'm using is prettify(). I didn't see a way in the documentation to stop BeautifulSoup from turning entities into byte sequences.

Does anyone know if there's a setting in BeautifulSoup to tell it not to change entities to byte sequences? I hope so because it seems kind of dumb to have to undo the damage after prettify runs :)

Thanks in advance for your help!


Answer 1:


I missed part of the BeautifulSoup documentation. The default output formatter is what causes the behaviour described above: it writes HTML entities out as the corresponding Unicode characters. So this behaviour can be changed by using a different output formatter. (D'oh)

"You can change this behavior by providing a value for the formatter argument to prettify(), encode(), or decode()...."

So if I pass in formatter="html", Beautiful Soup will convert Unicode characters to HTML entities whenever possible! Yay! Thank you, Beautiful Soup!
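A minimal sketch of the difference, assuming Beautiful Soup 4 (the `bs4` package) with the built-in `html.parser`; the sample markup is made up for illustration and is not taken from the MathType output.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup containing two named math entities.
markup = "<p>&ang;ABC = &Delta;x</p>"
soup = BeautifulSoup(markup, "html.parser")

# Default formatter: entities come out as raw Unicode characters.
default_output = soup.prettify()
print(default_output)

# formatter="html": Unicode characters are written back as named
# HTML entities wherever a name exists.
entity_output = soup.prettify(formatter="html")
print(entity_output)
```

The same `formatter` argument is accepted by `encode()` and `decode()`, so the choice is not tied to `prettify()`.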

(And they have such great documentation. Pity I didn't read the whole thing sooner. :$)



Source: https://stackoverflow.com/questions/15840158/unwanted-replacement-of-html-entities-by-beautifulsoup
