How can I escape *all* characters into their corresponding html entity names and numbers in Python?

Submitted by 时光怂恿深爱的人放手 on 2021-01-27 22:11:03

Question


I want to encode a string into its corresponding HTML entities, but I have not been able to. As the title says, I want every character in the string converted into its corresponding HTML entity (both numbers and names). Following the documentation, I tried:

In [31]: import html

In [32]: s = '<img src=x onerror="javascript:alert("XSS")">'

In [33]: html.escape(s)
Out[33]: '&lt;img src=x onerror=&quot;javascript:alert(&quot;XSS&quot;)&quot;&gt;'

But I want every character converted, not just '<', '>', '&', etc. Also, html.escape only produces entity names, not numbers, and I want both.

Surprisingly, though, html.unescape converts all entities back into their corresponding characters.

In [34]: a = '<img src=x onerror="&#0000106&#0000097&#0000118&#0000097&#0000115&#0000099&#0000114&#0000105&#0000112&#0000116&#0000058&#0000097&#0000108&#0000101&#0000114&#0000116&#0000040&#0000039&#0000088&#0000083&#0000083&#0000039&#0000041">'

In [35]: html.unescape(a)
Out[35]: '<img src=x onerror="javascript:alert(\'XSS\')">' 

So can I do the same with html.escape?

I am really surprised that none of the resources on the internet for encoding and decoding HTML entities encode all characters; PHP's htmlspecialchars() function doesn't do that either. And I don't want to write out every HTML entity number by hand, character by character.


Answer 1:


You don't really need a special function for what you are doing because the numbers you want are just the Unicode code points of the characters in question.

ord does pretty much what you want:

 def encode(s):
     return ''.join('&#{:07d};'.format(ord(c)) for c in s)

Aesthetically, I prefer hex encoding:

 def encode(s):
     return ''.join('&#x{:06x};'.format(ord(c)) for c in s)
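As a quick sanity check, the sketch below round-trips both variants through html.unescape. The names encode_decimal and encode_hex are only illustrative labels for the two functions above, not anything defined by the standard library.

```python
import html

def encode_decimal(s):
    # Zero-padded decimal reference for every character, e.g. 'a' -> '&#0000097;'
    return ''.join('&#{:07d};'.format(ord(c)) for c in s)

def encode_hex(s):
    # Zero-padded hexadecimal reference, e.g. 'a' -> '&#x000061;'
    return ''.join('&#x{:06x};'.format(ord(c)) for c in s)

s = '<img src=x onerror="javascript:alert(\'XSS\')">'

# Decoding either form should give back the original string.
assert html.unescape(encode_decimal(s)) == s
assert html.unescape(encode_hex(s)) == s

print(encode_decimal('<a>'))  # &#0000060;&#0000097;&#0000062;
```

One caveat: html.unescape follows the HTML5 rules for invalid numeric references (NUL, for example, comes back as U+FFFD), so some control characters will not survive the round trip.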

What is special about html.escape and html.unescape is that they support named entities in addition to the numerical ones. The goal of escaping is normally to turn your string into something that doesn't have characters special to the HTML parser, so escape only replaces a handful of characters. What you are doing ensures that all characters in the string are ASCII in addition to that.
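To make the difference concrete, here is a small illustration of how little html.escape touches. The sample string is made up for the example.

```python
import html

s = 'a < b & "c"'

# By default, &, <, > and both quote characters are replaced.
escaped = html.escape(s)
print(escaped)                      # a &lt; b &amp; &quot;c&quot;

# With quote=False, quote characters are left alone.
unquoted = html.escape(s, quote=False)
print(unquoted)                     # a &lt; b &amp; "c"

# Everything else, including non-ASCII text, passes through unchanged.
print(html.escape('é'))             # é
```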

If you want to force the use of named entities wherever possible, you can check the html.entities.codepoint2name mapping after applying ord to the characters:

from html.entities import codepoint2name

def encode(s):
    return ''.join('&{};'.format(codepoint2name.get(i, '#{}'.format(i))) for i in map(ord, s))
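codepoint2name only covers the HTML 4.01 entity names, so characters such as ':' or '(' still come out as numbers. If you want names for those as well, one option is to fall back on the html.entities.html5 table. The following is a rough sketch under that assumption; encode_named and _html5_names are hypothetical names, and the rule for picking among several candidate names is an arbitrary choice.

```python
import html
from html.entities import codepoint2name, html5

# Reverse map: code point -> HTML5 entity name (trailing ';' included).
# html5 maps names such as 'colon;' to their replacement text; only
# semicolon-terminated names that stand for a single character are used.
_html5_names = {}
for name, text in html5.items():
    if name.endswith(';') and len(text) == 1:
        cp = ord(text)
        best = _html5_names.get(cp)
        # Keep the shortest name; ties are broken alphabetically (arbitrary).
        if best is None or (len(name), name) < (len(best), best):
            _html5_names[cp] = name

def encode_named(s):
    out = []
    for cp in map(ord, s):
        if cp in codepoint2name:
            # HTML 4.01 name, as in the function above
            out.append('&{};'.format(codepoint2name[cp]))
        elif cp in _html5_names:
            # HTML5-only name; it already ends with ';'
            out.append('&' + _html5_names[cp])
        else:
            # No name available: fall back to a decimal reference
            out.append('&#{};'.format(cp))
    return ''.join(out)

s = '<img src=x onerror="javascript:alert(\'XSS\')">'
encoded = encode_named(s)
print(encoded)
assert encoded.isascii()
assert html.unescape(encoded) == s
```

The round trip is only checked here for the sample string. A few HTML 4.01 names are defined differently in HTML5, so it is worth testing with your own data before relying on it.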


Source: https://stackoverflow.com/questions/55494644/how-can-i-escape-all-characters-into-their-corresponding-html-entity-names-and
