Random text from /dev/random raises an error in lxml: All strings must be XML compatible: Unicode or ASCII, no NULL bytes

Anonymous (unverified), submitted 2019-12-03 09:05:37

Question:

To test my web app, I am pasting random characters from /dev/random into my web frontend. The last line of the following code throws an error:

    print repr(comment)
    import html5lib
    print html5lib.parse(comment, treebuilder="lxml")

    'a\xef\xbf\xbd\xef\xbf\xbd\xc9\xb6E\xef\xbf\xbd\xef\xbf\xbd`\xef\xbf\xbd]\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd2 \x14\xef\xbf\xbd\xc7\xbe\xef\xbf\xbdy\xcb\x9c\xef\xbf\xbdi1O\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbdZ\xef\xbf\xbd.\xef\xbf\xbd\x17^C'

    Unhandled Error
    Traceback (most recent call last):
      File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 893, in _inlineCallbacks
        result = g.send(result)
      File "/home/work/random/social/social/item.py", line 389, in _new
        convId, conv = yield plugin.create(request)
      File "/home/work/random/social/social/logging.py", line 47, in wrapper
        ret = func(*args, **kwargs)
      File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 1014, in unwindGenerator
        return _inlineCallbacks(None, f(*args, **kwargs), Deferred())
    --- <exception caught here> ---
      File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 893, in _inlineCallbacks
        result = g.send(result)
      File "/home/work/random/social/twisted/plugins/status.py", line 63, in create
        print html5lib.parse(comment, treebuilder="lxml")
      File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 38, in parse
        return p.parse(doc, encoding=encoding)
      File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 211, in parse
        parseMeta=parseMeta, useChardet=useChardet)
      File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 111, in _parse
        self.mainLoop()
      File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 174, in mainLoop
        self.phase.processCharacters(token)
      File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 572, in processCharacters
        self.parser.phase.processCharacters(token)
      File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 611, in processCharacters
        self.parser.phase.processCharacters(token)
      File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 652, in processCharacters
        self.parser.phase.processCharacters(token)
      File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 711, in processCharacters
        self.parser.phase.processCharacters(token)
      File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 804, in processCharacters
        self.parser.phase.processCharacters(token)
      File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/html5parser.py", line 948, in processCharacters
        self.tree.insertText(token["data"])
      File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/treebuilders/_base.py", line 288, in insertText
        parent.insertText(data)
      File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/treebuilders/etree_lxml.py", line 225, in insertText
        builder.Element.insertText(self, data, insertBefore)
      File "/usr/local/lib/python2.6/dist-packages/html5lib-0.90-py2.6.egg/html5lib/treebuilders/etree.py", line 114, in insertText
        self._element.text += data
      File "lxml.etree.pyx", line 821, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:33308)
      File "apihelpers.pxi", line 646, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:15287)
      File "apihelpers.pxi", line 1295, in lxml.etree._utf8 (src/lxml/lxml.etree.c:20212)
    exceptions.ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes

Before committing a user-entered string, I do this:

comment.decode('utf-8').encode('utf-8', "replace")

but this does not seem to help in this case.

-- Abhi

Answer 1:

The problem is that text in XML cannot contain certain characters, mainly control characters with code points below 32 (other than tab, line feed, and carriage return). The XML 1.0 Recommendation defines a Char as

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

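/dev/random can produce bytes that decode to characters outside this set, e.g. control characters and some multi-byte characters. In the string shown in the question, \x14 and \x17 are such control characters.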
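
To see which characters in a given string fall outside the Char production, something like the following could be used. This is only a diagnostic sketch in Python 3 (the question uses Python 2); the names is_xml_char and find_invalid_xml_chars are made up for illustration, and the sample is an abbreviated excerpt of the byte string from the question.

```python
def is_xml_char(ch):
    """Return True if ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_invalid_xml_chars(text):
    """List (index, hex code point) for every character outside Char."""
    return [(i, hex(ord(ch))) for i, ch in enumerate(text) if not is_xml_char(ch)]


# Abbreviated excerpt of the bytes printed in the question (an assumption:
# the full string behaves the same way, only with more U+FFFD characters).
sample = b"a\xef\xbf\xbd2 \x14\xef\xbf\xbd.\xef\xbf\xbd\x17^C"
print(find_invalid_xml_chars(sample.decode("utf-8")))
# [(4, '0x14'), (8, '0x17')]
```

Note that the \xef\xbf\xbd sequences decode to U+FFFD, which is inside the allowed range; only the two control characters are reported.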

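So you have to filter these characters out before passing the text to the parser. The decode/encode round trip in the question does not remove them, because bytes such as \x14 are valid UTF-8 and pass through unchanged.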
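
One way to do that filtering is a regular expression built from the complement of the Char production. The following is a minimal sketch, not the only approach: it assumes Python 3 (the question uses Python 2, where the byte/unicode handling would differ), and strip_invalid_xml_chars is a hypothetical helper name, not part of lxml or html5lib.

```python
import re

# Matches any code point NOT allowed by the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(data, replacement=""):
    """Return data as text with every non-XML character replaced.

    Byte input is decoded as UTF-8 first; undecodable bytes become U+FFFD,
    which is itself a legal XML character.
    """
    if isinstance(data, bytes):
        data = data.decode("utf-8", "replace")
    return _INVALID_XML_CHARS.sub(replacement, data)


cleaned = strip_invalid_xml_chars(b"2 \x14\xef\xbf\xbd.\x17^C")
print(repr(cleaned))
```

The cleaned text can then be handed to html5lib.parse(cleaned, treebuilder="lxml") as in the question. Whether to drop the offending characters or substitute something visible such as "?" is a design choice; the replacement parameter is there for that.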
