Why Do Invalid Characters Get Into MarkLogic Database?

Submitted by 烂漫一生 on 2019-12-11 16:39:02

Question


I have discovered that it is possible to insert invalid XML characters into a MarkLogic database. This only becomes apparent when I extract an XML document, serialize it with xdmp:quote, and later parse it again with xdmp:unquote, whereupon I get a message such as "Invalid character entity '14'".
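
The same failure mode can be reproduced outside MarkLogic. The sketch below uses Python's standard xml.etree.ElementTree module, not xdmp:quote or xdmp:unquote, and the helper names survives_round_trip and parses are made up for illustration. It only shows the general pattern (a character accepted when a tree is built in memory, then rejected when the serialized form is parsed again) and says nothing about MarkLogic's internals.

```python
import xml.etree.ElementTree as ET


def parses(xml_text: str) -> bool:
    """Return True if xml_text is accepted by the XML parser."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


def survives_round_trip(text: str) -> bool:
    """Build an element in memory, serialize it, then parse it back."""
    element = ET.Element("cell")
    element.text = text  # ElementTree does not validate characters here
    serialized = ET.tostring(element, encoding="unicode")
    return parses(serialized)


if __name__ == "__main__":
    print(survives_round_trip("plain text"))   # ordinary text round-trips
    print(survives_round_trip("pasted\x0e"))   # U+000E is not legal in XML 1.0
    print(parses("<cell>&#14;</cell>"))        # same character as a reference
```

Character 14 (U+000E) is rejected both as a raw character and as the reference &#14;, which matches the shape of the error message above.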

The character got into the database via an XQuery-generated HTML form submission. I think the user pasted text in from Excel, which includes such hidden nasties.

Clearly I am going to need to check what is being input in future, but surely this is a bug that should be fixed. If the characters are illegal, why isn't MarkLogic stripping them out when saving data to the database?
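
A minimal sketch of such an input check, written in Python rather than XQuery and assuming the form text is available as a plain string before it is stored. The function name strip_invalid_xml_chars is hypothetical; the character ranges are those of the Char production in the XML 1.0 specification. Whether to strip the characters silently or reject the submission is a design choice left open here.

```python
import re

# Characters allowed by the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# The pattern matches anything outside those ranges.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Remove characters that cannot appear in an XML 1.0 document."""
    return _INVALID_XML_CHARS.sub("", text)


if __name__ == "__main__":
    pasted = "Total\x0e: 42"  # U+000E hidden in pasted spreadsheet text
    print(repr(strip_invalid_xml_chars(pasted)))  # 'Total: 42'
```

Tab, line feed, and carriage return are kept, since XML 1.0 allows them; other control characters below U+0020 are removed.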

Neil.


Answer 1:


MarkLogic uses a parsed representation for XML both in memory and when persisting an XML document. Invalid characters would cause parse failures, preventing MarkLogic from storing a document as XML.

However, MarkLogic can store an invalid serialization of XML as a text or binary document. The bytes may be invalid for XML, but they aren't invalid for text or binary.

Is it possible that the HTML form submission submits the documents as text or binary instead of as XML? What does xdmp:node-kind() report about the form submission and about the document when retrieved with fn:doc()?

Hoping that helps with the investigation,



Source: https://stackoverflow.com/questions/57434493/why-do-invalid-characters-get-into-marklogic-database
