Question
I have discovered that it is possible to insert invalid XML characters into a MarkLogic database. This only becomes apparent if I happen to extract, xdmp:quote then later xdmp:unquote an XML document, whereupon I get a message such as "Invalid character entity '14'".
The character got into the database via an XQuery-generated HTML form submission. I think the user pasted text in from Excel, which includes such hidden nasties.
Clearly I am going to need to check what is being input in future, but surely this is a bug that should be fixed. If the characters are illegal, why isn't MarkLogic stripping them out when saving data to the database?
Neil.
Answer 1:
MarkLogic uses a parsed representation for XML both in memory and when persisting an XML document. Invalid characters would cause parse failures, preventing MarkLogic from storing a document as XML.
However, MarkLogic can store an invalid serialization of XML as a text or binary document. The bytes may be invalid for XML, but they aren't invalid for text or binary.
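To illustrate the distinction outside MarkLogic, here is a minimal Python sketch (not MarkLogic code). It assumes the XML 1.0 definition of a legal character, and the helper names `strip_invalid_xml_chars` and `is_well_formed` are hypothetical. It shows a standard XML parser rejecting a reference to character 14, while the same content is perfectly acceptable as a plain string that can be cleaned before it is treated as XML.

```python
import re
import xml.etree.ElementTree as ET

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Remove characters that XML 1.0 does not allow (hypothetical helper)."""
    return _INVALID_XML_CHARS.sub("", text)


def is_well_formed(xml_text: str) -> bool:
    """Return True if the standard-library parser accepts the text as XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Text pasted from a spreadsheet might carry a hidden control character (0x0E = 14).
pasted = "cell one\x0ecell two"
cleaned = strip_invalid_xml_chars(pasted)

# A character reference to 14 is not well-formed XML, so a parser refuses it.
rejected = is_well_formed("<note>&#14;</note>")
accepted = is_well_formed("<note>" + cleaned + "</note>")
```

Stripping is only one possible policy. Rejecting the submission or replacing the character with a visible placeholder may suit some applications better, since silently dropping characters changes what the user submitted.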
Is it possible that the HTML form submission submits the documents as text or binary instead of as XML? What does xdmp:node-kind() report about the form submission and about the document when retrieved with fn:doc()?
Hoping that helps with the investigation,
Source: https://stackoverflow.com/questions/57434493/why-do-invalid-characters-get-into-marklogic-database