Question
I'm trying to create a stored procedure in SQL Server 2016 that converts XML that was previously converted into varbinary back into XML, but I'm getting an "Illegal XML character" error during the conversion. I've found a workaround that seems to work, but I can't figure out why it works, which makes me uncomfortable.
The stored procedure takes data that was converted to binary in SSIS and inserted into a varbinary(MAX) column in a table and performs a simple
CAST(Column AS XML)
It worked fine for a long time, and I only began seeing an issue when the initial XML started containing an ® (registered trademark) symbol.
Now, when I attempt to convert the binary to XML I get this error
Msg 9420, Level 16, State 1, Line 23
XML parsing: line 1, character 7, illegal xml character
However, if I first convert the binary to varchar(MAX), then convert that to XML, it seems to work fine. I don't understand what is happening when I perform that intermediate CAST that is different than casting directly to XML. My main concern is that I don't want to add it in to account for this scenario and end up with unintended consequences.
Test code:
DECLARE @foo VARBINARY(MAX)
DECLARE @bar VARCHAR(MAX)
DECLARE @Nbar NVARCHAR(MAX)
--SELECT Varbinary
SET @foo = CAST( '<Test>®</Test>' AS VARBINARY(MAX))
SELECT @foo AsBinary
--select as binary as varchar
SET @bar = CAST(@foo AS VARCHAR(MAX))
SELECT @bar BinaryAsVarchar -- Correct string output
--select binary as nvarchar
SET @nbar = CAST(@foo AS NVARCHAR(MAX))
SELECT @nbar BinaryAsNvarchar -- Chinese characters
--select binary as XML
SELECT TRY_CAST(@foo AS XML) BinaryAsXML -- ILLEGAL XML character
-- SELECT CONVERT(xml, @foo) BinaryAsXML --ILLEGAL XML Character
--select BinaryAsVarcharAsXML
SELECT TRY_CAST(@bar AS XML) BinaryAsVarcharAsXML -- Correct Output
--select BinaryAsNVarcharAsXML
SELECT TRY_CAST(@nbar AS XML) BinaryAsNvarcharAsXML -- Chinese Characters
Answer 1:
There are several things to know:
- SQL Server is rather limited with character encodings. There is VARCHAR, which is 1-byte-encoded extended ASCII, and NVARCHAR, which is UCS-2 (almost the same as UTF-16).
- VARCHAR uses plain Latin for the first set of characters and a codepage mapping provided by the collation in use for the second set.
- VARCHAR is not UTF-8. UTF-8 works with VARCHAR as long as all characters are 1-byte encoded. But UTF-8 knows a lot of 2-byte-encoded (up to 4-byte-encoded) characters, which would break the internal storage of a VARCHAR string.
- NVARCHAR will work with almost any 2-byte-encoded character natively (that means with almost any existing character). But it is not exactly UTF-16 (UTF-16 encodes some characters in 4 bytes as surrogate pairs, which UCS-2 does not treat as a single character).
- XML is not stored as the XML string you see, but as a hierarchically organised physical table, based on NVARCHAR values.
- The natively stored XML is really fast, while any text-based storage will need a very expensive parse operation in advance (over and over...).
- Storing XML as a string is bad; storing XML as a VARCHAR string is even worse.
- Storing a VARCHAR-string XML as VARBINARY is a cumulation of things you should not do.
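To make the byte-level difference concrete, here is a small sketch that prints the bytes behind the registered trademark symbol in both string types. It assumes the database default collation uses code page 1252 (for example SQL_Latin1_General_CP1_CI_AS); the variable names are only illustrative, and under another code page the VARCHAR bytes may differ.

```sql
-- Assumption: database default collation uses code page 1252.
DECLARE @v VARCHAR(10)  = '®';   -- one byte per character
DECLARE @n NVARCHAR(10) = N'®';  -- two bytes per character, little-endian

SELECT CAST(@v AS VARBINARY(10)) AS VarcharBytes   -- 0xAE under code page 1252
      ,CAST(@n AS VARBINARY(10)) AS NvarcharBytes; -- 0xAE00
```

The single byte 0xAE only means ® to something that knows the code page; the binary value itself does not carry that information.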
Try this:
DECLARE @text1Byte VARCHAR(100)='<test>blah</test>';
DECLARE @text2Byte NVARCHAR(100)=N'<test>blah</test>';
SELECT CAST(@text1Byte AS VARBINARY(MAX)) AS text1Byte_Binary
,CAST(@text2Byte AS VARBINARY(MAX)) AS text2Byte_Binary
,CAST(@text1Byte AS XML) AS text1Byte_XML
,CAST(@text2Byte AS XML) AS text2Byte_XML
,CAST(CAST(@text1Byte AS VARBINARY(MAX)) AS XML) AS text1Byte_XML_via_Binary
,CAST(CAST(@text2Byte AS VARBINARY(MAX)) AS XML) AS text2Byte_XML_via_Binary
The only difference you'll see is the many zeros in 0x3C0074006500730074003E0062006C00610068003C002F0074006500730074003E00. This is due to the 2-byte encoding of NVARCHAR; every second byte is not needed in this sample. But if you needed Far East characters, the picture would be completely different.
The reason why it works: SQL Server is very smart. The cast from the variable to XML is rather easy, as the engine knows that the underlying variable is VARCHAR or NVARCHAR. But the last two casts are different: the engine has to examine the binary to see whether it is valid NVARCHAR, and will give it a second try with VARCHAR if that fails.
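One way to read the error position in the question: `<Test>` is six characters, so character 7 is the ® byte. Under code page 1252 that is the single byte 0xAE, which is not a valid UTF-8 sequence by itself, and an XML parser that finds no byte-order mark and no encoding declaration generally assumes UTF-8. If that is what happens here (an assumption, not something confirmed in this thread), then naming the encoding for the parser should avoid the error without an intermediate string cast. A minimal sketch, assuming the binary holds code page 1252 text and carries no XML declaration of its own:

```sql
-- Same test value as in the question (assumes a code page 1252 default collation).
DECLARE @foo VARBINARY(MAX) = CAST('<Test>®</Test>' AS VARBINARY(MAX));

-- Hypothetical workaround: prepend an XML declaration that names the encoding,
-- so the parser does not have to guess how to decode the bytes.
DECLARE @decl VARBINARY(MAX) =
    CAST('<?xml version="1.0" encoding="windows-1252"?>' AS VARBINARY(MAX));

SELECT CAST(@decl + @foo AS XML) AS BinaryWithDeclarationAsXML;
```

If the stored binary might already start with an XML declaration, prepending a second one would make the document invalid, so check for that before using this approach.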
Now try to add your registered trademark to the given example. Add it first to the second variable DECLARE @text2Byte NVARCHAR(100)=N'<test>blah®</test>'; and try to run this. Then add it to the first variable and try it again.
What you can try:
Cast your binary to varchar(max), then to nvarchar(max) and finally to xml.
,CAST(CAST(CAST(CAST(@text1Byte AS VARBINARY(MAX)) AS VARCHAR(MAX)) AS NVARCHAR(MAX)) AS XML) AS text1Byte_XML_via_Binary
This will work, but it won't be fast...
Source: https://stackoverflow.com/questions/53123042/tsql-illegal-xml-character-when-converting-varbinary-to-xml