问题
In Norway we have 3 highly annoying characters, æøå, that create all sorts of problems. Since sql server 2008, Microsoft decided to not support codepage 65001. I have found a manageable solution to the problem of importing a UTF-8 file into sql server with OPENROWSET(BULK) and keep the æøå tokens.
I created a powershell script that uses StreamReader and StreamWriter to convert the file from UTF-8 to default encoding, ANSI.
$filename = "C:\Test\UTF8_file.txt"
$outfile = "C:\Test\ANSI_file.txt"
$reader = new-object System.IO.StreamReader($filename, [System.Text.Encoding]::GetEncoding(65001))
$stream = new-object System.IO.StreamWriter($outfile, $false, [System.Text.Encoding]::Default)
I strip the file of the first line, the header row, in the same process.
$i=1
while(($line = $reader.ReadLine()) -ne $null) {
if($i -gt 1) {
$stream.WriteLine($line)
}
$i++
}
$reader.Close()
$stream.Close()
Then I am able to use OPENROWSET to import the ANSI file into sql server and manipulating data while doing so. Using codepage 1252, which equals danish_norwegian collation.
insert into SomeDatabase.dbo.SomeTable
SELECT [companynumber]
, case [role] when 'Styreformann' then 'Styreleder' when 'Styrets leder' then 'Styreleder' else rolle end as 'role'
, case [representant] when 'Y' then '1' else '0' end as 'representant'
, left((RIGHT('0000'+ CONVERT(VARCHAR,postnr),5)),4) end as 'postnr'
, income*1000 as income
, null as person2id
FROM OPENROWSET( BULK 'C:\Test\ANSI_file.txt',
FORMATFILE = 'C:\Test\FormatBulkInsert_file.xml'
, CODEPAGE =1252
, ROWS_PER_BATCH = 50000
) as v
This method secured that norwegian tokens were displayed correctly. The format file looks like this:
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="CharTerm" TERMINATOR=';"' />
<FIELD ID="2" xsi:type="CharTerm" TERMINATOR='";"' />
<FIELD ID="3" xsi:type="CharTerm" TERMINATOR='";"' />
<FIELD ID="4" xsi:type="CharTerm" TERMINATOR='";' />
<FIELD ID="5" xsi:type="CharTerm" TERMINATOR=';' />
<FIELD ID="6" xsi:type="CharTerm" TERMINATOR='\n' />
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="companynumber" xsi:type="SQLINT"/>
<COLUMN SOURCE="2" NAME="role" xsi:type="SQLNVARCHAR"/>
<COLUMN SOURCE="3" NAME="representant" xsi:type="SQLBIT"/>
<COLUMN SOURCE="4" NAME="postnr" xsi:type="SQLNVARCHAR"/>
<COLUMN SOURCE="5" NAME="income" xsi:type="SQLDECIMAL"/>
<COLUMN SOURCE="6" NAME="person2id" xsi:type="SQLINT"/>
</ROW>
</BCPFORMAT>
Hope this is helpful to someone else, because I spent quite a lot of time googleing before I found a way to solve this issue.
回答1:
Convert into UTF16 instead. That is SQL Server's native NCHAR format, and allows full representation of Unicode values.
To make this work you will have to specify SQLNCHAR or SQLNVARCHAR in your format file, and also be aware of the caveat:
For a format file to work with a Unicode character data file, all the input fields must be Unicode text strings (that is, either fixed-size or character-terminated Unicode strings).
- http://msdn.microsoft.com/en-us/library/ms178129.aspx
An alternative is to load it as binary data and use the CONVERT function to convert it from VARBINARY to NVARCHAR (which is UTF-16) and then to the desired codepage as VARCHAR.
来源:https://stackoverflow.com/questions/15137558/how-to-handle-import-of-file-with-utf-8-encoding-codepage-65001-into-sql-ser