Question
So I've got an interesting problem that I need help with faster than I can get my skills with SQL Server up to par.
We have a table that contains a bunch of text, all of it in different languages. Most of this data appears correctly in the browser, however, anything in Chinese or Japanese gets completely mangled by the browser.
This is an ASP.old app that we're using to display the data which comes from a server running MS SQL Server 2005.
Before, we had this same problem and we solved it by changing the encoding in the ASP pages. These files have not changed since we did that, but the problem has resurfaced. Thus I must conclude that the problem rests with the database since that's the only thing that has been updated since we last fixed it.
So far I've been trying to look into collation, but I'm nowhere near an expert on SQL so it's been difficult.
I can provide more info if needed, anything that will help someone get me to the answer, short of URLs (confidentiality and all).
If anyone has any ideas, I would appreciate it very much.
ADDITIONAL INFO:
-column type is 'ntext'
Answer 1:
Collation only affects sort order, not encoding. You need to determine what the encoding of your Chinese and Japanese content is (see this). If it is not UCS-2, you have a problem (since you cannot support multiple page encodings concurrently). If it is UCS-2, you need to make sure that the encoding of your ASP page is also set to UTF-8 (and that the browser recognizes that by correctly setting the encoding to UTF-8 - see View/Encoding).
Or in simpler terms: if the application that created the content did not use Unicode characters, you will have to switch page encoding if you switch between Chinese, Japanese, and European characters.
If you have correctly encoded Unicode content in your database, and you use UTF-8 encoding on your pages, you should not have a problem with displaying any special characters (as long as you use a Unicode font on the page):
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
I realize that despite several edits I am not being very clear, so let me add some basics.
A character set is a standardized representation of a set of characters (e.g. ASCII, UNICODE, ...).
Character encoding is the binary representation used to store characters of a given character set. ASCII has its own encoding. Unicode, which is a very large character set designed to support all characters in existence, has several encodings (UTF-8, UTF-16, UCS-2, ...).
Only Unicode gives you the ability to support Western and Far Eastern content at the same time with the same database and application settings. There are, however, older character sets for the Chinese and Japanese languages that are not Unicode. If your content is not Unicode (Big5, for instance), you cannot display it on a UTF-8 encoded web page.
This can become tricky if the application that created the content used one encoding (e.g. Big5) and the database stored it as Unicode data. If this happens, information could have been lost.
You even have to install the corresponding language packs in Windows in order to see the characters correctly. Unfortunately, encoding issues are not simple to diagnose.
Answer 2:
There could be a couple of issues here, but since you say that you resolved this before, it may simply be a browser display issue. You should make sure that you have the encoding set correctly and language packs installed. You could check this on a couple of different computers and browsers to determine if it's a problem with a specific machine, browser, or a general issue.
Otherwise, are you using nvarchar or ntext fields in all of your database tables? If not, then you're losing the Chinese and Japanese characters at that level. Also, if you're using any stored procedures, functions, etc., you need to make sure that their variables are also nvarchar or ntext.
Finally, recheck that your ASP pages are preserving the encoding in all places. I'm not very familiar with ASP classic, so I will let someone else help with that.
Answer 3:
Do you have the following in your ASP files?
<%@codepage=65001%>
Session.CodePage = 65001
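If it helps, here is a minimal sketch of how those two settings might sit together at the top of a classic ASP page, along with the response charset and the meta tag from Answer 1. This is an illustration, not your actual page: rsText is a placeholder for a value you have already read from an nvarchar/ntext column.

<%@ LANGUAGE="VBSCRIPT" CODEPAGE=65001 %>
<%
' Treat submitted form data and Response.Write output as UTF-8 (code page 65001)
Session.CodePage = 65001
' Label the outgoing response as UTF-8 so the browser decodes it correctly
Response.CharSet = "utf-8"
Response.ContentType = "text/html"
%>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<%
' rsText is a placeholder for the nvarchar/ntext value read from the database
Response.Write rsText
%>
</body>
</html>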
Answer 4:
ntext has been deprecated in SQL Server 2005 (http://geekswithblogs.net/johnsPerfBlog/archive/2008/04/16/ntext-vs-nvarcharmax-in-sql-2005.aspx). Not sure if it helps, but you can try converting the ntext column to nvarchar(max).
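If you go that route, here is a hedged one-off sketch of the conversion, run through ADO from VBScript; the connection string, table and column names are made up, so substitute your own (you could equally run the two statements directly in Management Studio).

Dim conn
Set conn = CreateObject("ADODB.Connection")
conn.Open "Provider=SQLNCLI;Server=myServer;Database=myDb;Integrated Security=SSPI;"

' nvarchar(max) is the replacement for the deprecated ntext type
conn.Execute "ALTER TABLE dbo.MyTable ALTER COLUMN MyTextColumn nvarchar(max) NULL"

' Often recommended afterwards: rewrite the values so they are stored in the new format
conn.Execute "UPDATE dbo.MyTable SET MyTextColumn = MyTextColumn"

conn.Close
Set conn = Nothing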
Answer 5:
You said you cannot even read it from Management Studio. It is very important to check whether any data has already been lost.
In order to know how to restore it, you have to know how it became corrupted.
How were these words written to the database? Was any transcoding (including transcoding done implicitly by ASP) applied before they were written to the DB?
What is actually stored in the database right now? You can take the first two or three bytes of the "broken" words and compare their byte ranges against common charsets.
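If it helps, here is a rough VBScript/ADO version of that check; the connection string, table, column and WHERE clause are placeholders. It prints the UTF-16 code points of the first few characters of one "broken" value, which tells you whether genuine CJK code points or mojibake are actually stored.

Dim conn, rs, strValue, i, code
Set conn = CreateObject("ADODB.Connection")
conn.Open "Provider=SQLNCLI;Server=myServer;Database=myDb;Integrated Security=SSPI;"

' Pull one value that displays as garbage
Set rs = conn.Execute("SELECT TOP 1 MyTextColumn FROM dbo.MyTable WHERE Id = 42")
strValue = rs.Fields("MyTextColumn").Value

For i = 1 To Len(strValue)
    If i > 8 Then Exit For
    code = AscW(Mid(strValue, i, 1))
    If code < 0 Then code = code + 65536  ' AscW is signed; fold back into the 0..FFFF range
    ' Genuine Japanese/Chinese text shows code points around &H3040-&H30FF,
    ' &H4E00-&H9FFF or &HFF00-&HFFEF; mojibake mostly stays below &H0100
    Response.Write Hex(code) & " "
Next

rs.Close
conn.Close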
If the data came from a browser, you should check the encoding of the page that contains the form. Browsers use the page's encoding to encode and submit form data. If that charset/encoding does not match what the receiver (e.g. your ASP page) expects, the words may be decoded incorrectly.
Answer 6:
If you modified the database, then the most likely culprit is in the storage of the fields. You may pass the fields via a variable that is not ntext, but rather just text or varchar. That will kill the data going in, and then it will look wrong coming back on the web page.
What do you use to insert the data into the database?
Answer 7:
I suspect you have several problems.
There are actually several common ways to represent Japanese and Chinese text, using legacy encodings (Shift_JIS, EUC-JP, and JIS-variants for Japanese, and several others for Chinese) or Unicode (UTF-8 or UTF-16). For a multilingual application, the preferred solution is to transmit page content in UTF-8; Windows itself prefers to store content in UTF-16 (which is what NTEXT and NVARCHAR use in MS SQL Server).
In order to get Japanese content to display correctly, you need to make sure the proper conversions are happening at every stage in your data pipeline. Let's assume that you're going to use Unicode for the sake of sanity, but the answer would be similar if you intentionally elected to use Shift-JIS, big5, gb2312 or something, just more complicated.
If your data is primarily coming from web forms, you need to make sure that your codepage is set to 65001, usually using the <%@codepage=65001%> directive at the top of each ASP file.
Additionally, you need to provide a hint to the user agent (the web browser) that you are using UTF-8. There are two techniques: one involves sending an HTTP header; the other is to fake that header with a meta tag.
The meta tag solution:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
The HTTP header solution, using my rusty ASP skills (assuming JavaScript, but you're probably using VBScript, which would require you to drop the semicolons):
Response.ContentType = "text/html";
Response.Charset = "utf-8";
If you are taking data into MSSQL in feeds, rather than web forms, you'll also need to make sure the data is converted properly. Depending on your import mechanism, the method for specifying the source encoding is different, so I'll have to leave that as an "exercise for the reader."
Next, when submitting your data to SQL Server, you need to make sure you're using the correct SQL input mechanism. If you're not parameterizing your queries (and you should be), you need to remember to use the N'MyText' form rather than 'MyText' when putting text literals in your query. If you are parameterizing your text, then wherever you would use adVarChar you should use adVarWChar instead. (There are corresponding "W" types for each ADO data type.)
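A hedged sketch of what that looks like with ADO from VBScript; the connection string, table and column names are made up, and strJapaneseText is assumed to already hold the Unicode string you want to store.

Const adVarWChar = 202      ' use adLongVarWChar (203) for very long ntext/nvarchar(max) values
Const adParamInput = 1

Dim conn, cmd
Set conn = CreateObject("ADODB.Connection")
conn.Open "Provider=SQLNCLI;Server=myServer;Database=myDb;Integrated Security=SSPI;"

Set cmd = CreateObject("ADODB.Command")
Set cmd.ActiveConnection = conn
cmd.CommandText = "INSERT INTO dbo.MyTable (MyTextColumn) VALUES (?)"
cmd.Parameters.Append cmd.CreateParameter("@MyTextColumn", adVarWChar, adParamInput, 4000, strJapaneseText)
cmd.Execute

' If you build the SQL by hand instead, remember the N prefix on the literal:
' conn.Execute "INSERT INTO dbo.MyTable (MyTextColumn) VALUES (N'" & Replace(strJapaneseText, "'", "''") & "')"

conn.Close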
Additionally, some browsers use the HTML LANG attribute as a hint for displaying text in a suitable font for the content language. If you happen to know which language your content is in, you can add LANG="ja-jp" to any HTML element (including BODY). Then a reasonable default font for that language should be used by the browser (but you can explicitly specify one if you like). Most browsers made in the last 5 years do some font-linking magic even if you choose an inappropriate default font for a particular language, but you'll get more reliable results and slightly better rendering performance if you use an appropriate font.
As an additional note, If you're getting almost-correct results when manually forcing the encoding as shift-jis on the browser, that means that you are probably using windows-1252 as your charset <%@codepage=1252%> and that you're getting lucky that the content hasn't been messed up entirely. There are a couple of hacks that can restore hosed Shift-Jis-in-1252 or iso-8859-1, but they're not 100% reliable.
As for collation on SQL server, this has two impacts. On NVARCHAR and NTEXT fields, it only affects sorting and querying (including case, accent and kana-sensitivity). On varchar and text fields, it also affects the encoding, but it's not the most sensible solution to your problem.
Source: https://stackoverflow.com/questions/569845/japanese-chinese-language-data-in-sql-server-table