Kanji characters from WebClient html different from actual Kanji in website

Submitted by 天大地大妈咪最大 on 2019-11-27 15:52:22

The page you're trying to download as a string is encoded using charset=EUC-JP, also known as Japanese (EUC) (CodePage 51932). This is clearly set in the page headers.

Why is the string returned by WebClient.DownloadString encoded using the wrong encoder?

The MSDN Docs state this:

This method retrieves the specified resource. After it downloads the resource, the method uses the encoding specified in the Encoding property to convert the resource to a String.

Thus, you have to know beforehand what encoding will be used and specify it, setting the WebClient.Encoding property.
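As a minimal sketch of that fix, assuming the page encoding is known in advance: the helper name DownloadWithEncoding is hypothetical, and on .NET Core / .NET 5+ a code-page encoding such as euc-jp is only available after CodePagesEncodingProvider.Instance has been registered.

```csharp
using System;
using System.Net;
using System.Text;

public static class KnownEncodingDownload
{
    // Hypothetical helper: downloads a resource whose encoding is already known.
    public static string DownloadWithEncoding(Uri address, Encoding pageEncoding)
    {
        using (WebClient client = new WebClient())
        {
            // Must be set before DownloadString, which uses it to decode the bytes
            client.Encoding = pageEncoding;
            return client.DownloadString(address);
        }
    }
}
```

For the page discussed here, the call would be DownloadWithEncoding(uri, Encoding.GetEncoding("euc-jp")).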

To verify this, check the .NET Reference Source for the WebClient.DownloadString method:

try {
    WebRequest request;
    byte [] data = DownloadDataInternal(address, out request);
    string stringData = GetStringUsingEncoding(request, data);
    if(Logging.On)Logging.Exit(Logging.Web, this, "DownloadString", stringData);
    return stringData;
} finally {
    CompleteWebClientState();
}

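The encoding is chosen from the request settings, not the response ones: the charset declared by the server is never consulted.
As a result, the downloaded bytes are decoded with WebClient.Encoding, which defaults to Encoding.Default (the system's default code page).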
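The effect can be reproduced without any network access. The sketch below decodes bytes with the wrong encoding and then reverses the mistake. The class and method names are hypothetical, and which encodings to pass is left to the caller; ISO-8859-1 is a convenient stand-in for a single-byte default code page because it maps every byte to a distinct character.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Simulates the problem: bytes written in pageEncoding are decoded with wrongEncoding.
    public static string Garble(string text, Encoding pageEncoding, Encoding wrongEncoding)
    {
        byte[] raw = pageEncoding.GetBytes(text);
        return wrongEncoding.GetString(raw);
    }

    // Reverses the mistake: recover the raw bytes, then decode them with the page encoding.
    // Only lossless if wrongEncoding maps every byte to a distinct character.
    public static string Repair(string garbled, Encoding pageEncoding, Encoding wrongEncoding)
    {
        byte[] raw = wrongEncoding.GetBytes(garbled);
        return pageEncoding.GetString(raw);
    }
}
```

If the wrong encoding is a multi-byte one that replaces invalid sequences (UTF-8, for example), the original bytes are already lost after the first decoding and Repair cannot bring them back.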

What you can do now is:
- Download the page twice: the first time to check whether the WebClient encoding and the HTML page encoding match, the second time with WebClient.Encoding set accordingly.
- Re-encode the string with the correct encoding.
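Downloading twice can be avoided if the raw bytes are fetched once with WebClient.DownloadData and decoded using the charset from the Content-Type response header. The following is only a sketch of that idea, not the method used in the rest of this answer. The helper name and the fallback parameter are hypothetical, and it assumes the server declares the charset in the HTTP header rather than only in a meta tag.

```csharp
using System;
using System.Net;
using System.Net.Mime;
using System.Text;

public static class RawDownload
{
    // Hypothetical helper: decodes the response bytes with the charset the server declared,
    // or with the supplied fallback when no charset is present.
    public static string DownloadWithResponseCharset(Uri address, Encoding fallback)
    {
        using (WebClient client = new WebClient())
        {
            // Raw bytes: no decoding has happened yet
            byte[] data = client.DownloadData(address);

            Encoding encoding = fallback;
            WebHeaderCollection headers = client.ResponseHeaders;
            string header = headers != null ? headers["Content-Type"] : null;
            if (!string.IsNullOrEmpty(header))
            {
                // ContentType parses values such as "text/html; charset=EUC-JP";
                // it throws FormatException on a malformed header
                string charset = new ContentType(header).CharSet;
                if (!string.IsNullOrEmpty(charset))
                {
                    // Throws ArgumentException if the charset name is not recognized
                    encoding = Encoding.GetEncoding(charset);
                }
            }
            return encoding.GetString(data);
        }
    }
}
```

Because the bytes are decoded only once, this variant does not depend on the default code page being able to round-trip them.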

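The following method performs the second task (re-encoding):
The string returned by WebClient is converted back to a byte array using the same WebClient.Encoding, passed to a MemoryStream, then decoded again by a StreamReader using the encoding named in the charset of the Content-Type response header. The round trip recovers the original bytes only if the first decoding lost nothing, which typically holds for a single-byte default code page but not for one that replaces invalid byte sequences, such as UTF-8.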

EDIT:
Now using Reflection to get the page Encoding from the underlying HttpWebResponse. This should avoid errors in parsing the original CharacterSet as defined by the remote response.

using System.IO;
using System.Net;
using System.Reflection;
using System.Text;

public string WebClient_DownLoadString(Uri URI)
{
    using (WebClient webclient = new WebClient())
    {
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;

        webclient.CachePolicy = new System.Net.Cache.RequestCachePolicy(System.Net.Cache.RequestCacheLevel.BypassCache);
        webclient.Headers.Add(HttpRequestHeader.Accept, "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        webclient.Headers.Add(HttpRequestHeader.AcceptLanguage, "en-US,en;q=0.8");
        webclient.Headers.Add(HttpRequestHeader.KeepAlive, "keep-alive");

        // Decoded with webclient.Encoding, which may not match the page charset
        string result = webclient.DownloadString(URI);

        // m_WebResponse is a private field: an implementation detail that may not exist in other framework versions
        using (HttpWebResponse wc_response = (HttpWebResponse)webclient
                        .GetType()
                        .GetField("m_WebResponse", BindingFlags.Instance | BindingFlags.NonPublic)
                        .GetValue(webclient))
        {
            // Encoding declared by the server in the Content-Type charset
            Encoding PageEncoding = Encoding.GetEncoding(wc_response.CharacterSet);
            // Undo the first decoding to get the raw bytes back
            byte[] bresult = webclient.Encoding.GetBytes(result);
            using (MemoryStream memstream = new MemoryStream(bresult, 0, bresult.Length))
            using (StreamReader reader = new StreamReader(memstream, PageEncoding))
            {
                // Decode the raw bytes again, this time with the page encoding
                return reader.ReadToEnd();
            }
        }
    }
}

Now your code should get the Japanese characters in their correct form.

Uri URI = new Uri("http://www.kanji-a-day.com/level4/index.php", UriKind.Absolute);
string kanji = WebClient_DownLoadString(URI);

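// Drop everything up to and including the opening tag (19 is the length of <div class="glyph">)
kanji = kanji.Remove(0, kanji.IndexOf(@"<div class=""glyph"">") + 19);
// Drop the closing tag, everything after it, and the 2 characters just before it
kanji = kanji.Remove(kanji.IndexOf("</div>") - 2);
kanji = kanji.Trim();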

Text_DailyKanji.Text = kanji; // Set the Kanji
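
The fixed offsets above (19 and -2) depend on the exact markup of the page, and IndexOf returns -1 if the tag is missing. A more tolerant variant, sketched here with a hypothetical ExtractGlyph helper and assuming the glyph sits in the first <div class="glyph"> element with no nested div inside it, could use a regular expression:

```csharp
using System;
using System.Text.RegularExpressions;

public static class GlyphExtractor
{
    // Hypothetical helper: returns the trimmed content of the first <div class="glyph">,
    // or null when the page contains no such element.
    public static string ExtractGlyph(string html)
    {
        // Singleline lets '.' match newlines, since the content may span several lines
        Match match = Regex.Match(html, "<div class=\"glyph\">(.*?)</div>", RegexOptions.Singleline);
        return match.Success ? match.Groups[1].Value.Trim() : null;
    }
}
```

A null result can then be handled explicitly instead of surfacing as an ArgumentOutOfRangeException from Remove.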