Question
The following has been amusing me for a while now.
First of all, I have been scraping sites for a couple of months, among them Hebrew sites, and have had no problem whatsoever receiving Hebrew characters from the HTTP server.
For some reason I am very curious to sort out, the following site is an exception: I can't get the characters properly decoded. I tried emulating the working requests I make via Fiddler, but to no avail. My C# request headers look exactly the same, but the characters are still not readable.
What I do not understand is why I have always been able to retrieve Hebrew characters from other sites, but not from this one specifically. What setting is causing this?
Try the following sample out.
HttpClient httpClient = new HttpClient();
httpClient.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0");
//httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept", "text/html;q=0.9");
//httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Language", "en-US,en;q=0.5");
//httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Encoding", "gzip, deflate");
var getTask = httpClient.GetStringAsync("http://winedepot.co.il/Default.asp?Page=Sale");
//doing it like this for the sake of the example
var contents = getTask.Result;
//add a breakpoint at the following line to check the contents of "contents"
Console.WriteLine();
As mentioned, this code works for any other Israeli site I try, say, the Ynet news site, for instance.
Update: while "debugging" with Fiddler, I noticed that the response for the Ynet site (one that works) includes the header
Content-Type: text/html; charset=UTF-8
while this header is absent from the response from winedepot.co.il.
I tried adding it myself, but it still made no difference:
var getTask = httpClient.GetAsync("http://www.winedepot.co.il");
var response = getTask.Result;
var contentObj = response.Content;
contentObj.Headers.Remove("Content-Type");
contentObj.Headers.Add("Content-Type", "text/html; charset=UTF-8");
var readTask = response.Content.ReadAsStringAsync();
var contents = readTask.Result;
Console.WriteLine();
Answer 1:
The problem you're encountering is that the webserver is lying about its content-type, or rather, not being specific enough.
The first site responds with this header:
Content-Type: text/html; charset=UTF-8
The second one with this header:
Content-Type: text/html
This means that in the second case, your client will have to make assumptions about what encoding the text is actually in. To learn more about text encodings, please read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
And the built-in HTTP clients for .NET don't really do a great job at this, which is understandable, because it is a Hard Problem. Read the linked article for the trouble a web browser will have to go through in order to guess the encoding, and then try to understand why you don't want this logic in a programmable web client.
Now, such sites do provide a <meta http-equiv="Content-Type" content="actual encoding here" /> tag, which is a nasty workaround for not properly configuring the web server. When a browser encounters such a tag, it has to restart parsing the document with the specified content type, and then hope it is correct.
The steps are roughly as follows, assuming an HTML payload:
- Perform the web request and keep the response document in a binary buffer.
- Inspect the Content-Type header, if present; if it is absent or does not provide a charset, make an assumption about the encoding.
- Read the response by decoding the buffer and parsing the resulting HTML.
- When encountering a <meta http-equiv="Content-Type" /> tag, discard all decoded text and start again, interpreting the binary buffer as text in the encoding specified by the tag.
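The re-decoding step above can be sketched like this. This is a minimal illustration, not a production-grade sniffer: the `CharsetSniffer` class and its naive regex are my own construction, and it only handles charset names that .NET recognizes.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class CharsetSniffer
{
    // Decode an HTML byte buffer: do a first pass with a fallback
    // encoding, and if a <meta ... charset=...> declaration turns up,
    // discard that pass and re-decode the ORIGINAL bytes with the
    // declared encoding.
    public static string Decode(byte[] buffer, Encoding fallback)
    {
        string firstPass = fallback.GetString(buffer);

        // Matches both <meta charset="..."> and
        // <meta http-equiv="Content-Type" content="text/html; charset=...">
        var match = Regex.Match(firstPass,
            @"charset\s*=\s*[""']?([A-Za-z0-9\-_]+)",
            RegexOptions.IgnoreCase);

        if (match.Success)
        {
            try
            {
                var declared = Encoding.GetEncoding(match.Groups[1].Value);
                // Re-interpret the raw bytes, not the mis-decoded text.
                return declared.GetString(buffer);
            }
            catch (ArgumentException)
            {
                // Unknown charset name: keep the first pass.
            }
        }
        return firstPass;
    }
}
```

Note that the meta declaration is ASCII-compatible in practice, which is the only reason the first pass can find it at all.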
The C# HTTP clients stop at step 2, and rightfully so. They are HTTP clients, not HTML-displaying browsers. They don't care that your payload is HTML, JSON, XML, or any other textual format.
When no charset is given in the Content-Type response header, the .NET HTTP clients default to the ISO-8859-1 encoding, which cannot display the characters of the Windows-1255 (Hebrew) character set that the page is actually encoded in (or rather, it has different characters at the same code points).
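To make the code-point clash concrete: the byte 0xE0 is the Hebrew letter Alef (א) in Windows-1255, but 'à' in ISO-8859-1. A small demonstration follows; note that on .NET Core / .NET 5+ the Windows-1255 code page is only available after registering the provider from the System.Text.Encoding.CodePages package (my assumption about your runtime; on .NET Framework it is built in).

```csharp
using System;
using System.Text;

class EncodingClash
{
    static void Main()
    {
        // Needed on .NET Core / .NET 5+ to unlock the Windows-125x code pages.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        var win1255 = Encoding.GetEncoding(1255);
        var latin1 = Encoding.GetEncoding("ISO-8859-1");

        // The Hebrew letter Alef (א) is byte 0xE0 in Windows-1255...
        byte[] bytes = win1255.GetBytes("א");
        Console.WriteLine(bytes[0].ToString("X2")); // E0

        // ...but ISO-8859-1 maps 0xE0 to 'à', which is the mojibake you
        // see when the client guesses the wrong encoding.
        Console.WriteLine(latin1.GetString(bytes)); // à
    }
}
```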
Some C# implementations that try to do encoding detection from the meta HTML element are provided in Encoding trouble with HttpWebResponse. I cannot vouch for their correctness, so use them at your own risk. I do know that the currently highest-voted answer actually re-issues the request when it encounters the meta tag, which is quite silly, because there is no guarantee that the second response will be the same as the first, and it wastes bandwidth.
You can also assume that you know the encoding used by a certain site or page, and force that encoding:
using (Stream resStream = response.GetResponseStream())
using (StreamReader reader = new StreamReader(resStream, YourFixedEncoding))
{
    string content = reader.ReadToEnd();
}
Or, for HttpClient:
using (var client = new HttpClient())
{
    var response = await client.GetAsync(url);
    var responseStream = await response.Content.ReadAsStreamAsync();
    using (var fixedEncodingReader = new StreamReader(responseStream, Encoding.GetEncoding(1255)))
    {
        string responseString = fixedEncodingReader.ReadToEnd();
    }
}
But assuming an encoding for a particular response, URL, or site is inherently unsafe: there is no guarantee that the assumption will be correct every time.
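A somewhat more defensive pattern, sketched below, is to honor the charset from the Content-Type response header when it is present and fall back to a fixed guess only when it is absent. `ReadStringWithFallbackAsync` and its fallback parameter are my own names, not part of the HttpClient API.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

static class HttpContentExtensions
{
    // Decode the response body with the charset declared in the
    // Content-Type header; if none is declared (as with winedepot.co.il),
    // fall back to a caller-supplied guess such as Windows-1255.
    public static async Task<string> ReadStringWithFallbackAsync(
        this HttpResponseMessage response, Encoding fallback)
    {
        byte[] raw = await response.Content.ReadAsByteArrayAsync();
        string charset = response.Content.Headers.ContentType?.CharSet;

        Encoding encoding = fallback;
        if (!string.IsNullOrEmpty(charset))
        {
            try { encoding = Encoding.GetEncoding(charset.Trim('"')); }
            catch (ArgumentException) { /* unknown charset name: keep the fallback */ }
        }
        return encoding.GetString(raw);
    }
}
```

Because the body is read as raw bytes first, the decoding decision is made exactly once, without re-reading the response or re-issuing the request.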
Source: https://stackoverflow.com/questions/36327747/one-specific-site-which-http-response-hebrew-characters-do-not-come-property-e