C# WebClient - DownloadString bad encoding

不问归期 posted on 2021-02-05 07:01:48

Question


I'm trying to download an HTML document from Amazon, but for some reason I get a badly encoded string like "��K��g��g�e".

Here's the code I tried:

using (var webClient = new System.Net.WebClient())
{
    var url = "https://www.amazon.com/dp/B07H256MBK/";
    webClient.Encoding = Encoding.UTF8;
    var result = webClient.DownloadString(url);
}

Same thing happens when using HttpClient:

var url = "https://www.amazon.com/dp/B07H256MBK/";
var httpclient = new HttpClient();
var html = await httpclient.GetStringAsync(url);

I also tried reading the result as bytes and then converting it back to UTF-8, but I still get the same result. Also note that this DOES NOT always happen. For example, yesterday I ran this code for ~2 hours and got a correctly encoded HTML document the whole time. Today, however, I always get a badly encoded result. It happens every other day, so it's not a one-time thing.

==================================================================

However, when I use HtmlAgilityPack's wrapper, it works as expected every time:

var url = "https://www.amazon.com/dp/B07H256MBK/";
HtmlWeb htmlWeb = new HtmlWeb();
HtmlDocument doc = htmlWeb.Load(url);

What causes WebClient and HttpClient to return a badly encoded string even when I explicitly set the correct encoding? And how does HtmlAgilityPack's wrapper work by default?

Thanks for any help!


Answer 1:


I fired up Firefox's web dev tools, requested that page, and looked at the response headers. They include content-encoding: gzip, which means the response body is gzip-compressed.

It turns out that Amazon gives you a response compressed with gzip even when you don't send an Accept-Encoding: gzip header (verified with another tool). This is a bit naughty, but not that uncommon, and easy to work around.

This wasn't a problem with character encodings at all. HttpClient is good at figuring out the correct encoding from the Content-Type header.
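The diagnosis above can be checked from code, because a gzip stream always begins with the two bytes 0x1F 0x8B. The following is a minimal sketch, not part of the original answer, that looks for that signature and decompresses by hand with GZipStream. The names GzipSniffer, LooksGzipped and DecodeBody are made up for illustration, and the sketch assumes the underlying page is UTF-8.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzipSniffer
{
    // Returns true when the buffer starts with the gzip magic number (0x1F 0x8B).
    public static bool LooksGzipped(byte[] body)
    {
        return body != null && body.Length >= 2 && body[0] == 0x1F && body[1] == 0x8B;
    }

    // Decompresses the body if it is gzipped, then decodes it as UTF-8.
    // Assumption (not from the original post): the document itself is UTF-8.
    public static string DecodeBody(byte[] body)
    {
        if (!LooksGzipped(body))
        {
            return Encoding.UTF8.GetString(body);
        }

        using (var input = new MemoryStream(body))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            gzip.CopyTo(output);
            return Encoding.UTF8.GetString(output.ToArray());
        }
    }
}
```

With WebClient, the raw bytes would come from webClient.DownloadData(url), so passing that array to GzipSniffer.DecodeBody should give readable HTML whether or not the server compressed the response.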

You can tell HttpClient to un-zip responses with:

HttpClientHandler handler = new HttpClientHandler()
{
    AutomaticDecompression = DecompressionMethods.GZip,
};

using (var client = new HttpClient(handler))
{
    // your code
}

This is set automatically if you're using versions 4.1.0 to 4.3.2 of the System.Net.Http NuGet package; otherwise you'll need to do it yourself.

You can do the same with WebClient, but it's harder.
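A common way to do this is to subclass WebClient and switch on AutomaticDecompression on the underlying HttpWebRequest. This is a sketch under that assumption rather than the answerer's own code; the class name DecompressingWebClient and the helper EnableDecompression are hypothetical.

```csharp
using System;
using System.Net;

public class DecompressingWebClient : WebClient
{
    // Turns on transparent gzip/deflate decoding when the request is an HTTP(S) request.
    // Other request types (e.g. file or FTP) are returned unchanged.
    public static WebRequest EnableDecompression(WebRequest request)
    {
        if (request is HttpWebRequest httpRequest)
        {
            httpRequest.AutomaticDecompression =
                DecompressionMethods.GZip | DecompressionMethods.Deflate;
        }
        return request;
    }

    // WebClient calls this to build each request, so every download gets the setting.
    protected override WebRequest GetWebRequest(Uri address)
    {
        return EnableDecompression(base.GetWebRequest(address));
    }
}
```

Swapping new System.Net.WebClient() for new DecompressingWebClient() in the question's code should then make DownloadString return the decompressed HTML.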



Source: https://stackoverflow.com/questions/60753996/c-sharp-webclient-downloadstring-bad-encoding
