HTML file fetched using 'wget' reported as binary by 'less'

Submitted by 瘦欲@ on 2019-12-24 02:17:06

Question


If I use wget to download this page:

wget http://www.aqr.com/ResearchDetails.htm -O page.html

and then attempt to view the page in less, less reports the file as binary.

less page.html 
"page.html" may be a binary file.  See it anyway? 

These are the response headers:

Accept-Ranges:bytes
Cache-Control:private
Content-Encoding:gzip
Content-Length:8295
Content-Type:text/html
Cteonnt-Length:44064
Date:Sun, 25 Sep 2011 12:15:53 GMT
ETag:"c0859e4e785ecc1:6cd"
Last-Modified:Fri, 19 Aug 2011 14:00:09 GMT
Server:Microsoft-IIS/6.0
X-Powered-By:ASP.NET

Opening the file in vim works fine.

Any clues as to why less cannot handle it?


Answer 1:


It's a UTF-16 encoded file (you can check this with the W3C Validator). You can convert it to UTF-8 with this command:

wget http://www.aqr.com/ResearchDetails.htm -q -O - | iconv -f utf-16 -t utf-8 > page.html

less usually handles UTF-8.
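
If page.html has already been downloaded, the same conversion can be run on the local file instead of fetching it again. This is a minimal sketch, assuming an iconv that picks the byte order from the BOM when given `-f utf-16` (glibc's does). `to_utf8` and the sample file names are hypothetical, used only for illustration:

```shell
# Convert a UTF-16 file (with BOM) to UTF-8 without downloading it again.
# to_utf8 is a hypothetical helper name, not part of any tool.
to_utf8() {
    # $1 = input file (UTF-16 with BOM), $2 = output file (UTF-8)
    iconv -f utf-16 -t utf-8 "$1" > "$2"
}

# Build a tiny UTF-16LE sample: BOM (ff fe) followed by "<p>hi</p>" and a newline.
# Octal escapes are used because \x is not portable across printf implementations.
printf '\377\376<\000p\000>\000h\000i\000<\000/\000p\000>\000\n\000' > sample-utf16.html

to_utf8 sample-utf16.html sample-utf8.html
cat sample-utf8.html
```

Writing to a second file avoids truncating the input before iconv has read it, which is what would happen with `iconv ... page.html > page.html`.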

edit:

As @Stephen C reported, less in Red Hat supports UTF-16. It looks to me as though Red Hat patched less to add UTF-16 support. On the official less site, UTF-16 support is currently an open issue (ref number 282).




Answer 2:


Because it is UTF-16 encoded, as can be seen from the BOM of ff fe in the first two octets (od -x prints 16-bit words in host byte order, so on a little-endian machine they appear as feff):

$ od -x page.html | head -1
0000000 feff 003c 0021 0044 004f 0043 0054 0059

vim is smarter about it than less (it is more a product of the Unicode era).
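
The BOM check can also be done byte by byte, which sidesteps the word swapping of `od -x`. This is a rough sketch, assuming `head -c` is available (it is in GNU and BSD userlands). `bom_of` and the sample file name are hypothetical:

```shell
# Report the encoding implied by a file's byte-order mark, if any.
# bom_of is a hypothetical helper name used only for this sketch.
bom_of() {
    # First three bytes as space-separated hex in file order, e.g. "ff fe 3c".
    bytes=$(head -c 3 "$1" | od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//')
    case "$bytes" in
        # Caveat: a UTF-32LE BOM (ff fe 00 00) also starts with ff fe.
        "ff fe"*)    echo "UTF-16LE" ;;
        "fe ff"*)    echo "UTF-16BE" ;;
        "ef bb bf"*) echo "UTF-8" ;;
        *)           echo "no BOM" ;;
    esac
}

# Tiny sample: little-endian BOM followed by "<" encoded as UTF-16LE.
printf '\377\376<\000' > bom-sample.html
bom_of bom-sample.html
```

A file without a BOM can still be UTF-16, so "no BOM" here only means the check is inconclusive.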

added:

See "Convert UTF-16 to UTF-8 under Windows and Linux, in C" for what to do about it. Or use vim to write it back out with UTF-8 encoding.




Answer 3:


Firstly, it works for me. When I download the file using that command, I get a file that "less" shows me without any questions or problems. (I use Red Hat Fedora 14.)

Second, the "file" command reports "page.html" as:

page.html: Little-endian UTF-16 Unicode HTML document text, with very long lines, with CRLF line terminators

Maybe the UTF-16 encoding is the cause of the problems. (But I don't know why it would work with my version of "less" and not yours.)


@palacsint's solution works for me:

wget http://www.aqr.com/ResearchDetails.htm -q -O - | \
     iconv -f utf-16 -t utf-8 > page.html



Answer 4:


Very likely this HTML file contains UTF characters and your locale is not set correctly (export LANG=en_US.UTF8 LESSCHARSET=utf-8). It may also be that the HTML contains invalid characters.

EDIT: After checking the file, I can clearly see that it is UTF-16. So you need to adjust your terminal settings accordingly (although I was able to see the text correctly with a UTF-8 setting; perhaps my terminal program is smart).



Source: https://stackoverflow.com/questions/7545459/html-file-fetched-using-wget-reported-as-binary-by-less
