UTF-8 characters not showing properly

Submitted by 天涯浪子 on 2019-12-24 07:03:56

Question


I am using Nutch 1.4 and Solr 3.3.0 to crawl and index my site, which is in French. My site used to be in ISO-8859-1.

Currently I have 2 indexes under Solr. In the first one I store my old pages (in ISO-8859-1) and in the second one I store my new pages (in UTF-8).

I use the same Nutch configuration for both crawl jobs to fetch and index the old and the new pages on my site. I have not added any settings for character encodings on my own (I think).

I am facing a problem when searching the new pages, which are supposed to be in UTF-8: the French characters don't display properly. For the old pages, which are in ISO-8859-1, everything seems to be fine.

I was wondering if anyone could point me in the right direction for fixing this problem.

I believe the problem comes from Nutch, because when I created a dump of the segments I saw those funny characters in the dump file.

Thank you.


Answer 1:


In nutch-default.xml, the "parser.character.encoding.default" property should be set accordingly. Its default value is "windows-1252"; you just have to set it to utf-8.
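To see why a windows-1252 fallback would produce the "funny characters" from the question, here is a minimal Python sketch (not Nutch code) that encodes French text as UTF-8 and then decodes the bytes with the wrong charset. The function name `misdecode` is made up for illustration.

```python
def misdecode(text: str, wrong_encoding: str = "windows-1252") -> str:
    """Encode text as UTF-8, then decode those bytes with the wrong charset.

    errors="replace" is used because a few byte values are undefined in
    windows-1252 and would otherwise raise UnicodeDecodeError.
    """
    return text.encode("utf-8").decode(wrong_encoding, errors="replace")


if __name__ == "__main__":
    # "é" is the two bytes C3 A9 in UTF-8; read as windows-1252 they
    # become two separate characters.
    print(misdecode("é"))    # Ã©
    print(misdecode("été"))  # Ã©tÃ©
```

If the garbage in your segment dump looks like this (two characters where one accented letter should be), the pages are most likely UTF-8 bytes being read with a single-byte charset.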




Answer 2:


I'm not that familiar with Nutch, but I have seen this with other things.

A couple of things you should check or do:

  1. The web server may not be declaring UTF-8 for your new pages in the Content-Type header it sends
  2. The charset meta tags in the new pages may still say ISO-8859-1
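Both checks can be done by hand, but as a rough sketch, the two Python helpers below pull the declared charset out of a Content-Type header value and out of a page's meta tag. The function names are hypothetical, and the regular expressions are a simplification that assumes reasonably well-formed input; they are not a full HTML or HTTP parser.

```python
import re
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the charset named in a Content-Type header value, if any."""
    match = re.search(r"charset\s*=\s*\"?([\w.:-]+)", header_value, re.IGNORECASE)
    return match.group(1).lower() if match else None


def charset_from_meta(html: str) -> Optional[str]:
    """Return the charset named in the first <meta ... charset=...> tag, if any."""
    match = re.search(
        r"<meta[^>]*charset\s*=\s*[\"']?([\w.:-]+)", html, re.IGNORECASE
    )
    return match.group(1).lower() if match else None


if __name__ == "__main__":
    print(charset_from_content_type("text/html; charset=UTF-8"))
    print(charset_from_meta('<meta charset="iso-8859-1">'))
```

If the header and the meta tag disagree, or either one still says iso-8859-1 for a page saved as UTF-8, that mismatch is a likely source of the garbled characters.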

What I recommend is to take all the old pages of your site and use a tool like iconv to convert them to UTF-8. Then configure your web server so that all text is treated as UTF-8 (that is, the Content-Type header it sends back says UTF-8).
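If iconv is not at hand, the same conversion can be sketched in a few lines of Python. This is only an illustration, assuming the old files really are ISO-8859-1; the function names and the file paths passed in are placeholders.

```python
from pathlib import Path


def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8."""
    return data.decode("iso-8859-1").encode("utf-8")


def convert_file(src: Path, dst: Path) -> None:
    """Read src as ISO-8859-1 bytes and write the UTF-8 version to dst."""
    dst.write_bytes(latin1_to_utf8(src.read_bytes()))


if __name__ == "__main__":
    # "café" in ISO-8859-1 is 4 bytes; in UTF-8 the "é" takes 2 bytes.
    print(latin1_to_utf8(b"caf\xe9"))  # b'caf\xc3\xa9'
```

Remember to update the charset meta tag inside each converted page too, since re-encoding the bytes does not change what the page declares about itself.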



Source: https://stackoverflow.com/questions/9825793/utf-8-characters-not-showing-properly
