DOM parser in Arabic

妖精的绣舞 submitted on 2020-01-26 04:03:06

Question


I have a problem parsing Arabic letters with a DOM parser: I get weird characters. I've tried switching to different encodings, but nothing worked.

The full code is at this link: http://test11.host56.com/parser.java

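public Document getDomElement(String xml) {
    Document doc = null;
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    try {
        // Encode the string as UTF-8 bytes, then wrap those bytes in a reader
        Reader reader = new InputStreamReader(new ByteArrayInputStream(
                xml.getBytes("UTF-8")));
        InputSource is = new InputSource(reader);

        DocumentBuilder db = dbf.newDocumentBuilder();

        //InputSource is = new InputSource();
        // Replaces the reader set above with one that reads the string directly
        is.setCharacterStream(new StringReader(xml));
        doc = db.parse(is);
    } catch (ParserConfigurationException | SAXException | IOException e) {
        // Parser setup, malformed XML, or I/O failure: print it and return null
        e.printStackTrace();
    }
    return doc;
}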

My XML file:

<?xml version="1.0" encoding="UTF-8"?>
<music>
<song>
    <id>1</id>    
    <title>اهلا وسهلا</title>
    <artist>بكم</artist>
    <duration>4:47</duration>
    <thumb_url>http://wtever.png</thumb_url>
</song>
</music>

Answer 1:


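You already have the XML as a String, so unless that string already contains the odd characters (that is, it was read in with the wrong encoding), you can avoid the encoding round trip here by using a StringReader instead. Note that the InputStreamReader is created without a charset, so it decodes the UTF-8 bytes with the platform default encoding, which is not necessarily UTF-8. So instead of: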

Reader reader = new InputStreamReader(new ByteArrayInputStream(
   xml.getBytes("UTF-8")));

use:

Reader reader = new StringReader(xml);
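
As a rough illustration of the StringReader approach, here is a minimal, self-contained sketch. The class name StringParseDemo and the helper parse are made up for this example and do not come from the asker's code; the Arabic text is written with Unicode escapes so that the result does not depend on the encoding of the source file.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of the original code.
class StringParseDemo {

    // Parse XML that is already held as a Java String.
    // A StringReader hands characters to the parser, so no byte decoding takes place.
    static Document parse(String xml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return db.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // "\u0628\u0643\u0645" is the artist value from the question's XML
        String xml = "<music><song><artist>\u0628\u0643\u0645</artist></song></music>";
        Document doc = parse(xml);
        String artist = doc.getElementsByTagName("artist").item(0).getTextContent();
        System.out.println(artist.equals("\u0628\u0643\u0645")); // prints "true"
    }
}
```

This only helps if the String itself is intact; if it was decoded wrongly earlier, the parser will faithfully return the wrong characters.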

Edit: now that I see more of the code, it seems the encoding issue has already happened before the XML is parsed, because the download part contains:

HttpResponse httpResponse = httpClient.execute(httpPost);
HttpEntity httpEntity = httpResponse.getEntity();
xml = EntityUtils.toString(httpEntity);

The Javadoc for EntityUtils.toString says:

The content is converted using the character set from the entity (if any), failing that, "ISO-8859-1" is used.

It seems the server does not send proper encoding information with the entity, so EntityUtils falls back to its default, which is not UTF-8.

Fix: use the variant that takes an explicit default encoding:

xml = EntityUtils.toString(httpEntity, "utf-8");

Here I assume the server sends UTF-8. If the server uses a different encoding, pass that one instead of UTF-8. (As the XML also declares encoding="UTF-8", I assume this is the case.) If the encoding the server uses is not known, you are left with guessing, sorry.
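
To see why the fallback charset matters, the following sketch decodes the same UTF-8 bytes twice using only the JDK, with no HTTP involved. The class CharsetMismatchDemo and its decode helper are hypothetical names for this illustration; decode merely stands in for what a call such as EntityUtils.toString(entity, charset) does with the response body.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: simulates decoding a response body with a given charset.
class CharsetMismatchDemo {

    // Turn raw bytes into a String using the given charset.
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    public static void main(String[] args) {
        // "\u0627\u0647\u0644\u0627" is the first word of the title in the question's XML
        String original = "\u0627\u0647\u0644\u0627";
        byte[] wire = original.getBytes(StandardCharsets.UTF_8); // what a UTF-8 server sends

        String wrong = decode(wire, StandardCharsets.ISO_8859_1); // the fallback charset
        String right = decode(wire, StandardCharsets.UTF_8);      // the explicit charset

        System.out.println(wrong.equals(original)); // prints "false"
        System.out.println(wrong.length());         // prints "8": each 2-byte letter became 2 chars
        System.out.println(right.equals(original)); // prints "true"
    }
}
```

Each Arabic letter occupies two bytes in UTF-8, and ISO-8859-1 maps every byte to one character, which is where the "weird characters" come from.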




Answer 2:


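If the XML contains non-ASCII characters such as Arabic or Persian letters, going through a StringReader can cause problems. In these cases, pass the InputStream directly to DocumentBuilder.parse(), so that the parser works on the raw bytes and can pick up the encoding from the XML declaration itself.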
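
A minimal sketch of that idea follows, assuming the bytes really are UTF-8 as the XML declaration claims. The class StreamParseDemo and its methods are invented for this example, and a ByteArrayInputStream stands in for the network stream; in the asker's code the stream would presumably come from httpEntity.getContent().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of the original code.
class StreamParseDemo {

    // Give the parser the raw bytes; it determines the encoding itself
    // (here from the encoding="UTF-8" declaration).
    static Document parse(InputStream in) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return db.parse(in);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Text of the first <title> element in the document.
    static String firstTitle(Document doc) {
        return doc.getElementsByTagName("title").item(0).getTextContent();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<music><song><title>\u0627\u0647\u0644\u0627</title></song></music>";
        // Stand-in for the bytes a server would send
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(firstTitle(parse(in)).equals("\u0627\u0647\u0644\u0627")); // prints "true"
    }
}
```

The design point is that no String is created before parsing, so there is no intermediate step where a wrong default charset could be applied.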



Source: https://stackoverflow.com/questions/14791206/dom-parser-in-arabic
