How to correctly read url content with utf8 chars?

匆匆过客 提交于 2019-12-09 02:02:23

问题


    public class URLReader {
         public static byte[] read(String from, String to, String string){
          try {
           String text = "http://translate.google.com/translate_a/t?"+
                        "client=o&text="+URLEncoder.encode(string, "UTF-8")+
                        "&hl=en&sl="+from+"&tl="+to+"";

           URL url = new URL(text);
           BufferedReader in = new BufferedReader(
                        new InputStreamReader(url.openStream(), "UTF-8"));
           String json = in.readLine();
           byte[] bytes = json.getBytes("UTF-8");
           in.close();
           return bytes;
                    //return text.getBytes();
          }
          catch (Exception e) {
           return null;
          }
         }
        }

and:

public class AbcServlet extends HttpServlet {
 public void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
  resp.setContentType("text/plain;charset=UTF-8");
  resp.getWriter().println(new String(URLReader.read("pl", "en", "koń")));
 }
}

When I run this i get:{"sentences"[{"trans":"end","orig":"koďż˝","translit":"","src_translit":""}],"src":"pl","server_time":30} so utf doesnt work correctly but if i return encoded url: http://translate.google.com/translate_a/t?client=o&text=ko%C5%84&hl=en&sl=pl&tl=en and paste at url bar i get correctly:{"sentences":[{"trans":"horse","orig":"koń","translit":"","src_translit":""}],"dict":[{"pos":"noun","terms":["horse"]}],"src":"pl","server_time":76}


回答1:


byte[] bytes = json.getBytes("UTF-8");

gives you a UTF-8 bytes sequences so URLReader.read also give you UTF-8 bytes sequences

but you tried to decode with without specifying the encoder, i.e. new String(URLReader.read("pl", "en", "koń")) so Java will use your system default encoding to decode (which is not UTF-8)

Try :

new String(URLReader.read("pl", "en", "koń"), "UTF-8")

Update

Here is fully working code on my machine:

public class URLReader {

    public static byte[] read(String from, String to, String string) {
        try {
            String text = "http://translate.google.com/translate_a/t?"
                    + "client=o&text=" + URLEncoder.encode(string, "UTF-8")
                    + "&hl=en&sl=" + from + "&tl=" + to + "";
            URL url = new URL(text);
            URLConnection conn = url.openConnection();
            // Look like faking the request coming from Web browser solve 403 error
            conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)");
            BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
            String json = in.readLine();
            byte[] bytes = json.getBytes("UTF-8");
            in.close();
            return bytes;
            //return text.getBytes();
        } catch (Exception e) {
            System.out.println(e);
            // becarful with returning null. subsequence call will return NullPointException.
            return null;
        }
    }
}

Don't forget to escape ń to \u0144. Java compiler may not compile Unicode text properly so it is good idea to write it in plain ASCII.

public class AbcServlet extends HttpServlet {

    @Override
    public void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        resp.setContentType("text/plain;charset=UTF-8");
        byte[] read = URLReader.read("pl", "en", "ko\u0144");
        resp.getOutputStream().write(read) ;
    }
}


来源:https://stackoverflow.com/questions/4555128/how-to-correctly-read-url-content-with-utf8-chars

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!