Reading the content of a web page

Submitted by 江枫思渺然 on 2019-12-20 03:10:12

Question


Hi, I want to read the content of a web page that contains German characters using Java. Unfortunately, the German characters appear as strange characters. Any help, please? Here is my code:

String link = "some german link";
URL url = new URL(link);
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
    System.out.println(inputLine);
}

Answer 1:


You have to set the correct encoding. You can find the encoding in the HTTP header:

Content-Type: text/html; charset=ISO-8859-1

This may be overridden in the (X)HTML document itself; see HTML character encodings.
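As a rough sketch of how the charset could be pulled out of such a header value: the class name ContentTypeCharset and the helper charsetOf are made up for illustration, and falling back to a caller-supplied default when no charset parameter is present is an assumption, not something the HTTP header mandates.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class ContentTypeCharset {

    // Hypothetical helper: returns the charset named in a Content-Type
    // header value, or the fallback when none is present or usable.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both malformed and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Header value taken from the example above.
        System.out.println(charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
    }
}
```

With the code from the question, the header value would come from url.openConnection().getContentType(), and the resulting Charset would be passed as the second argument to InputStreamReader. This does not look at a charset declared inside the (X)HTML document.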

I imagine you will have to deal with many additional issues to parse a web page without errors. There are several HTTP client libraries available for Java that handle this for you, e.g. org.apache.httpcomponents. The code will look like this:

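import java.io.IOException;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

DefaultHttpClient httpclient = new DefaultHttpClient();
HttpGet httpGet = new HttpGet("http://www.spiegel.de");

try
{
  HttpResponse response = httpclient.execute(httpGet);
  HttpEntity entity = response.getEntity();
  if (entity != null)
  {
    // Decodes the body using the charset from the Content-Type header
    System.out.println(EntityUtils.toString(entity));
  }
}
catch (ClientProtocolException e) {e.printStackTrace();}
catch (IOException e) {e.printStackTrace();}
finally {httpclient.getConnectionManager().shutdown();} // release connections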

This is the Maven artifact:

<dependency>
  <groupId>org.apache.httpcomponents</groupId>
  <artifactId>httpclient</artifactId>
  <version>4.1.1</version>
  <type>jar</type>
  <scope>compile</scope>
</dependency>



Answer 2:


You need to specify the character set for your InputStreamReader, like this:

new InputStreamReader(url.openStream(), "UTF-8")



Answer 3:


Try setting a Charset:

new BufferedReader(new InputStreamReader(url.openStream(), Charset.forName("UTF-8")));
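A fuller sketch of the same idea, in case it helps. The class ReadWithCharset and the helper readAll are hypothetical names, and a ByteArrayInputStream stands in for url.openStream() so the example runs without a network connection. It also assumes the page really is UTF-8, which you would need to confirm for your site.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReadWithCharset {

    // Hypothetical helper: decodes the stream with the given charset and
    // returns its lines, each terminated by '\n'. Closes the stream when done.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for url.openStream(): the UTF-8 bytes of a German phrase
        // (unicode escapes keep this source file ASCII-only).
        byte[] page = "Gr\u00fc\u00dfe aus K\u00f6ln\n".getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(page), StandardCharsets.UTF_8);
        System.out.print(text);
    }
}
```

StandardCharsets.UTF_8 (Java 7 and later) is used here instead of Charset.forName("UTF-8"); both give the same Charset, the constant just avoids a string lookup.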



Answer 4:


First, verify that the font you are using can support the particular German characters you are trying to display. Many fonts don't carry all characters, and it is a big pain looking for other reasons when it's a simple "missing character" issue.

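If that's not the issue, then either your input or your output is in the wrong character set. Character sets determine how the numbers representing the characters get mapped to the glyphs (the pictures representing the characters). Java strings are UTF-16 internally, so once the text has been decoded correctly it can hold any German character. The more likely culprit is the input stream: an InputStreamReader created without a charset decodes the bytes with the platform default, which may not match the page. Check the input stream first.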
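To illustrate what a mismatched input character set does, here is a small sketch. The class MojibakeDemo and its two helpers are invented for this example; they encode a string with one charset and decode the bytes with another, which is what effectively happens when the reader's charset does not match the page.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Encodes as UTF-8 but decodes as ISO-8859-1: every non-ASCII
    // character turns into two unrelated Latin-1 characters.
    static String decodeUtf8AsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Encodes as ISO-8859-1 but decodes as UTF-8: bytes that are not
    // valid UTF-8 are replaced with U+FFFD, the replacement character.
    static String decodeLatin1AsUtf8(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00fc"; // the letter u-umlaut
        String wrong = decodeUtf8AsLatin1(original);
        // One character became two: U+00C3 followed by U+00BC.
        System.out.println(original.length() + " -> " + wrong.length());
        System.out.println(decodeLatin1AsUtf8(original).contains("\uFFFD"));
    }
}
```

If the "strange characters" come in pairs starting with a capital A with a tilde, the page is probably UTF-8 being read as ISO-8859-1 (or a similar single-byte charset). If they show up as question marks or replacement boxes, it may be the other way round.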



Source: https://stackoverflow.com/questions/6188901/reading-the-content-of-web-page
