jsoup times out, xml gets white space error, basic traversing through page is time consuming

Posted by 五迷三道 on 2020-01-16 08:40:48

Question


I would like to write a program that parses an HTML page, selects the useful information, and displays it. I first did this by opening a stream and searching it line by line for the relevant content, but that is time consuming. I then decided to treat the page as XML and use XPath. I did this by creating an XML file on my system and loading the contents of the stream into it, and I got a white space error. I then decided to open the document directly as

doc = (Document) builder.parse(inputStream);

but the same error persists. After asking here, I was advised to use jsoup for HTML parsing. Now when I execute my code for:

Document doc= Jsoup.connect(url).get();

I get Read timed out. The same program written in Python, using a naive strategy such as searching with the string find method, displays the contents, and does so quickly. How can I make it work fast in Java?

Complete code:

import java.io.*;
import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Parser {
    public static void main(String[] args) {
        Validate.isTrue(true, "usage: supply url to fetch");
        try {
            String url = "http://www.spoj.com/ranks/PRIME1/";
            Document doc = Jsoup.connect(url).get();
            Elements es = doc.getElementsByAttributeValue("class", "lightrow");
            System.out.println(es.get(0).child(0).text());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Exception:

java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(Unknown Source)
at java.net.SocketInputStream.read(Unknown Source)
at java.io.BufferedInputStream.fill(Unknown Source)
at java.io.BufferedInputStream.read1(Unknown Source)
at java.io.BufferedInputStream.read(Unknown Source)
at sun.net.www.http.HttpClient.parseHTTPHeader(Unknown Source)
at sun.net.www.http.HttpClient.parseHTTP(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at java.net.HttpURLConnection.getResponseCode(Unknown Source)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:412)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:393)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:159)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:148)
at Parser.main(Parser.java:12)

Answer 1:


Does your firewall or OS block your request (maybe it blocks Java's access to the internet)? Are you using a PC or, e.g., Android? And is your HTML page a website or a (local) HTML file? Please post some more code or the exception you get.

Please make sure you don't use a DOM Document but org.jsoup.nodes.Document.
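
To illustrate why the XML route from the question tends to fail on real web pages, here is a minimal sketch using only the JDK. The class name StrictXmlSketch, the helper tryParse, and the sample markup are made up for illustration. It also assumes that the question's "white space" error came from an HTML DOCTYPE that has a public identifier but no system identifier, which a strict XML parser rejects; the original page is not available to confirm this.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;

public class StrictXmlSketch {

    /** Hypothetical helper: reports whether the JDK's XML parser accepts the markup. */
    public static String tryParse(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            // Fully qualified on purpose: this is the DOM Document, not org.jsoup.nodes.Document.
            org.w3c.dom.Document doc = builder.parse(
                    new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            return "parsed: " + doc.getDocumentElement().getTagName();
        } catch (SAXParseException e) {
            // The markup is not well-formed XML; the parser stops at the first violation.
            return "rejected: " + e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // Well-formed markup: every element is closed, no DOCTYPE.
        System.out.println(tryParse("<html><body><div>some value</div></body></html>"));

        // Typical HTML (assumed sample): DOCTYPE without a system identifier,
        // an unclosed <br>, and an unclosed <p>.
        System.out.println(tryParse(
                "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01//EN\">"
                + "<html><body><br><p>some value</body></html>"));
    }
}
```

The first call is accepted and the second is rejected, which is why an HTML-aware parser such as jsoup is the better fit for pages that were never written as XML.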

You wrote: "I am displayed the contents"

How do you want to display the content? If you simply need a value like this:

...
<div>some value</div>
...

You can do this with jsoup:

Document doc = ... // parse html file or connect to website

final String value = doc.select("div").first().text();

System.out.println(value);

Edit:

Since the default connection timeout in this version of jsoup is 3 seconds (3000 ms; newer jsoup releases default to 30 seconds), it should be increased for large pages, because loading the data may take some time:

final String url = "http://www.spoj.com/ranks/PRIME1/";
final int timeout = 4000; // or higher

Document doc = Jsoup.connect(url).timeout(timeout).get();
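
The stack trace in the question shows jsoup delegating to HttpURLConnection, so the effect of the read timeout can be reproduced with the JDK alone. The following is a minimal sketch, not jsoup's implementation: the class name ReadTimeoutSketch, the helpers fetch and startSlowServer, and the 500 ms delay are all hypothetical, and a local server stands in for a slow website. It assumes Java 9 or later for readAllBytes.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ReadTimeoutSketch {

    /** Hypothetical helper: returns the response body, or "TIMEOUT" if a timeout elapses first. */
    public static String fetch(String url, int timeoutMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(timeoutMillis); // limit for establishing the connection
        conn.setReadTimeout(timeoutMillis);    // limit for waiting on response data
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (SocketTimeoutException e) {
            // Same exception type as in the question's stack trace.
            return "TIMEOUT";
        } finally {
            conn.disconnect();
        }
    }

    /** Hypothetical stand-in for a slow site: waits delayMillis before answering. */
    public static HttpServer startSlowServer(long delayMillis) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            byte[] body = "<div>some value</div>".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = startSlowServer(500);
        String url = "http://127.0.0.1:" + server.getAddress().getPort() + "/";
        try {
            System.out.println(fetch(url, 100));  // shorter than the delay: expected to time out
            System.out.println(fetch(url, 3000)); // longer than the delay: expected to return the body
        } finally {
            server.stop(0);
        }
    }
}
```

The timeout only has to exceed the time the server needs before it starts sending data, so raising it makes a slow page loadable but does not make the page itself load any faster.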


Source: https://stackoverflow.com/questions/14155062/jsoup-times-out-xml-gets-white-space-error-basic-traversing-through-page-is-ti
