Jsoup not downloading entire page

Posted by 独自空忆成欢 on 2020-01-04 04:05:09

Question


The webpage is: http://www.hkex.com.hk/eng/market/sec_tradinfo/stockcode/eisdeqty_pf.htm

I want to extract all the <tr class="tr_normal"> elements using Jsoup.

The code I am using is:

    Document doc = Jsoup.connect(url).get();
    Elements es = doc.getElementsByClass("tr_normal");
    System.out.println(es.size());

But the reported size (1350) is smaller than the actual number of elements (1452). I saved the page to my computer, deleted some <tr> elements, and ran the same code against the local copy; the count was then correct. Are there so many elements that jsoup can't read all of them?

What is going on? Thanks!


Answer 1:


Nothing is wrong with the selector engine. I initially suspected jsoup's internal HTTP connection handling and recommended replacing it with Apache HttpClient (http://hc.apache.org/), or reading the jsoup source to see how it handles the connection if you can't add HttpClient as a dependency.

Updated answer: the actual issue is the default maxBodySize of Jsoup.Connection. The response body is cut off once it reaches that limit, so jsoup parses an incomplete page and finds fewer elements. Raising the limit with maxBodySize(...), or passing 0 for no limit, returns the full page. I have kept the HttpClient code as a sample.

Output of the program:

  • load from file= 1452
  • load from http client= 1452
  • load from jsoup connect= 1350
  • load from jsoup connect using maxBodySize= 1452

    package test;
    
    import java.io.IOException;
    import java.io.InputStream;
    
    import org.apache.http.HttpResponse;
    import org.apache.http.client.ClientProtocolException;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.HttpClientBuilder;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;
    
    public class TestJsoup {
    
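        /**
         * Counts the tr_normal rows after loading the same page in four ways.
         *
         * @param args unused
         * @throws IOException if the page cannot be read
         */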
        public static void main(String[] args) throws IOException {
            Document doc = Jsoup.parse(loadContentFromClasspath(), "UTF8", "");
            Elements es = doc.getElementsByClass("tr_normal");
            System.out.println("load from file= " + es.size());
    
            doc = Jsoup.parse(loadContentByHttpClient(), "UTF8", "");
            es = doc.getElementsByClass("tr_normal");
            System.out.println("load from http client= " + es.size());
    
            String url = "http://www.hkex.com.hk/eng/market/sec_tradinfo"
                    + "/stockcode/eisdeqty_pf.htm";
            doc = Jsoup.connect(url).get();
            es = doc.getElementsByClass("tr_normal");
            System.out.println("load from jsoup connect= " + es.size());
    
            // About 2 MB (the default is 1 MB); 0 means unlimited size.
            int maxBodySize = 2048000;
            doc = Jsoup.connect(url).maxBodySize(maxBodySize).get();
            es = doc.getElementsByClass("tr_normal");
            System.out.println("load from jsoup connect using maxBodySize= " + es.size());
        }
    
        public static InputStream loadContentByHttpClient()
                throws ClientProtocolException, IOException {
            String url = "http://www.hkex.com.hk/eng/market/sec_tradinfo"
                    + "/stockcode/eisdeqty_pf.htm";
            HttpClient client = HttpClientBuilder.create().build();
            HttpGet request = new HttpGet(url);
            HttpResponse response = client.execute(request);
            return response.getEntity().getContent();
        }
    
        // Reads the saved copy of the page from the classpath (no HTTP involved).
        public static InputStream loadContentFromClasspath()
                throws IOException {
            return TestJsoup.class.getClassLoader().getResourceAsStream(
                    "eisdeqty_pf.htm");
        }
    
    }
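
The effect of a body size cap can also be reproduced without any network access. The following is a minimal sketch using only the Java standard library; it does not call jsoup, and the class name `TruncationDemo`, its helper methods, and the generated page are all hypothetical stand-ins for illustration. It assumes, as the answer describes, that a capped body is simply cut off at the limit and that 0 means no limit.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TruncationDemo {

    /** Builds a hypothetical HTML page containing the given number of tr_normal rows. */
    public static String buildPage(int rows) {
        StringBuilder sb = new StringBuilder("<html><body><table>\n");
        for (int i = 0; i < rows; i++) {
            sb.append("<tr class=\"tr_normal\"><td>").append(i).append("</td></tr>\n");
        }
        return sb.append("</table></body></html>").toString();
    }

    /** Keeps at most maxBytes bytes of the body; 0 means no limit (assumed convention). */
    public static String capBody(String body, int maxBytes) {
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        if (maxBytes == 0 || bytes.length <= maxBytes) {
            return body;
        }
        // The generated page is pure ASCII, so cutting at a byte offset is safe here.
        return new String(Arrays.copyOf(bytes, maxBytes), StandardCharsets.UTF_8);
    }

    /** Counts complete opening tags of the row class using plain string search. */
    public static int countRows(String html) {
        String marker = "<tr class=\"tr_normal\">";
        int count = 0;
        int from = html.indexOf(marker);
        while (from >= 0) {
            count++;
            from = html.indexOf(marker, from + marker.length());
        }
        return count;
    }

    public static void main(String[] args) {
        String page = buildPage(1452);
        // With no cap, every row is found.
        System.out.println("uncapped = " + countRows(capBody(page, 0)));
        // With a cap of half the page, the rows past the cut are missing.
        System.out.println("capped   = " + countRows(capBody(page, page.length() / 2)));
    }
}
```

The capped run reports fewer rows than the uncapped one even though the counting logic is identical, which is the same pattern as the 1350 versus 1452 counts above: the parser only sees what survived the cut.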
    


Source: https://stackoverflow.com/questions/23457942/jsoup-not-downloading-entire-page
