Passing Jsoup a URL that redirects to a URL containing spaces leads to an error. How can this be resolved?

小蘑菇 2020-12-11 10:16

Hello, I have to parse pages whose URI is resolved by a server redirect.

Example:

I have http://www.juventus.com/wps/poc?uri=wcm:oid:91da6dbb-4089-49c0-a1df-3a56

2 Answers
  • 2020-12-11 10:55

    Try this instead:

    String url = "http://www.juventus.com/wps/wcm/connect/JUVECOM-IT/news/primavera%20convocati%20villar%20news%2010agosto2013";
    Document doc = Jsoup.connect(url)
            .data("pragma", "no-cache")
            .get();
    
    Element img = doc.select(".juveShareImage").first();
    
    String imgurl = img.absUrl("src");
    System.out.println(imgurl);
    
  • 2020-12-11 11:12

    You are right, this is the problem. The only solution I see is to handle the redirects manually. I wrote this small recursive method that does it:

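    // Imports needed by the two methods below.
    import java.io.IOException;
    import java.net.HttpURLConnection;
    
    import org.jsoup.Connection.Response;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    
    public static void main(String[] args) throws IOException
    {
        String url = "http://www.juventus.com/wps/poc?uri=wcm:oid:91da6dbb-4089-49c0-a1df-3a56671b7020";
    
        Document document = manualRedirectHandler(url);
    
        Elements elements = document.getElementsByClass("juveShareImage");
    
        // Print the raw src attribute of every matching element.
        for (Element element : elements)
        {
            System.out.println(element.attr("src"));
        }
    
    }
    
    private static Document manualRedirectHandler(String url) throws IOException
    {
        // Encode spaces ourselves and stop Jsoup from following redirects automatically.
        Response response = Jsoup.connect(url.replaceAll(" ", "%20")).followRedirects(false).execute();
        int status = response.statusCode();
    
        // On 301, 302 or 303, recurse with the target given in the Location header.
        if (status == HttpURLConnection.HTTP_MOVED_TEMP || status == HttpURLConnection.HTTP_MOVED_PERM || status == HttpURLConnection.HTTP_SEE_OTHER)
        {
            String redirectUrl = response.header("location");
            System.out.println("Redirect to: " + redirectUrl);
            return manualRedirectHandler(redirectUrl);
        }
    
        // No further redirect: parse the body of the final response.
        return Jsoup.parse(response.body());
    }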
    

    This prints:

    Redirect to: http://www.juventus.com:80/wps/portal/!ut/p/b0/DcdJDoAgEATAF00GXFC8-QqVWwMuJLLEGP2-1q3Y8Mwm4Qk77pATzv_L6-KQgx-09FDeWmpEr6nRThCk36hGq1QnbScqwRMbNuXCHsFLyuTgjpVLjOMHyfCBUg!!/
    Redirect to: http://www.juventus.com/wps/wcm/connect/JUVECOM-IT/news/primavera convocati villar news 10agosto2013?pragma=no-cache
    /resources/images/news/inlined/42d386ef-1443-488d-8f3e-583b1e5eef61.jpg
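
The last line of that output is a site-relative path, because the snippet prints the raw src attribute. As a rough sketch, such a path can be resolved against the final page URL with java.net.URI alone. AbsoluteSrc and toAbsolute are hypothetical names, not part of Jsoup, and only literal spaces are encoded here, so other illegal characters would still make URI.create throw. Within Jsoup, element.absUrl("src") should do the same resolution, but it needs a base URI, which as far as I know Jsoup.parse(String) leaves empty unless the page declares a base href.

```java
import java.net.URI;

/** Hypothetical helper for illustration; not part of Jsoup. */
final class AbsoluteSrc {

    /** Resolves a relative src against the page URL, percent-encoding literal spaces first. */
    static String toAbsolute(String pageUrl, String src) {
        // URI.create rejects raw spaces, so encode them before parsing.
        URI base = URI.create(pageUrl.replace(" ", "%20"));
        return base.resolve(src.replace(" ", "%20")).toString();
    }

    public static void main(String[] args) {
        // Final redirect target and image path taken from the output above.
        String page = "http://www.juventus.com/wps/wcm/connect/JUVECOM-IT/news/"
                + "primavera convocati villar news 10agosto2013?pragma=no-cache";
        String src = "/resources/images/news/inlined/42d386ef-1443-488d-8f3e-583b1e5eef61.jpg";
        System.out.println(toAbsolute(page, src));
    }
}
```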
    

    I also submitted a patch to Jsoup for this redirect problem:

    • https://github.com/jhy/jsoup/pull/354
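
The recursive method in the answer above has no upper bound on the number of redirects and assumes the Location header is always an absolute URL. Below is a minimal sketch of the same idea written as a loop with a hop limit and relative Location handling. RedirectSketch, Hop and the fetch parameter are hypothetical names introduced for illustration. With Jsoup, fetch would presumably wrap Jsoup.connect(url).followRedirects(false).execute() and read statusCode() and header("location"); here it is a plain function so the logic can be tried without network access.

```java
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.function.Function;

/** Hypothetical helper for illustration; not part of Jsoup. */
final class RedirectSketch {

    /** Minimal view of one HTTP response: status code plus Location header (may be null). */
    static final class Hop {
        final int status;
        final String location;

        Hop(int status, String location) {
            this.status = status;
            this.location = location;
        }
    }

    /** Percent-encodes literal spaces only; all other characters are left untouched. */
    static String encodeSpaces(String url) {
        return url.replace(" ", "%20");
    }

    /** Resolves a possibly relative Location value against the URL that returned it. */
    static String resolve(String currentUrl, String location) {
        return URI.create(encodeSpaces(currentUrl)).resolve(encodeSpaces(location)).toString();
    }

    static boolean isRedirect(int status) {
        return status == HttpURLConnection.HTTP_MOVED_PERM   // 301
            || status == HttpURLConnection.HTTP_MOVED_TEMP   // 302
            || status == HttpURLConnection.HTTP_SEE_OTHER    // 303
            || status == 307 || status == 308;               // written as literals
    }

    /** Follows redirects in a loop and returns the final URL; throws after more than maxRedirects hops. */
    static String follow(String startUrl, Function<String, Hop> fetch, int maxRedirects) {
        String url = encodeSpaces(startUrl);
        for (int i = 0; i <= maxRedirects; i++) {
            Hop hop = fetch.apply(url);
            if (!isRedirect(hop.status) || hop.location == null) {
                return url; // final destination reached
            }
            url = resolve(url, hop.location);
        }
        throw new IllegalStateException("More than " + maxRedirects + " redirects from " + startUrl);
    }

    public static void main(String[] args) {
        // Fake server for the demo: the first URL redirects to a path containing a space.
        Function<String, Hop> fake = u -> u.endsWith("/start")
                ? new Hop(302, "/news/a b")
                : new Hop(200, null);
        System.out.println(follow("http://example.com/start", fake, 5));
    }
}
```

The hop limit guards against redirect loops, which the recursive version would follow until the stack overflows.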