Can jsoup handle meta refresh redirects?

一生所求 2020-12-09 21:50

I have a problem using jsoup. I am trying to fetch a document from a URL that redirects to another URL through a meta refresh tag, but this is not working, to ex

2 answers
  • 2020-12-09 22:45

    For better error handling, and to avoid the case-sensitivity problem:

    try
    {
        Document doc = Jsoup.connect("http://www.ibm.com").get();
        // Check each <meta> on its own: Elements.attr() returns the value from the
        // first element that has the attribute, which may not be the refresh tag.
        for (Element meta : doc.select("html head meta"))
        {
            // attr() returns "" (never null) when the attribute is missing
            String lvHttpEquiv = meta.attr("http-equiv");
            if (lvHttpEquiv.toLowerCase().contains("refresh"))
            {
                // content looks like "0; url=http://example.com/"; limit the split
                // to 2 parts so a "=" inside the target URL is kept
                String[] lvContentArray = meta.attr("content").split("=", 2);
                if (lvContentArray.length > 1)
                {
                    doc = Jsoup.connect(lvContentArray[1].trim()).get();
                    break;
                }
            }
        }
    
        // get page title
        return doc.title();
    
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
    
  • 2020-12-09 22:47

    Update (case insensitive and pretty fault tolerant)

    • The content attribute is parsed (almost) according to the spec
    • The first meta refresh tag whose content parses successfully is used

    public static void main(String[] args) throws Exception {
    
        URI uri = URI.create("http://www.amerisourcebergendrug.com");
    
        Document d = Jsoup.connect(uri.toString()).get();
    
        for (Element refresh : d.select("html head meta[http-equiv=refresh]")) {
    
            Matcher m = Pattern.compile("(?si)\\d+;\\s*url=(.+)|\\d+")
                               .matcher(refresh.attr("content"));
    
            // find the first one that is valid
            if (m.matches()) {
                if (m.group(1) != null)
                    d = Jsoup.connect(uri.resolve(m.group(1)).toString()).get();
                break;
            }
        }
    
        // print the URL of the document that was finally loaded
        System.out.println(d.baseUri());
    }
    

    Outputs correctly:

    http://www.amerisourcebergendrug.com/abcdrug/
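
The parsing step can also be pulled out of the network code so it can be checked on its own. The following is only a sketch, not part of the original answer: the class name `MetaRefresh`, the method `refreshTarget`, and the example.com URLs are made up for illustration, and the pattern is a slightly more lenient variant of the one above (it also accepts a comma separator and a quoted URL). It uses only the JDK, so the returned URI can be passed to `Jsoup.connect(...)` afterwards.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaRefresh {

    // delay, then ";" or ",", then "url=" and the target (case-insensitive)
    private static final Pattern WITH_URL = Pattern.compile(
            "\\s*\\d+\\s*[;,]\\s*url\\s*=\\s*(.+?)\\s*",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /**
     * Returns the absolute redirect target for a meta refresh content value,
     * or empty if the value names no target or the target is not a valid URI.
     */
    static Optional<URI> refreshTarget(String baseUri, String content) {
        Matcher m = WITH_URL.matcher(content);
        if (!m.matches()) {
            return Optional.empty(); // e.g. "5": plain reload, nothing to follow
        }
        String target = m.group(1);
        // drop one pair of matching quotes around the URL, if present
        if (target.length() >= 2) {
            char first = target.charAt(0);
            char last = target.charAt(target.length() - 1);
            if ((first == '\'' || first == '"') && first == last) {
                target = target.substring(1, target.length() - 1);
            }
        }
        try {
            // resolve relative targets against the page's own URL
            return Optional.of(URI.create(baseUri).resolve(target));
        } catch (IllegalArgumentException e) {
            return Optional.empty(); // target is not a syntactically valid URI
        }
    }

    public static void main(String[] args) {
        // hypothetical page URL and content value
        System.out.println(refreshTarget("http://example.com/a/b", "0; URL='c?x=1'"));
    }
}
```

Returning an `Optional` keeps the "no target" case (a bare delay such as `5`) separate from a real redirect, so the caller only issues a second request when there is something to follow.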
    

    Old answer:

    Are you sure that it isn't working? For me:

    System.out.println(Jsoup.connect("http://www.ibm.com").get().baseUri());
    

    outputs http://www.ibm.com/us/en/ correctly.
