jsoup - extract text from wikipedia article

猫巷女王i 2021-01-06 12:58

I'm writing some Java code to perform NLP tasks on text from Wikipedia. How can I use jsoup to extract all the text of a Wikipedia article (for example all the

3 Answers
  • 2021-01-06 13:19
    Document doc = Jsoup.connect(url).get();
    // Select every <p> inside the element with class "mw-content-ltr".
    Elements paragraphs = doc.select(".mw-content-ltr p");

    // Print the text of each paragraph. Iterating directly avoids the
    // NullPointerException the first()/last() loop hits on an empty result.
    for (Element p : paragraphs) {
        System.out.println(p.text());
    }
    
  • 2021-01-06 13:27
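    // timeout() only configures the connection; get() performs the request and returns the Document.
    Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Boston").timeout(5000).get();

    // Select by id ("#iamID" is a placeholder). select() returns Elements, so take the first match.
    Element iamcontaningIDofintendedTAG = doc.select("#iamID").first();

    System.out.println(iamcontaningIDofintendedTAG.toString());


    OR

    // Select by class (".iamCLASS" is a placeholder). This keeps every matching element.
    Elements iamcontaningCLASSofintendedTAG = doc.select(".iamCLASS");

    System.out.println(iamcontaningCLASSofintendedTAG.toString());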
    
  • 2021-01-06 13:32
    Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Boston").get();
    // first() returns null if the page has no <div id="content">.
    Element contentDiv = doc.select("div[id=content]").first();
    String html = contentDiv.toString(); // The result, as HTML markup
    

    This returns the content as formatted HTML, of course. If you want the "raw" text instead, you can filter the result with Jsoup.clean or call contentDiv.text().
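
To illustrate the text() route mentioned above, here is a minimal sketch that collects the plain text of each paragraph inside the content div. The class name WikiTextSketch, the helper paragraphText, and the inline HTML string are assumptions made for illustration; the inline HTML stands in for a fetched page so the example runs without network access, and a real Wikipedia page may be structured differently.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class WikiTextSketch {

    // Hypothetical helper: returns the text of every <p> inside <div id="content">,
    // one paragraph per line, or an empty string if that div is missing.
    static String paragraphText(Document doc) {
        Element contentDiv = doc.select("div[id=content]").first();
        if (contentDiv == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (Element p : contentDiv.select("p")) {
            if (sb.length() > 0) {
                sb.append('\n');
            }
            // text() drops the tags and keeps only the readable text.
            sb.append(p.text());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed sample markup, not real Wikipedia output.
        String html = "<div id=\"content\"><h1>Boston</h1>"
                + "<p>Boston is a <b>city</b>.</p><p>It has a harbor.</p></div>";
        Document doc = Jsoup.parse(html);
        System.out.println(paragraphText(doc));
    }
}
```

To run this against a live page, the Document could instead come from Jsoup.connect(url).get() as in the answers above. Restricting the selection to p elements skips headings and other non-paragraph text, which may or may not be what an NLP pipeline wants.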
