jsoup - extract text from wikipedia article

Posted by 久未见 on 2019-11-30 19:49:58

Question


I'm writing Java code to run NLP tasks on text taken from Wikipedia. How can I use jsoup to extract all the text of a Wikipedia article (for example, all the text at http://en.wikipedia.org/wiki/Boston)?


Answer 1:


Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Boston").get();
// Select the element that wraps the article content
Element contentDiv = doc.select("div[id=content]").first();
String result = contentDiv.toString(); // The result: the element's HTML markup

This returns the content with its HTML markup intact. If you want the "raw" text instead, either filter the result with Jsoup.clean or call contentDiv.text().
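To make the difference between the formatted and the "raw" result concrete, here is a minimal sketch that parses an inline HTML string instead of fetching the live page. The class name ContentTextDemo, its helper methods, and the sample markup are illustrative assumptions and not part of the original answer; real Wikipedia markup is far larger.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class ContentTextDemo {
    // Hypothetical stand-in for a downloaded article page.
    static final String SAMPLE_HTML =
        "<html><body><div id=\"content\">"
        + "<p>Boston is the <b>capital</b> of Massachusetts.</p>"
        + "</div></body></html>";

    // Returns the markup of the #content element, tags included.
    static String formatted(String html) {
        Element content = Jsoup.parse(html).getElementById("content");
        return content == null ? "" : content.outerHtml();
    }

    // Returns only the text of the #content element, with tags stripped.
    static String plainText(String html) {
        Element content = Jsoup.parse(html).getElementById("content");
        return content == null ? "" : content.text();
    }

    public static void main(String[] args) {
        System.out.println(formatted(SAMPLE_HTML));
        System.out.println(plainText(SAMPLE_HTML));
    }
}
```

Both helpers return an empty string when no element with the id content exists, which avoids a NullPointerException if the page layout differs from what the selector expects.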




Answer 2:


Document doc = Jsoup.connect(url).get();
// Select every paragraph inside the article body
Elements paragraphs = doc.select(".mw-content-ltr p");

// Print the text of each paragraph in document order;
// the loop simply does nothing if no paragraph matched
for (Element p : paragraphs) {
    System.out.println(p.text());
}
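For NLP work it is often handier to collect the paragraph texts than to print them. The following is a sketch under stated assumptions: the class ParagraphTextDemo, the method paragraphTexts, and the sample markup are hypothetical, and the HTML is parsed from a string so the example runs without network access.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class ParagraphTextDemo {
    // Hypothetical miniature of an article body plus unrelated footer text.
    static final String SAMPLE_HTML =
        "<div class=\"mw-content-ltr\">"
        + "<p>First paragraph.</p>"
        + "<p></p>"
        + "<p>Second <a href=\"#\">linked</a> paragraph.</p>"
        + "</div>"
        + "<p>Footer text.</p>";

    // Collects the text of every element matching cssQuery, skipping empty ones.
    static List<String> paragraphTexts(String html, String cssQuery) {
        List<String> texts = new ArrayList<>();
        for (Element p : Jsoup.parse(html).select(cssQuery)) {
            String text = p.text();
            if (!text.isEmpty()) {
                texts.add(text);
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        List<String> texts = paragraphTexts(SAMPLE_HTML, ".mw-content-ltr p");
        // Join the paragraphs into one string, one paragraph per line.
        System.out.println(String.join("\n", texts));
    }
}
```

Restricting the selector to paragraphs inside .mw-content-ltr leaves out the footer paragraph, and skipping empty strings drops placeholder paragraphs that carry no text.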



Answer 3:


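// timeout(...) only configures the connection; get() performs the request
Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Boston").timeout(5000).get();

// Select by id ("iamID" is a placeholder); select() returns Elements, so take the first match
Element iamcontaningIDofintendedTAG = doc.select("#iamID").first();

System.out.println(iamcontaningIDofintendedTAG.toString());

OR

// Select by class ("iamCLASS" is a placeholder); this may match several elements
Elements iamcontaningCLASSofintendedTAG = doc.select(".iamCLASS");

System.out.println(iamcontaningCLASSofintendedTAG.toString());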
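As a rough illustration of the id and class selectors above, this sketch runs them against an inline HTML string. The class SelectorDemo, its methods, and the id and class names in the sample markup are assumptions chosen for the example. It assumes a jsoup version that provides selectFirst, which returns null when nothing matches.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class SelectorDemo {
    // Hypothetical markup: one element with an id, two sharing a class.
    static final String SAMPLE_HTML =
        "<div id=\"firstHeading\">Boston</div>"
        + "<span class=\"note\">one</span><span class=\"note\">two</span>";

    // Text of the first element with the given id, or "" if none exists.
    static String textById(String html, String id) {
        Element match = Jsoup.parse(html).selectFirst("#" + id);
        return match == null ? "" : match.text();
    }

    // Number of elements that carry the given class.
    static int countByClass(String html, String className) {
        Elements matches = Jsoup.parse(html).select("." + className);
        return matches.size();
    }

    public static void main(String[] args) {
        System.out.println(textById(SAMPLE_HTML, "firstHeading"));
        System.out.println(countByClass(SAMPLE_HTML, "note"));
    }
}
```

An id selector is expected to match at most one element, so a single Element with a null check fits; a class selector can match many, so it is kept as Elements.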


Source: https://stackoverflow.com/questions/9151075/jsoup-extract-text-from-wikipedia-article
