Extracting text with Jsoup

 ̄綄美尐妖づ posted on 2020-01-25 11:44:06

Question


I am trying to get information from the following page: http://fantasynews.cbssports.com/fantasyfootball/players/updates/187741

I need to get separate strings for each of these items:

  1. News Title
  2. News
  3. Analysis

Right now I am able to get information from the whole table using:

 doc = Jsoup.connect("http://fantasynews.cbssports.com/fantasyfootball/players/updates/" + playerId).timeout(30000).get();
 Element title = doc.select("[id*=newsPage1]").first(); 

But the result of this is all of the articles run together.

Can anyone advise?

Thanks, Josh


Answer 1:


You need a more specific CSS selector to pick out each article, and then a regular expression to split the article text into its three parts. Maybe something like:

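import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class PlayerNews {
  public static void main(String[] args) {
    // Groups: 1 = title, 2 = text after "News:", 3 = text after "Analysis:".
    // \p{Zs} matches any Unicode space separator, including a non-breaking space.
    Pattern pat = Pattern.compile("(.*)News:\\p{Zs}(.*)Analysis:\\p{Zs}(.*)");
    Document doc = null;
    try {
      doc = Jsoup.connect("http://fantasynews.cbssports.com/fantasyfootball/players/updates/187741").userAgent("Mozilla").get();
    } catch (IOException e1) {
      e1.printStackTrace();
      System.exit(1); // non-zero exit status signals the failed download
    }

    // Select every <h3> inside a table; the parent of each heading is
    // treated as the container of one complete article.
    Elements titles = doc.select("table h3");
    for (Element title : titles) {
      Element td = title.parent();
      String innerTxt = td.text();
      Matcher mat = pat.matcher(innerTxt);
      if (mat.find()) {
        System.out.println("title = " + mat.group(1));
        System.out.println("news = " + mat.group(2));
        System.out.println("analysis = " + mat.group(3));
      }
    }
  }
}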

I suggest you read up on CSS selectors and the jsoup documentation.
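The regular-expression step in the answer can be tried on its own, without fetching the page. The sketch below uses only the standard library and a made-up sample string standing in for what td.text() might return; the class name NewsSplitter and the method split are hypothetical. It is a variation on the answer's pattern, not the same one: it uses reluctant quantifiers so the text is cut at the first "News:" and "Analysis:" labels, and it accepts any run of whitespace (including non-breaking spaces) around them. It assumes every article contains both labels in that order.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NewsSplitter {
    // Whitespace as seen by \s plus Unicode space separators such as U+00A0.
    private static final String WS = "[\\s\\p{Zs}]*";

    // Group 1 = title, group 2 = news, group 3 = analysis.
    // DOTALL lets '.' match line breaks in case the text spans several lines.
    private static final Pattern ITEM = Pattern.compile(
            "(.*?)" + WS + "News:" + WS + "(.*?)" + WS + "Analysis:" + WS + "(.*)",
            Pattern.DOTALL);

    /** Returns {title, news, analysis}, or null if the labels are missing. */
    static String[] split(String text) {
        Matcher m = ITEM.matcher(text);
        if (!m.matches()) {
            return null;
        }
        return new String[] {
            m.group(1).trim(), m.group(2).trim(), m.group(3).trim()
        };
    }

    public static void main(String[] args) {
        // Invented sample text; the real page content may look different.
        String sample = "Player X questionable News: Player X missed practice. "
                + "Analysis: Have a backup ready.";
        String[] parts = split(sample);
        if (parts == null) {
            System.out.println("Text did not contain both labels.");
            return;
        }
        System.out.println("title = " + parts[0]);
        System.out.println("news = " + parts[1]);
        System.out.println("analysis = " + parts[2]);
    }
}
```

Returning null for text without both labels keeps the caller in control of how to handle articles that lack an analysis section, which the original loop silently skips.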



Source: https://stackoverflow.com/questions/16415937/extracting-text-with-jsoup
