Question
I am trying to get information from the following page: http://fantasynews.cbssports.com/fantasyfootball/players/updates/187741
I need to get separate strings for each of these items:
- News Title
- News
- Analysis
Right now I am able to get information from the whole table using:
doc = Jsoup.connect("http://fantasynews.cbssports.com/fantasyfootball/players/updates/" + playerId).timeout(30000).get();
Element title = doc.select("[id*=newsPage1]").first();
But the result of this is all of the articles run together.
Can anyone advise?
Thanks Josh
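Answer 1:
You need a more specific CSS selector to pick out each article, and then some way to split each article's text into its parts. Maybe something like this, which selects each headline's parent cell and splits its text with a regular expression: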
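import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class PlayerNews {
    public static void main(String[] args) {
        // Group 1 = title, group 2 = news, group 3 = analysis.
        // \p{Zs} matches any Unicode space separator, including a non-breaking space.
        Pattern pat = Pattern.compile("(.*)News:\\p{Zs}(.*)Analysis:\\p{Zs}(.*)");
        Document doc = null;
        try {
            doc = Jsoup.connect("http://fantasynews.cbssports.com/fantasyfootball/players/updates/187741").userAgent("Mozilla").get();
        } catch (IOException e1) {
            e1.printStackTrace();
            System.exit(1); // non-zero exit status signals the failure
        }
        // Each article's headline is an <h3> inside the table.
        Elements titles = doc.select("table h3");
        for (Element title : titles) {
            // The parent cell holds the headline plus the news and analysis text.
            Element td = title.parent();
            String innerTxt = td.text();
            Matcher mat = pat.matcher(innerTxt);
            if (mat.find()) {
                System.out.println("title = " + mat.group(1).trim());
                System.out.println("news = " + mat.group(2).trim());
                System.out.println("analysis = " + mat.group(3).trim());
            }
        }
    }
}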
I suggest you look into CSS selectors and the jsoup documentation.
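If you want to try the splitting step on its own, without hitting the site, here is a minimal sketch that uses only the standard library. It assumes each article's text has the shape "title News: ... Analysis: ...", as in the answer above. The class name `NewsSplitter`, the method `split`, and the sample string are made up for illustration and are not taken from the page. Unlike the pattern above, it uses lazy groups, so the split happens at the first "News:" label rather than the last one.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits one article's flattened text into its three parts.
class NewsSplitter {
    // [\s\p{Zs}]* skips ordinary whitespace and Unicode space separators
    // (such as a non-breaking space) around the labels.
    // DOTALL lets '.' match line breaks inside the article body.
    private static final Pattern ITEM = Pattern.compile(
            "(.*?)[\\s\\p{Zs}]*News:[\\s\\p{Zs}]*(.*?)[\\s\\p{Zs}]*Analysis:[\\s\\p{Zs}]*(.*)",
            Pattern.DOTALL);

    // Returns {title, news, analysis}, or empty if either label is missing.
    public static Optional<String[]> split(String text) {
        Matcher m = ITEM.matcher(text);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new String[] {
                m.group(1).trim(), m.group(2).trim(), m.group(3).trim()
        });
    }

    public static void main(String[] args) {
        // Invented sample text, standing in for td.text() from the answer above.
        String sample = "Player X questionable for Sunday "
                + "News: Player X missed practice on Friday. "
                + "Analysis: Monitor his status before kickoff.";
        split(sample).ifPresent(parts -> {
            System.out.println("title = " + parts[0]);
            System.out.println("news = " + parts[1]);
            System.out.println("analysis = " + parts[2]);
        });
    }
}
```

In the jsoup loop you would pass `td.text()` to `split` and skip any cell for which it returns an empty `Optional`.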
Source: https://stackoverflow.com/questions/16415937/extracting-text-with-jsoup