Question
Any idea how to scrape a web page that contains multiple tables? I am connecting to the web page.
The markup below is one table, but the same web page contains several tables.
I also can't figure out how to read the table.
HTML:
<p><a href="/fantasy_news/feature/?ID=49818"><strong>Top 300 Overall Fantasy Rankings</strong></a></p>
<div class="storyStats">
<table>
<thead>
<tr>
<th>RANK</th>
<th>CENTRES</th>
<th>TEAM</th>
<th>POS</th>
<th>GP</th>
<th>G</th>
<th>A</th>
<th>PTS</th>
<th>+/-</th>
<th>PIM</th>
<th>PPP</th>
</tr>
</thead>
<tbody>
<tr class="bg1">
<td>1.</td>
<td><a href="/nhl/teams/players/?name=steven+stamkos">Steven Stamkos</a></td>
<td>Tampa Bay</td>
<td>C</td>
<td align="right">81</td>
<td align="right">50</td>
<td align="right">51</td>
<td align="right">101</td>
<td align="right">-2</td>
<td align="right">56</td>
<td align="right">38</td>
</tr>
Iterator<Element> trSIter = doc.select("table")
        .iterator();
while (trSIter.hasNext()) {
    Element trEl = trSIter.next().child(0);
    Elements tdEls = trEl.children();
    Iterator<Element> tdIter = tdEls.select("tr").iterator();
    System.out.println("><1><><"+tdIter);
    boolean firstRow = true;
    while (tdIter.hasNext()) {
        Element tr = (Element) tdIter.next();
        while (tdIter.hasNext()) {
            int tdCount = 1;
            Element tdEl = tdIter.next();
            //name = tdEl.getElementsByClass("playertablePlayerName").get(0).text();
            Elements tdsEls = tdEl.select("td");
            System.out.println("><2><><"+tdsEls);
            Iterator<Element> columnIt = tdsEls.iterator();
            while (columnIt.hasNext()) {
                Element column = columnIt.next();
                switch (tdCount++) {
                case 1:
                    name =column.select("a").first().text();
                    break;
                case 2:
                    stat2 = Double.parseDouble(column.text());
                    break;
                case 3:
                    stat3 = Double.parseDouble(column.text());
                    break;
                case 4:
                    stat4 = Double.parseDouble(column.text());
                    break;
                case 5:
                    stat5 = Double.parseDouble(column.text());
                    break;
                case 6:
                    stat6 = Double.parseDouble(column.text());
                    break;
                case 7:
                    stat7 = Double.parseDouble(column.text());
                    break;
                case 8:
                    stat8 = Double.parseDouble(column.text());
                    break;
Answer 1:
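This should get you started. Each table has a blank record that you will have to account for; the size check in the code below skips rows that have no td cells, such as the header row. You will also have to figure out which stats you want and where they are in the table. You read a stat with tds.get(index), where the index is zero-based. Let me know how it works for you.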
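import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document doc = Jsoup.connect("http://www.tsn.ca/fantasy_news/feature/?ID=49815").get();
// Visit every table inside a div with class "storyStats".
for (Element table : doc.select("div.storyStats").select("table")) {
    for (Element row : table.select("tr")) {
        Elements tds = row.select("td");
        // Skip header and blank rows; get(5) needs at least six cells.
        if (tds.size() > 5) {
            // Column 1 is the player name, column 5 is goals (G).
            System.out.println(tds.get(1).text() + ":" + tds.get(5).text());
        }
    }
}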
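Hard-coded positions such as tds.get(5) break if the site reorders its columns, so one option is to look columns up by header text instead. The sketch below is plain Java and works on cell text only. It assumes the strings have already been pulled out of jsoup, for example with row.select("th").eachText() for the header row and row.select("td").eachText() for a data row. The class name StatRow and its two methods are hypothetical helpers for illustration, not part of jsoup.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StatRow {
    // Map each header label (for example "PTS") to its column position.
    public static Map<String, Integer> indexByHeader(List<String> headers) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < headers.size(); i++) {
            index.put(headers.get(i).trim(), i);
        }
        return index;
    }

    // Return the numeric value of the named column, or null when the header is
    // unknown, the row is too short, the cell is blank, or it is not a number.
    public static Double stat(Map<String, Integer> index, List<String> cells, String header) {
        Integer col = index.get(header);
        if (col == null || col >= cells.size()) {
            return null;
        }
        String text = cells.get(col).trim();
        if (text.isEmpty()) {
            return null;
        }
        try {
            return Double.valueOf(text);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Header and row text taken from the table in the question.
        List<String> headers = Arrays.asList(
                "RANK", "CENTRES", "TEAM", "POS", "GP", "G", "A", "PTS", "+/-", "PIM", "PPP");
        List<String> cells = Arrays.asList(
                "1.", "Steven Stamkos", "Tampa Bay", "C", "81", "50", "51", "101", "-2", "56", "38");

        Map<String, Integer> index = indexByHeader(headers);
        String name = cells.get(index.get("CENTRES"));
        System.out.println(name + ": " + stat(index, cells, "PTS"));
    }
}
```

Returning null for blank or non-numeric cells lets the caller skip the blank record mentioned above without catching exceptions in the row loop.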
Answer 2:
With the code below, parsing the tables from the HTML seems to work without problems.
public class JsoupActivity extends Activity {
    Document doc;
    myHttpGet _myGet;

    @Override
    public void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.main);
        final TextView tv = (TextView)findViewById(R.id.tv1);
        _myGet = new myHttpGet();
        try {
            doc = _myGet.doHttpGet();
            Elements tdsEls = doc.getElementsByClass("storyStats");
            //tv.setText(tdsEls.get(0).child(0).text());
            tv.setText(String.valueOf(tdsEls.first().children().size()));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private class myHttpGet {
        Document myDom;
        Connection myConnection;
        Response myResponse;

        public Document doHttpGet() {
            myConnection = Jsoup.connect("http://www.tsn.ca/fantasy_news/feature/?ID=49815");
            try {
                myResponse = myConnection.execute();
                try {
                    myDom = myResponse.parse();
                    return myDom;
                } catch (IOException e) {
                    Log.e("napster","Parse Error");
                }
            } catch (IOException e) {
                Log.e("napster","HTTP Error");
            }
            return myDom;
        }
    }
}
The code shows 5 in the TextView, which is the number of tables inside the first element with the class storyStats in that HTML. To go on and parse the contents of the tables, you can assign them to another Elements object and parse that:
Elements es = tdsEls.first().children();
Anderson's answer shows how to parse it for data. Hope that helps.
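Once every table has been read, the results still have to be kept apart per table. A minimal sketch of one way to do that is shown here, in plain Java on cell text only. It assumes each table has been converted to a list of rows and each row to a list of cell strings (for example with row.select("th, td").eachText() in jsoup), and that the second header cell, such as CENTRES, names the group of players in that table. The class TableGroups, the method namesByGroup, and the "GROUP B" / "Example Player" values are placeholders for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TableGroups {
    // Each table is a list of rows; each row is a list of cell texts.
    // The first row of a table is treated as its header row.
    public static Map<String, List<String>> namesByGroup(List<List<List<String>>> tables) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (List<List<String>> table : tables) {
            if (table.isEmpty() || table.get(0).size() < 2) {
                continue; // no usable header row
            }
            String group = table.get(0).get(1).trim(); // e.g. "CENTRES"
            List<String> names = groups.computeIfAbsent(group, k -> new ArrayList<>());
            for (List<String> row : table.subList(1, table.size())) {
                // Skip blank or short rows; column 1 holds the player name.
                if (row.size() > 1 && !row.get(1).trim().isEmpty()) {
                    names.add(row.get(1).trim());
                }
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<List<List<String>>> tables = Arrays.asList(
                Arrays.asList(
                        Arrays.asList("RANK", "CENTRES", "TEAM"),
                        Arrays.asList("1.", "Steven Stamkos", "Tampa Bay"),
                        Arrays.asList("", "", "")),          // blank record
                Arrays.asList(
                        Arrays.asList("RANK", "GROUP B", "TEAM"),   // placeholder table
                        Arrays.asList("1.", "Example Player", "Example Team")));

        for (Map.Entry<String, List<String>> entry : namesByGroup(tables).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

A LinkedHashMap keeps the groups in the order the tables appear on the page, which is usually what you want when printing them back out.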
Source: https://stackoverflow.com/questions/9190793/how-to-parse-a-page-with-multiple-tables