How to parse a page with multiple tables

Posted by こ雲淡風輕ζ on 2020-01-03 04:52:06

Question


Any idea how to scrape a web page that contains multiple tables? I am connecting to the web page.

Below is one of the tables; the same web page contains several more.

I also can't figure out how to read the table.

HTML:

    <p><a href="/fantasy_news/feature/?ID=49818"><strong>Top 300 Overall Fantasy Rankings</strong></a></p> 
<div class="storyStats"> 
<table> 
<thead> 
<tr> 
<th>RANK</th> 
<th>CENTRES</th> 
<th>TEAM</th> 
<th>POS</th> 
<th>GP</th> 
<th>G</th> 
<th>A</th> 
<th>PTS</th> 
<th>+/-</th> 
<th>PIM</th> 
<th>PPP</th> 
</tr> 
</thead> 
<tbody> 
<tr class="bg1"> 
<td>1.</td> 
<td><a href="/nhl/teams/players/?name=steven+stamkos">Steven&nbsp;Stamkos</a></td> 

<td>Tampa Bay</td> 
<td>C</td> 
<td align="right">81</td> 
<td align="right">50</td> 
<td align="right">51</td> 
<td align="right">101</td> 
<td align="right">-2</td> 
<td align="right">56</td> 
<td align="right">38</td> 
</tr> 


Iterator<Element> trSIter = doc.select("table")
            .iterator();
    while (trSIter.hasNext()) {
        Element trEl = trSIter.next().child(0);
        Elements tdEls = trEl.children();
        Iterator<Element> tdIter = tdEls.select("tr").iterator();
        System.out.println("><1><><"+tdIter);
        boolean firstRow = true;
        while (tdIter.hasNext()) {

            Element tr = (Element) tdIter.next();


            while (tdIter.hasNext()) {
                int tdCount = 1;
                Element tdEl = tdIter.next();
                //name = tdEl.getElementsByClass("playertablePlayerName").get(0).text();

                Elements tdsEls = tdEl.select("td");
                System.out.println("><2><><"+tdsEls);
                Iterator<Element> columnIt = tdsEls.iterator();

                while (columnIt.hasNext()) {

                    Element column = columnIt.next();
                    switch (tdCount++) {
                    case 1:
                        name =column.select("a").first().text();

                        break;
                    case 2:
                        stat2 = Double.parseDouble(column.text());
                        break;
                    case 3:
                        stat3 = Double.parseDouble(column.text());
                        break;
                    case 4:
                        stat4 = Double.parseDouble(column.text());
                        break;
                    case 5:
                        stat5 = Double.parseDouble(column.text());
                        break;
                    case 6:
                        stat6 = Double.parseDouble(column.text());
                        break;
                    case 7:
                        stat7 = Double.parseDouble(column.text());
                        break;
                    case 8:
                        stat8 = Double.parseDouble(column.text());
                        break;

Answer 1:


This should get you started. Each table contains a blank row that you will have to account for. You will also have to decide which stats you want and find their column positions in the table; you read each one with tds.get(index). Let me know how it works for you.

    Document doc = Jsoup.connect("http://www.tsn.ca/fantasy_news/feature/?ID=49815").get();

    for (Element table : doc.select("div.storyStats").select("table")) {
        for (Element row : table.select("tr")) {
            Elements tds = row.select("td");
            if (tds.size() > 5) { // skip header rows and rows too short for tds.get(5)
                System.out.println(tds.get(1).text() + ":" + tds.get(5).text());
            }
        }
    }
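
Once you know which columns you want, one way to replace the long switch statement from the question is to map the cell texts of a row onto a small typed object. The following is only a minimal sketch: the class PlayerRow, its fields, and the helpers clean and fromCells are hypothetical names, and the column order is assumed to match the header shown in the question (RANK, CENTRES, TEAM, POS, GP, G, A, PTS, +/-, PIM, PPP). It uses plain Java and expects the cell texts as a list of strings.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical holder for one table row; class and field names are illustrative. */
final class PlayerRow {
    final int rank;
    final String name;
    final String team;
    final String position;
    final int goals;
    final int assists;
    final int points;

    private PlayerRow(int rank, String name, String team, String position,
                      int goals, int assists, int points) {
        this.rank = rank;
        this.name = name;
        this.team = team;
        this.position = position;
        this.goals = goals;
        this.assists = assists;
        this.points = points;
    }

    /** Replaces non-breaking spaces (from &nbsp;) with plain spaces and trims. */
    static String clean(String cell) {
        return cell.replace('\u00a0', ' ').trim();
    }

    /**
     * Assumes the column order of the header in the question:
     * RANK, CENTRES, TEAM, POS, GP, G, A, PTS, +/-, PIM, PPP.
     * Returns null for header, blank, or otherwise unparseable rows.
     */
    static PlayerRow fromCells(List<String> cells) {
        if (cells == null || cells.size() < 8) {
            return null; // too short to contain PTS at index 7
        }
        try {
            String rankText = clean(cells.get(0)).replace(".", ""); // "1." -> "1"
            return new PlayerRow(
                    Integer.parseInt(rankText),
                    clean(cells.get(1)),
                    clean(cells.get(2)),
                    clean(cells.get(3)),
                    Integer.parseInt(clean(cells.get(5))),
                    Integer.parseInt(clean(cells.get(6))),
                    Integer.parseInt(clean(cells.get(7))));
        } catch (NumberFormatException e) {
            return null; // a cell that should be numeric was not
        }
    }

    public static void main(String[] args) {
        // Cell texts of the sample row from the question.
        List<String> cells = Arrays.asList("1.", "Steven\u00a0Stamkos", "Tampa Bay", "C",
                "81", "50", "51", "101", "-2", "56", "38");
        PlayerRow row = PlayerRow.fromCells(cells);
        // prints: Steven Stamkos (Tampa Bay): 101 PTS
        System.out.println(row.name + " (" + row.team + "): " + row.points + " PTS");
    }
}
```

Inside the row loop above you could collect td.text() for each element of tds into a list and pass it to fromCells. Because it returns null for header and blank rows, the blank record in each table is skipped with a simple null check.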



Answer 2:


With the code below, parsing the tables from the HTML seems to work without any problem.

public class JsoupActivity extends Activity {
    Document doc;
    myHttpGet _myGet;
    @Override
    public void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.main);
        final TextView tv = (TextView)findViewById(R.id.tv1);
        _myGet = new myHttpGet();
        try {
            doc = _myGet.doHttpGet();
            Elements tdsEls = doc.getElementsByClass("storyStats");
            //tv.setText(tdsEls.get(0).child(0).text());
            tv.setText(String.valueOf(tdsEls.first().children().size()));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private class myHttpGet {
        Document myDom;
        Connection myConnection;
        Response myResponse;
        public Document doHttpGet() {
            myConnection = Jsoup.connect("http://www.tsn.ca/fantasy_news/feature/?ID=49815");
            try {
                myResponse = myConnection.execute();
                try {
                    myDom = myResponse.parse();
                    return myDom;
                } catch (IOException e) {
                    Log.e("napster","Parse Error");
                }
            } catch (IOException e) {
                Log.e("napster","HTTP Error");
            }
            return myDom;
        }
    }

}

The code displays 5 in the TextView, which is the number of tables that HTML has under the class storyStats. To go on and parse the contents of the tables, you can assign them to another Elements object and parse that.

Elements es = tdsEls.first().children();

Anderson's answer shows how to parse it for data. Hope that helps.



Source: https://stackoverflow.com/questions/9190793/how-to-parse-a-page-with-multiple-tables
