How to retrieve data from the following HTML document structure in R

Question

I am trying to retrieve tabular data from an HTML document stored on my local drive. I am stuck on what to do after parsing, i.e. how to retrieve the specific nodes where the data is stored.

<thead>
    <tr>
        <th></th>
        <th data-field="position"><a>Rank</a></th>
        <th data-field="name"><a>Brand</a></th>
        <th data-field="brandValue"><a>Brand Value</a></th>
        <th data-field="oneYearValueChange"><a>1-Yr Value Change</a></th>
        <th data-field="revenue"><a>Brand Revenue</a></th>
        <th data-field="advertising"><a>Company Advertising</a></th>
        <th data-field="industry"><a>Industry</a></th>
    </tr>
</thead>

This is the first part of the HTML I want to retrieve; it is the header for my tabular data.

<tbody id="list-table-body">

    <tr class="data">
            <td class="image"><a href="http://www.forbes.com/companies/apple/" class="exit_trigger_set"><img src="./Forbes_files/apple_100x100.jpg" alt=""></a></td>
            <td class="rank">#1 </td>
            <td class="name"><a href="http://www.forbes.com/companies/apple/" class="exit_trigger_set">Apple</a></td>
            <td>$145.3 B</td>
            <td>17%</td>
            <td>$182.3 B</td>
            <td>$1.2 B</td>
            <td>Technology</td>
    </tr>

    <tr class="data">
            <td class="image"><a href="http://www.forbes.com/companies/microsoft/" class="exit_trigger_set"><img src="./Forbes_files/microsoft_100x100.jpg" alt=""></a></td>
            <td class="rank">#2 </td>
            <td class="name"><a href="http://www.forbes.com/companies/microsoft/" class="exit_trigger_set">Microsoft</a></td>
            <td>$69.3 B</td>
            <td>10%</td>
            <td>$93.3 B</td>
            <td>$2.3 B</td>
            <td>Technology</td>
    </tr>

    <tr class="data">
            <td class="image"><a href="http://www.forbes.com/companies/google/" class="exit_trigger_set"><img src="./Forbes_files/google_100x100.jpg" alt=""></a></td>
            <td class="rank">#3 </td>
            <td class="name"><a href="http://www.forbes.com/companies/google/" class="exit_trigger_set">Google</a></td>
            <td>$65.6 B</td>
            <td>16%</td>
            <td>$61.8 B</td>
            <td>$3 B</td>
            <td>Technology</td>
    </tr>

This portion of the HTML contains the data, i.e. rank, name, and the other statistics. How can I retrieve both the header and the data shown above into a data frame? Is it also possible to retrieve the images if I want to?

Edit: I looked a little harder and retrieved the rows whose class contains "data" using xpathSApply. I then removed the "\t" characters and replaced each "\n" with a comma, which left me with a character vector.

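library(XML)  # provides htmlParse() and xpathSApply()

fb1 <- htmlParse("forbes.html")
# Text content of every <tr> whose class attribute contains "data"
fb2 <- xpathSApply(fb1, "//tr[contains(@class,'data')]", xmlValue)
k3 <- gsub('\\t', '', fb2)   # drop tabs
k3 <- gsub('\\n', ',', k3)   # turn newlines into commas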

Now k3 is a character vector with my data:

> k3[1:5]
[1] ",#1 ,Apple,$145.3 B,17%,$182.3 B,$1.2 B,Technology,"  
[2] ",#2 ,Microsoft,$69.3 B,10%,$93.3 B,$2.3 B,Technology,"
[3] ",#3 ,Google,$65.6 B,16%,$61.8 B,$3 B,Technology,"     
[4] ",#4 ,Coca-Cola,$56 B,0%,$23.1 B,$3.5 B,Beverages,"    
[5] ",#5 ,IBM,$49.8 B,4%,$92.8 B,$1.3 B,Technology,"

How do I convert it to a data frame? I also want the header at the top, but in this k3 character vector the header is at the bottom.

> tail(k3)
[1] ",#96 ,Lancome,$6.2 B,-2%,$4.5 B,-,Consumer Packaged Goods,"                      
[2] ",#97 ,KIA Motors,$6.2 B,-11%,$42.9 B,$992 M,Automotive,"                         
[3] ",#98 ,Sprite,$6.2 B,2%,$3.7 B,$3.5 B,Beverages,"                                 
[4] ",#99 ,MTV,$6.2 B,6%,$3.4 B,$1 B,Media,"                                          
[5] ",#100 ,Estee Lauder,$6.1 B,4%,$4.5 B,$2.8 B,Consumer Packaged Goods,"            
[6] ",[RANK],[NAME],[BRAND_VALUE],[ONEYEARCHANGE],[REVENUE],[ADVERTISING],[INDUSTRY],

The Rank, Name part was supposed to be the header.
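For what it is worth, here is a rough base-R sketch of one way the k3 strings might be turned into a data frame. It is not a confirmed solution. The three sample strings are typed in from the output above (the closing quote on the placeholder row is assumed, since the printed output is cut off), and the column names are taken from the <th> text in the header HTML.

```r
# Sample of k3, copied by hand from the printed output so the sketch is
# self-contained. The last entry is the placeholder row; its exact ending is
# an assumption because the printed output above is truncated.
k3 <- c(
  ",#1 ,Apple,$145.3 B,17%,$182.3 B,$1.2 B,Technology,",
  ",#2 ,Microsoft,$69.3 B,10%,$93.3 B,$2.3 B,Technology,",
  ",[RANK],[NAME],[BRAND_VALUE],[ONEYEARCHANGE],[REVENUE],[ADVERTISING],[INDUSTRY],"
)

# Drop the placeholder row (the one containing the literal text "[RANK]").
is_template <- grepl("[RANK]", k3, fixed = TRUE)
rows <- k3[!is_template]

# Strip the leading and trailing comma, then split every string on commas.
rows <- gsub("^,|,$", "", rows)
fields <- strsplit(rows, ",", fixed = TRUE)

# Bind the per-row vectors into a matrix, trimming stray spaces ("#1 " -> "#1").
df <- as.data.frame(do.call(rbind, lapply(fields, trimws)),
                    stringsAsFactors = FALSE)

# Column names taken from the header cells shown in the <thead> HTML.
names(df) <- c("Rank", "Brand", "Brand Value", "1-Yr Value Change",
               "Brand Revenue", "Company Advertising", "Industry")
```

This only works if every row splits into the same number of fields; a brand name that itself contains a comma would break the split.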

I would welcome any suggestions to improve my code, or alternatives.
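One alternative I can imagine, sketched here under assumptions rather than tested against the real file, is to skip the gsub step and pull each cell out with its own XPath query, so that the header comes from the <thead> and ends up on top. The inline HTML is a trimmed two-row, three-column copy of the structure above so that the example runs without forbes.html. The names header, rows, df and img_src are only illustrative.

```r
library(XML)

# Trimmed inline copy of the structure shown in the question (an assumption:
# the real file has more columns and 100 rows).
html <- '
<table>
  <thead><tr>
    <th></th>
    <th data-field="position"><a>Rank</a></th>
    <th data-field="name"><a>Brand</a></th>
    <th data-field="brandValue"><a>Brand Value</a></th>
  </tr></thead>
  <tbody id="list-table-body">
    <tr class="data">
      <td class="image"><a href="http://www.forbes.com/companies/apple/"><img src="./Forbes_files/apple_100x100.jpg" alt=""></a></td>
      <td class="rank">#1 </td>
      <td class="name"><a href="http://www.forbes.com/companies/apple/">Apple</a></td>
      <td>$145.3 B</td>
    </tr>
    <tr class="data">
      <td class="image"><a href="http://www.forbes.com/companies/microsoft/"><img src="./Forbes_files/microsoft_100x100.jpg" alt=""></a></td>
      <td class="rank">#2 </td>
      <td class="name"><a href="http://www.forbes.com/companies/microsoft/">Microsoft</a></td>
      <td>$69.3 B</td>
    </tr>
  </tbody>
</table>'

doc <- htmlParse(html, asText = TRUE)

# Header text; requiring data-field skips the empty first <th>.
header <- xpathSApply(doc, "//thead//th[@data-field]", xmlValue)

# One character vector of cell values per data row, skipping the image cell.
rows <- xpathApply(doc, "//tr[@class='data']", function(tr) {
  trimws(xpathSApply(tr, "./td[not(@class='image')]", xmlValue))
})

df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(df) <- header

# Image paths are stored in the src attribute of each <img>.
img_src <- xpathSApply(doc, "//tr[@class='data']/td[@class='image']//img",
                       xmlGetAttr, "src")
```

With the real file, htmlParse("forbes.html") would replace the inline string. Because the placeholder row is printed together with the data rows in k3, it presumably matches the same row query and would still need to be filtered out.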

Source: https://stackoverflow.com/questions/34686048/how-to-retrieve-data-from-the-following-html-document-structure-in-r

Tags

html

dom

web-scraping

html-parsing