Question
I am trying to retrieve tabular data from an HTML document stored on my local drive. I am stuck on what to do after parsing, i.e. how to retrieve specifically those nodes where the data is stored.
<thead>
<tr>
<th></th>
<th data-field="position"><a>Rank</a></th>
<th data-field="name"><a>Brand</a></th>
<th data-field="brandValue"><a>Brand Value</a></th>
<th data-field="oneYearValueChange"><a>1-Yr Value Change</a></th>
<th data-field="revenue"><a>Brand Revenue</a></th>
<th data-field="advertising"><a>Company Advertising</a></th>
<th data-field="industry"><a>Industry</a></th>
</tr>
</thead>
This is the first part of the HTML I want to retrieve; it is the header for my tabular data.
<tbody id="list-table-body">
<tr class="data">
<td class="image"><a href="http://www.forbes.com/companies/apple/" class="exit_trigger_set"><img src="./Forbes_files/apple_100x100.jpg" alt=""></a></td>
<td class="rank">#1 </td>
<td class="name"><a href="http://www.forbes.com/companies/apple/" class="exit_trigger_set">Apple</a></td>
<td>$145.3 B</td>
<td>17%</td>
<td>$182.3 B</td>
<td>$1.2 B</td>
<td>Technology</td>
</tr>
<tr class="data">
<td class="image"><a href="http://www.forbes.com/companies/microsoft/" class="exit_trigger_set"><img src="./Forbes_files/microsoft_100x100.jpg" alt=""></a></td>
<td class="rank">#2 </td>
<td class="name"><a href="http://www.forbes.com/companies/microsoft/" class="exit_trigger_set">Microsoft</a></td>
<td>$69.3 B</td>
<td>10%</td>
<td>$93.3 B</td>
<td>$2.3 B</td>
<td>Technology</td>
</tr>
<tr class="data">
<td class="image"><a href="http://www.forbes.com/companies/google/" class="exit_trigger_set"><img src="./Forbes_files/google_100x100.jpg" alt=""></a></td>
<td class="rank">#3 </td>
<td class="name"><a href="http://www.forbes.com/companies/google/" class="exit_trigger_set">Google</a></td>
<td>$65.6 B</td>
<td>16%</td>
<td>$61.8 B</td>
<td>$3 B</td>
<td>Technology</td>
</tr>
This portion of the HTML contains the data, i.e. rank, name, and the other statistics. How can I retrieve both the header and the data shown above into a data frame? Is it possible to retrieve the images as well if I want to?
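On the image part of the question, here is a minimal sketch, assuming the XML package is installed. It uses a trimmed copy of the rows above as an inline string so that it runs on its own; the names `html`, `doc` and `img_src` are illustrative only. It reads the `src` attribute of each `img` inside the `image` cells.

```r
library(XML)

# Trimmed copy of two rows from the markup above (assumption: the real
# file has the same structure)
html <- '<table><tbody id="list-table-body">
<tr class="data">
<td class="image"><a href="http://www.forbes.com/companies/apple/" class="exit_trigger_set"><img src="./Forbes_files/apple_100x100.jpg" alt=""></a></td>
<td class="rank">#1 </td>
</tr>
<tr class="data">
<td class="image"><a href="http://www.forbes.com/companies/microsoft/" class="exit_trigger_set"><img src="./Forbes_files/microsoft_100x100.jpg" alt=""></a></td>
<td class="rank">#2 </td>
</tr>
</tbody></table>'

doc <- htmlParse(html, asText = TRUE)

# Collect the src attribute of every <img> inside a td of class "image"
img_src <- xpathSApply(doc,
                       "//tr[@class='data']/td[@class='image']//img",
                       xmlGetAttr, "src")
print(img_src)
```

For a page saved to disk these are relative paths, so the image files themselves would presumably sit in the `Forbes_files` folder next to the HTML file.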
Edit: I looked a little harder and retrieved the rows whose class contains "data" using xpathSApply. I then removed "\t" and replaced "\n" with ",", which left me with a character vector.
library(XML)
# Parse the locally saved HTML file
fb1 <- htmlParse("forbes.html")
# Text content of every <tr> whose class contains 'data'
fb2 <- xpathSApply(fb1, "//tr[contains(@class,'data')]", xmlValue)
# Strip tabs, then turn newlines into commas
k3 <- gsub('\\t', '', fb2)
k3 <- gsub('\\n', ',', k3)
Now k3 is a character vector with my data:
> k3[1:5]
[1] ",#1 ,Apple,$145.3 B,17%,$182.3 B,$1.2 B,Technology,"
[2] ",#2 ,Microsoft,$69.3 B,10%,$93.3 B,$2.3 B,Technology,"
[3] ",#3 ,Google,$65.6 B,16%,$61.8 B,$3 B,Technology,"
[4] ",#4 ,Coca-Cola,$56 B,0%,$23.1 B,$3.5 B,Beverages,"
[5] ",#5 ,IBM,$49.8 B,4%,$92.8 B,$1.3 B,Technology,"
How do I convert it to a data frame? Also, I wanted the header at the top, but in this k3 character vector the header is at the bottom.
> tail(k3)
[1] ",#96 ,Lancome,$6.2 B,-2%,$4.5 B,-,Consumer Packaged Goods,"
[2] ",#97 ,KIA Motors,$6.2 B,-11%,$42.9 B,$992 M,Automotive,"
[3] ",#98 ,Sprite,$6.2 B,2%,$3.7 B,$3.5 B,Beverages,"
[4] ",#99 ,MTV,$6.2 B,6%,$3.4 B,$1 B,Media,"
[5] ",#100 ,Estee Lauder,$6.1 B,4%,$4.5 B,$2.8 B,Consumer Packaged Goods,"
[6] ",[RANK],[NAME],[BRAND_VALUE],[ONEYEARCHANGE],[REVENUE],[ADVERTISING],[INDUSTRY],
The row with Rank, Name, etc. was supposed to be the header.
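For reference, one way the comma-joined strings could be turned into a data frame with base R alone is sketched below. It assumes that the last element of k3 is the bracketed header row, as in the tail() output, and that no field contains a comma of its own. The sample vector reuses a few values from the output above, and the names `fields`, `mat`, `header` and `df` are illustrative.

```r
# A few values copied from the k3 output above, with the header row last
k3 <- c(",#1 ,Apple,$145.3 B,17%,$182.3 B,$1.2 B,Technology,",
        ",#2 ,Microsoft,$69.3 B,10%,$93.3 B,$2.3 B,Technology,",
        ",#3 ,Google,$65.6 B,16%,$61.8 B,$3 B,Technology,",
        ",[RANK],[NAME],[BRAND_VALUE],[ONEYEARCHANGE],[REVENUE],[ADVERTISING],[INDUSTRY],")

# Drop the leading and trailing commas left by the newline replacement
k3 <- gsub("^,|,$", "", k3)

# Split every string on commas and stack the pieces into a character matrix
fields <- strsplit(k3, ",", fixed = TRUE)
mat <- do.call(rbind, fields)

# The last row holds the column names; strip the square brackets
header <- gsub("\\[|\\]", "", mat[nrow(mat), ])

# Everything except the last row is data
df <- as.data.frame(mat[-nrow(mat), , drop = FALSE], stringsAsFactors = FALSE)
names(df) <- header

# Turn "#1 " into the integer 1
df$RANK <- as.integer(gsub("[^0-9]", "", df$RANK))
print(df)
```

The remaining columns stay as character strings; values such as "$145.3 B" or "17%" would need their own cleaning before any numeric conversion.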
I would like any suggestions to improve my code, or alternatives as well.
Source: https://stackoverflow.com/questions/34686048/how-to-retrieve-data-from-the-following-html-document-structure-in-r
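One possible alternative, sketched here under the assumption that the XML package is installed, is to skip the string manipulation and read the header cells and the data cells separately with XPath. The inline `html` string is a trimmed copy of the markup above (three columns, two rows), and the names `header`, `rows` and `df` are illustrative.

```r
library(XML)

# Trimmed copy of the header and two data rows from the markup above
html <- '<table>
<thead><tr>
<th></th>
<th data-field="position"><a>Rank</a></th>
<th data-field="name"><a>Brand</a></th>
<th data-field="brandValue"><a>Brand Value</a></th>
</tr></thead>
<tbody id="list-table-body">
<tr class="data">
<td class="image"><a href="http://www.forbes.com/companies/apple/"><img src="./Forbes_files/apple_100x100.jpg" alt=""></a></td>
<td class="rank">#1 </td>
<td class="name"><a href="http://www.forbes.com/companies/apple/">Apple</a></td>
<td>$145.3 B</td>
</tr>
<tr class="data">
<td class="image"><a href="http://www.forbes.com/companies/microsoft/"><img src="./Forbes_files/microsoft_100x100.jpg" alt=""></a></td>
<td class="rank">#2 </td>
<td class="name"><a href="http://www.forbes.com/companies/microsoft/">Microsoft</a></td>
<td>$69.3 B</td>
</tr>
</tbody></table>'

doc <- htmlParse(html, asText = TRUE)

# Header text, skipping the empty <th> that sits above the image column
header <- xpathSApply(doc, "//thead//th[@data-field]", xmlValue)

# For each data row, the text of every cell except the image cell
rows <- xpathApply(doc, "//tr[@class='data']", function(tr) {
  trimws(xpathSApply(tr, "./td[not(@class='image')]", xmlValue))
})

# Stack the rows and attach the header as column names
df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(df) <- header
print(df)
```

Because the header comes from the thead element rather than from a data row, it ends up as the column names directly and no reordering is needed.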