Web Scraping (in R?)

问题

I want to get the names of the companies in the middle column of this page (written in bold in blue), as well as the location indicator of the person who is registering the complaint (e.g. "India, Delhi", written in green). Basically, I want a table (or data frame) with two columns, one for company, and the other for location. Any ideas?

回答1:

You can easily do this using the XML package in R. Here is the code

url = "http://www.consumercomplaints.in/bysubcategory/mobile-service-providers/page/1.html"
doc = htmlTreeParse(url, useInternalNodes = T)

profiles = xpathSApply(doc, "//a[contains(@href, 'profile')]", xmlValue)
profiles = profiles[!(1:length(profiles) %% 2)]

states   = xpathSApply(doc, "//a[contains(@href, 'bystate')]", xmlValue)

回答2:

This to match titles in blue bold, the trick is to open the source code of page and look what is before and after what are you looking for, then you use regex.

preg_match('/>[a-zA-Z0-9]+<\/a><\/h4><\/td>/', $str, $matches);
for($i = 0;$i<sizeof($matches);$i++)
 echo $matches[$i];

You may check this.

来源：https://stackoverflow.com/questions/5830705/web-scraping-in-r

标签

html-parsing

web-scraping

易学教程内所有资源均来自网络或用户发布的内容，如有违反法律规定的内容欢迎反馈！
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!