问题
I want to get costs of mobile phones from this site http://www.univercell.in/buy/SMART
i tried to test it so i used: scarpy shell http://www.univercell.in/control/AjaxCategoryDetail?productCategoryId=PRO-SMART&category_id=PRO-SMART&attrName=&min=&max=&sortSearchPrice=&VIEW_INDEX=2&VIEW_SIZE=15&serachupload=&sortupload=
But I am not able to connect to this site. As the page is loaded using ajax I found out the start_url using firebug. Can any one suggest me where I am going wrong
回答1:
How about writing a JavaScript script to perform the actions that are already performed when clicking the page number and then simply dump the XML that is returned from the server. I mean try to make the calls to the server as if the site was hosted on your Desktop!
The JavaScript function called when you hit a number is paginateList('numberOfPage') where numberOfPage is the page you want to visit.
The body of the function is
function paginateList(viewIndex) {
var productCategoryId = document.pageSelect.category_id.value;
var viewSize = document.pageSelect.VIEW_SIZE.value;
var min = "";
if(document.pageSelect.min!=null)
min = document.pageSelect.min.value;
var max = "";
if(document.pageSelect.max!=null)
max = document.pageSelect.max.value;
var attrName = "";
if(document.pageSelect.attrName!=null)
attrName = document.pageSelect.attrName.value;
if(attrName==""){
var commaAttr=document.getElementById('commaAttr');
attrName=commaAttr.value;
}
var limitView = 'true';
var sortSearchPrice = "";
if(document.pageSelect.sortSearchPrice!=null)
sortSearchPrice = document.pageSelect.sortSearchPrice.value;
var url2="/control/AjaxCategoryDetail?productCategoryId="+productCategoryId+"&category_id="+productCategoryId+"&attrName="+attrName+"&min="+min+"&max="+max+"&sortSearchPrice="+sortSearchPrice+"&VIEW_INDEX="+viewIndex+"&VIEW_SIZE="+viewSize+"&serachupload=&sortupload=";
pleaseWait('Y');
jQuery.ajax({url: url2,
data: null,
type: 'post',
async: false,
success: function(data) {
$('#searchResult').html(data);
pleaseWait('N');
},
error: function(data) {
alert("Error during product searching");
}
});
Use these to get the data from each page recursively.
Hope it helps!
回答2:
Here's your spider:
from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
class UnivercellItem(Item):
vendor = Field()
model = Field()
price = Field()
BASE_URL = "http://www.univercell.in/control/AjaxCategoryDetail?productCategoryId=PRO-SMART&category_id=PRO-SMART&attrName=&min=&max=&sortSearchPrice=&VIEW_INDEX=%s&VIEW_SIZE=15&serachupload=&sortupload="
class UnivercellSpider(BaseSpider):
name = "univercell_spider"
allowed_domains = ["www.univercell.in"]
start_urls = [BASE_URL % index for index in range(1, 21)]
def parse(self, response):
hxs = HtmlXPathSelector(response)
mobiles = hxs.select("//div[@class='productsummary']")
print mobiles
for mobile in mobiles:
item = UnivercellItem()
item['vendor'] = mobile.select('.//div[1]/div/text()').extract()[0].strip()
item['model'] = mobile.select('.//div[3]/div[1]/a/text()').extract()[0].strip()
item['price'] = mobile.select('.//span[@class="regularPrice"]/span/text()').extract()[0].strip()
yield item
Save it to spider.py and run via scrapy runspider spider.py -o output.json. Then in output.json you will see:
{"model": "T375", "vendor": "LG", "price": "Special Price Click Here"}
{"model": "P725 Optimus 3D Max", "vendor": "LG", "price": "Special Price Click Here"}
{"model": "P705 Optimus L7", "vendor": "LG", "price": "Special Price Click Here"}
{"model": "9320 Curve", "vendor": "Blackberry", "price": "Special Price Click Here"}
{"model": "Xperia Sola", "vendor": "Sony", "price": "Rs.14,500.00"}
{"model": "Xperia U", "vendor": "Sony", "price": "Special Price Click Here"}
{"model": "Lumia 610", "vendor": "Nokia", "price": "Special Price Click Here"}
...
Hope that helps.
来源:https://stackoverflow.com/questions/16796747/how-to-retrieve-data-the-page-is-loaded-using-ajax