Question
I have written some code to parse the name, link and price of each listing from craigslist. Each of the three XPath queries returns its results as a separate list. I tried to combine them as in the code pasted below, but it produces wrong triples: when a listing has no value for one field (for example, no price), the next available value from a later listing is used instead, and every triple after that is shifted. That makes the output useless here. I would appreciate any suggestion on how to get this right, whether with itertools or by any other method.
import requests
from lxml import html
from itertools import zip_longest

Page_link = "http://bangalore.craigslist.co.in/search/rea?s=120"

def parsing_craigslist(url):
    response = requests.get(url)
    tree = html.fromstring(response.text)
    title = tree.xpath("//p[@class='result-info']//a[contains(concat(' ', @class, ' '), ' result-title ')]/text()")
    link = tree.xpath("//p[@class='result-info']//a[contains(concat(' ', @class, ' '), ' result-title ')]/@href")
    price = tree.xpath("//p[@class='result-info']//span[@class='result-price']/text()")
    for i, j, k in zip_longest(title, link, price, fillvalue=None):
        print(i, j, k)

parsing_craigslist(Page_link)
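The misalignment can be reproduced without any scraping. The sketch below uses made-up listings (the titles, links and prices are hypothetical, not taken from the page) in which the second listing has no price. Each XPath query returns only the nodes that exist, so the price list is simply shorter, and zip_longest pads at the end of the short list rather than at the gap.

```python
from itertools import zip_longest

# Hypothetical data: three listings, the second one has no price.
titles = ["Flat A", "Flat B", "Flat C"]
links = ["/a.html", "/b.html", "/c.html"]
prices = ["100", "300"]  # only two price nodes matched, so only two entries

triples = list(zip_longest(titles, links, prices, fillvalue=None))
for triple in triples:
    print(triple)
# ('Flat A', '/a.html', '100')
# ('Flat B', '/b.html', '300')   <- this price belongs to Flat C
# ('Flat C', '/c.html', None)    <- the padding lands at the end
```

Once the three lists have been built, the information about which listing lacked a price is gone, so no choice of fillvalue can restore the pairing.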
Answer 1:
My inclination is to avoid the difficulties that can arise when trying to match up the collections returned by separate XPath queries with a zip. Instead, select each result row first and then examine the entries inside that row, as here.
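import requests
from lxml import html

page = requests.get('http://bangalore.craigslist.co.in/search/rea?s=120').text
tree = html.fromstring(page)
# Each listing is one <li class="result-row">; querying inside it keeps the fields paired.
rows = tree.xpath('.//li[@class="result-row"]')
for row in rows:
    # A listing may have no price; use None instead of raising IndexError.
    prices = row.xpath('.//a/span/text()')
    # [1:] drops the first character (presumably the currency symbol).
    price = prices[0][1:] if prices else None
    link = row.xpath('.//p/a')[0]
    title = link.text
    url = link.attrib['href']
    print('--->', title)
    print(price, ':', url)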
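The same per-row idea can be tried offline. The following is a minimal sketch, not the code from the answer: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without a network connection, and the markup in SAMPLE is a made-up, well-formed stand-in for the result list (the class names follow the question's XPath, everything else is assumed). The helper name parse_rows is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up markup: the second listing has no price span.
SAMPLE = """
<ul>
  <li class="result-row"><p class="result-info">
    <a class="result-title" href="/a.html">Flat A</a>
    <span class="result-price">100</span></p></li>
  <li class="result-row"><p class="result-info">
    <a class="result-title" href="/b.html">Flat B</a></p></li>
  <li class="result-row"><p class="result-info">
    <a class="result-title" href="/c.html">Flat C</a>
    <span class="result-price">300</span></p></li>
</ul>
"""

def parse_rows(markup):
    """Return one (title, link, price) triple per result row."""
    root = ET.fromstring(markup.strip())
    triples = []
    for row in root.findall(".//li[@class='result-row']"):
        anchor = row.find(".//a[@class='result-title']")
        # findtext returns None when the row has no matching price element.
        price = row.findtext(".//span[@class='result-price']")
        triples.append((anchor.text, anchor.get("href"), price))
    return triples

for triple in parse_rows(SAMPLE):
    print(triple)
# ('Flat A', '/a.html', '100')
# ('Flat B', '/b.html', None)
# ('Flat C', '/c.html', '300')
```

With lxml the same shape applies: run the row query once, then use relative XPath expressions (starting with .//) on each row and treat an empty result list as a missing value.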
Source: https://stackoverflow.com/questions/44076626/itertools-within-web-crawler-giving-wrong-triples