Itertools within web_crawler giving wrong triples

Posted by 血红的双手。 on 2019-12-13 07:24:52

Question


I have written some code to parse the name, link and price from craigslist listings. Each XPath query returns its results as a separate list. I tried combining the lists as in the code pasted below, but it gives wrong triples: when a value is missing, that slot is filled with the next available value from another listing, and every later triple shifts as well. That makes the output unusable. I would appreciate any suggestion on how to get this right, whether with itertools or any other method.

import requests
from lxml import html
from itertools import zip_longest

Page_link="http://bangalore.craigslist.co.in/search/rea?s=120"
def parsing_craigslist(url):
    response = requests.get(url)
    tree = html.fromstring(response.text)
    title = tree.xpath("//p[@class='result-info']//a[contains(concat(' ', @class, ' '), ' result-title ')]/text()")
    link = tree.xpath("//p[@class='result-info']//a[contains(concat(' ', @class, ' '), ' result-title ')]/@href")
    price = tree.xpath("//p[@class='result-info']//span[@class='result-price']/text()")
    for i,j,k in zip_longest(title,link,price,fillvalue=None):
        print(i,j,k)

parsing_craigslist(Page_link)
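The symptom can be reproduced without any network access. The following is a minimal sketch with made-up lists (the titles, links and prices are hypothetical sample values, not scraped data), assuming the second of three listings has no price, so the price query returns only two items:

```python
from itertools import zip_longest

# Hypothetical sample data: three listings, the second one has no price,
# so the price query returns only two values.
titles = ["Flat A", "Flat B", "Flat C"]
links = ["/a.html", "/b.html", "/c.html"]
prices = ["$100", "$300"]  # "$300" really belongs to "Flat C"

# zip_longest pairs items purely by position and pads only at the end.
triples = list(zip_longest(titles, links, prices, fillvalue=None))
for triple in triples:
    print(triple)
```

Because the pairing is positional, "Flat B" picks up the price that belongs to "Flat C", and the padding value lands on the last listing instead of the one that actually lacks a price.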

Answer 1:


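My inclination is to avoid zipping the results of separate xpath queries, because the collections fall out of alignment as soon as one listing lacks a value. Instead, do a depth-first search: select each result row first, then examine each entry and pull the title, link and price from within that row, as here.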

import requests
from lxml import html

page = requests.get('http://bangalore.craigslist.co.in/search/rea?s=120').text
tree = html.fromstring(page)
# Select each listing row once, then query only inside that row.
rows = tree.xpath('.//li[@class="result-row"]')
for row in rows:
    # A row without a price yields an empty list, so guard before indexing.
    prices = row.xpath('.//a/span/text()')
    price = prices[0][1:] if prices else None  # [1:] drops the leading character
    link = row.xpath('.//p/a')[0]
    title = link.text
    url = link.attrib['href']
    print('--->', title)
    print(price, ':', url)
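The per-row idea can also be tried offline. The sketch below is an illustration rather than the answer's own code: it swaps lxml for the standard library's xml.etree.ElementTree so that it runs without extra packages, and it parses a small hand-written snippet (SAMPLE, parse_rows and the markup itself are hypothetical, loosely modelled on the class names in the question). It assumes well-formed markup, which real pages often are not, so lxml remains the better fit for live HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a results page.
# The second listing has no price element.
SAMPLE = """
<ul>
  <li class="result-row">
    <p class="result-info">
      <a class="result-title" href="/a.html">Flat A</a>
      <span class="result-price">$100</span>
    </p>
  </li>
  <li class="result-row">
    <p class="result-info">
      <a class="result-title" href="/b.html">Flat B</a>
    </p>
  </li>
</ul>
"""


def parse_rows(markup):
    """Return one (title, url, price) tuple per listing row."""
    root = ET.fromstring(markup.strip())
    results = []
    for row in root.iterfind(".//li[@class='result-row']"):
        anchor = row.find(".//a[@class='result-title']")
        # findtext returns None when the row has no price element,
        # so a missing price stays attached to its own listing.
        price = row.findtext(".//span[@class='result-price']")
        results.append((anchor.text, anchor.get("href"), price))
    return results


listings = parse_rows(SAMPLE)
for title, url, price in listings:
    print(title, url, price)
```

Since every field is looked up inside its own row, a missing price shows up as None for that listing only and cannot shift values between listings.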


Source: https://stackoverflow.com/questions/44076626/itertools-within-web-crawler-giving-wrong-triples
