Scrapy SgmlLinkExtractor question

Submitted by 妖精的绣舞 on 2019-12-01 00:06:38

Question


I am trying to get SgmlLinkExtractor to work.

This is the signature:

SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(), tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True, process_value=None)
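
As I understand it, each pattern in allow is a regular expression matched against the absolute URL of every extracted link. A minimal standalone sketch using the old scrapy.contrib API (the response object here is hypothetical; it would come from a spider callback):

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# Keep only links whose URL matches the /aadler/ pattern.
lx = SgmlLinkExtractor(allow=(r'/aadler/',))

# links = lx.extract_links(response)  # returns Link objects for matching <a>/<area> tags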

I am just using allow=()

So, I enter

rules = (Rule(SgmlLinkExtractor(allow=("/aadler/", )), callback='parse'),)

The initial URL is 'http://www.whitecase.com/jacevedo/', and I am passing allow=('/aadler/',), expecting that '/aadler/' will get scanned as well. But instead, the spider scans the initial URL and then closes:

[wcase] INFO: Domain opened
[wcase] DEBUG: Crawled </jacevedo/> (referer: <None>)
[wcase] INFO: Passed NuItem(school=[u'JD, ', u'Columbia Law School, Harlan Fiske Stone Scholar, Parker School Recognition of Achievement in International and Foreign Law, ', u'2005'])
[wcase] INFO: Closing domain (finished)

What am I doing wrong here?

Is there anyone here who has used Scrapy successfully and can help me finish this spider?

Thank you for the help.

I include the code for the spider below:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from Nu.items import NuItem
from urls import u

class NuSpider(CrawlSpider):
    domain_name = "wcase"
    start_urls = ['xxxxxx/jacevedo/']

    rules = (Rule(SgmlLinkExtractor(allow=("/aadler/", )), callback='parse'),)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        item = NuItem()
        item['school'] = hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')
        return item

SPIDER = NuSpider()

Note: SO will not let me post more than one URL, so substitute the initial URL as necessary. Sorry about that.


Answer 1:


It appears you are overriding the "parse" method. "parse" is a private method in CrawlSpider, used to follow links.
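
A minimal sketch of the fix, keeping the question's spider otherwise unchanged; the callback name parse_item is an illustrative choice, anything other than parse works:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from Nu.items import NuItem

class NuSpider(CrawlSpider):
    domain_name = "wcase"
    start_urls = ['xxxxxx/jacevedo/']

    # Any callback name other than 'parse' avoids clobbering CrawlSpider's own parse().
    rules = (Rule(SgmlLinkExtractor(allow=(r'/aadler/',)), callback='parse_item'),)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = NuItem()
        item['school'] = hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')
        return item

SPIDER = NuSpider()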




Answer 2:


If you check the documentation, a warning is clearly written:

"When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work."

See the CrawlSpider documentation for verification.




Answer 3:


Try using a raw string in the pattern:

allow=(r'/aadler/', ...
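
In context, the full rule might look like this (the callback name parse_item and follow=True are illustrative assumptions, not part of the original answer):

rules = (Rule(SgmlLinkExtractor(allow=(r'/aadler/',)), callback='parse_item', follow=True),)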




Answer 4:


You are missing a comma after the first element; "rules" needs to be a tuple:

rules = (Rule(SgmlLinkExtractor(allow=('/careers/n.\w+', )), callback='parse', follow=True),)
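
The reason the comma matters: parentheses alone do not create a tuple in Python; a one-element tuple needs a trailing comma. A quick check (Python 2, matching the question's era):

>>> type(('x',))
<type 'tuple'>
>>> type(('x'))
<type 'str'>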


Source: https://stackoverflow.com/questions/1809817/scrapy-sgmllinkextractor-question
