Scrapy & captcha

陌路散爱 submitted on 2019-12-05 17:35:34

The approach I use, which usually works quite well, looks like this (just a gist; you need to add your site-specific details):

Step 1 - getting the captcha url (and keeping the form's response for later)

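from scrapy import Request

def parse_page_with_captcha(self, response):
    # the XPath is site-specific; it must select the captcha image's src attribute
    captcha_url = response.urljoin(response.xpath(...).get())
    data_for_later = {'captcha_form': response}  # store the response for later use
    return Request(captcha_url, callback=self.parse_captcha_download, meta=data_for_later)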

Step 2 - Scrapy now downloads the image, and we have to process it in a Scrapy callback

from io import BytesIO

import scrapy
from PIL import Image

def parse_captcha_download(self, response):
    captcha_target_filename = 'filename.png'
    # save the image for processing (response.body is bytes, so use BytesIO)
    i = Image.open(BytesIO(response.body))
    i.save(captcha_target_filename)

    # process the captcha (OCR, or sending it to a decaptcha service, etc ...)
    # solve_captcha is a placeholder you have to implement yourself
    captcha_text = solve_captcha(captcha_target_filename)

    # and now we have all the data we need for building the form request
    captcha_form = response.meta['captcha_form']

    # pass plain values: FormRequest URL-encodes formdata itself
    return scrapy.FormRequest.from_response(
        captcha_form,
        formdata={
            'message': 'itttttttt', 'security': captcha_text, 'name': 'fx',
            'category_id': '2', 'email': 'ololo@gmail.com', 'item_id': '216640_2',
            'location': '18', 'send_message': 'Send Message',
        },
        callback=self.afterForm)

Important details

Captcha-protected forms need some way to link a captcha image to the particular user/client who saw and answered it. This is usually done with cookie-based sessions, or with special parameters / image tokens hidden in the captcha form.

The scraper code must be careful not to break this link; otherwise it will answer a captcha, but not the one the form expects.
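
To make that link concrete, here is a toy model in plain Python (no Scrapy, no network). The CaptchaSite class, its method names and the session-id scheme are all made up for illustration; real sites differ in the details. It only shows that an answer is accepted when it is submitted under the same session that requested the image, and rejected otherwise.

```python
import itertools


class CaptchaSite:
    """Toy stand-in for a site that ties each captcha to a session cookie."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._expected = {}  # session id -> captcha text issued to that session

    def get_captcha(self, session_id=None):
        # A client that sends no cookie is treated as a brand-new visitor.
        if session_id is None:
            session_id = f"session-{next(self._counter)}"
        text = f"captcha-{next(self._counter)}"  # stands in for the image text
        self._expected[session_id] = text
        return session_id, text

    def submit(self, session_id, answer):
        # The answer is checked against the captcha issued to *this* session.
        return self._expected.get(session_id) == answer


site = CaptchaSite()

# Loading the form page: the site hands out a session cookie.
form_session, _ = site.get_captcha()

# Case 1: the image is fetched with the same cookie.
_, text_same = site.get_captcha(form_session)
same_session_ok = site.submit(form_session, text_same)

# Case 2: the image is fetched without the cookie (e.g. a bare urllib call),
# so the site issues it to a different session.
_, text_other = site.get_captcha()
other_session_ok = site.submit(form_session, text_other)

print(same_session_ok, other_session_ok)  # True False
```

In case 2 the captcha was solved correctly, but it belongs to a session the form submission does not use, which is exactly the failure described above.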

Why doesn't it work with the two examples Verz1Lka posted?

The urllib.urlretrieve approach works completely outside of Scrapy. That is generally a bad idea (it gives up the benefits of Scrapy's scheduling etc.), but the major problem here is that the request is made without any of the session cookies, URL parameters etc. that the target site uses to track which captcha was sent to a particular browser.
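
As a rough illustration of what such an out-of-band request is missing, the sketch below builds a urllib request that carries the session cookie along. The cookie_header helper, the PHPSESSID value and the example.com URL are hypothetical, and the request is only constructed, not sent. It covers cookies only; hidden form tokens or URL parameters would have to be carried over separately, which is one more reason to let Scrapy do the download.

```python
from http.cookies import SimpleCookie
from urllib.request import Request


def cookie_header(set_cookie_values):
    """Collapse Set-Cookie header values into a single Cookie header value."""
    jar = SimpleCookie()
    for value in set_cookie_values:
        jar.load(value)  # attributes such as path or HttpOnly are not sent back
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())


# Hypothetical Set-Cookie value, as the form page might have sent it.
set_cookie = ["PHPSESSID=abc123; path=/; HttpOnly"]

request = Request(
    "http://example.com/captcha.png",  # placeholder URL
    headers={"Cookie": cookie_header(set_cookie)},
)
print(request.get_header("Cookie"))  # PHPSESSID=abc123
```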

The approach using the image pipeline, on the other hand, plays nicely by Scrapy's rules, but these image downloads are scheduled to run at a later time, so the captcha image won't be available when it is needed.

You are downloading a different captcha image because you are not using the same cookie you received when you opened the form URL. Scrapy manages cookies by itself, so it is better to use Scrapy to download the images as well. https://doc.scrapy.org/en/latest/topics/media-pipeline.html
