Scrapy & captcha

Question

I use Scrapy to submit a form on the site https://www.barefootstudent.com/jobs (on any job page linked from it, e.g. http://www.barefootstudent.com/los_angeles/jobs/full_time/full_time_nanny_needed_in_venice_217021)

My Scrapy bot logs in successfully, but I cannot get past the captcha. To submit the form I use scrapy.FormRequest.from_response:

frq = scrapy.FormRequest.from_response(response, formdata={'message': 'itttttttt',
                                   'security': captcha, 'name': 'fx',
                                   'category_id': '2', 'email': 'ololo%40gmail.com', 'item_id': '216640_2', 'location': '18', 'send_message': 'Send%20Message'
                                   }, callback=self.afterForm)
yield frq

I want to load the captcha image from this page and type its text in manually while the script is running, e.g.:

captcha = raw_input("put captcha in manually>")

I tried:

 urllib.urlretrieve(captcha, "./captcha.jpg")

But this downloads the wrong captcha (the site rejects my input). When I call urllib.urlretrieve repeatedly within one run of the script, it returns a different captcha every time :(

After that I tried using ImagesPipeline. But my problem is that the yielded item (i.e. the image download) is only processed after the function has finished executing, even though I use yield.

 item = BfsItem()
 item['image_urls'] = [captcha]
 yield item
 captcha = raw_input("put captcha in manually>")  
 frq = scrapy.FormRequest.from_response(response, formdata={'message': 'itttttttt', 
                                   'security': captcha, 'name': 'fx',
                                   'category_id': '2', 'email': 'ololo%40gmail.com', 'item_id': '216640_2', 'location': '18', 'send_message': 'Send%20Message'
                                   }, callback=self.afterForm)
 yield frq

At the moment my script asks for input, the picture has not been downloaded yet!

How can I modify my script so that the FormRequest is sent after I type in the captcha manually?

Thank you very much!

Answer 1:

The approach I use, which usually works quite well, looks like this (just a gist; you need to add your specific details):

Step 1 - getting the captcha URL (and keeping the form's response for later)

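from scrapy import Request

def parse_page_with_captcha(self, response):
    # the image src may be relative, so resolve it against the page URL
    captcha_url = response.urljoin(response.xpath(...).get())
    data_for_later = {'captcha_form': response}  # store the response for later use
    return Request(captcha_url, callback=self.parse_captcha_download, meta=data_for_later)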

Step 2 - now Scrapy will download the image and we have to process it properly in a Scrapy callback

from io import BytesIO

import scrapy
from PIL import Image

def parse_captcha_download(self, response):
    captcha_target_filename = 'filename.png'
    # save the image for processing (response.body is bytes, hence BytesIO)
    i = Image.open(BytesIO(response.body))
    i.save(captcha_target_filename)

    # process the captcha (OCR, or sending it to a decaptcha service, etc ...)
    captcha_text = solve_captcha(captcha_target_filename)

    # and now we have all the data we need for building the form request
    captcha_form = response.meta['captcha_form']

    # FormRequest URL-encodes the values itself, so pass them unencoded
    # ('ololo%40gmail.com' would be sent as 'ololo%2540gmail.com')
    return scrapy.FormRequest.from_response(captcha_form, formdata={'message': 'itttttttt',
                               'security': captcha_text, 'name': 'fx',
                               'category_id': '2', 'email': 'ololo@gmail.com', 'item_id': '216640_2', 'location': '18', 'send_message': 'Send Message'
                               }, callback=self.afterForm)
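
Since the question asks for typing the captcha in by hand, solve_captcha could be as simple as the sketch below. This is only one possible implementation, not part of Scrapy: the `prompt` parameter is an assumption added so the function can be tried without a terminal. Keep in mind that waiting for keyboard input pauses the whole crawl, because Scrapy runs its callbacks in a single thread; for a one-off manual run that is usually acceptable.

```python
def solve_captcha(filename, prompt=input):
    """Ask a human to read the saved captcha image and type its text.

    `prompt` is a hypothetical hook (it defaults to the built-in input)
    that makes the function usable without a real terminal.
    """
    answer = prompt("Open %s and type the captcha> " % filename)
    # drop stray whitespace picked up from the terminal
    return answer.strip()
```

Because the image has already been written to disk by the time this function is called, you can open the file in any viewer before answering the prompt.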

Important details

Captcha-protected forms need some way to link a captcha image to the particular user/client who saw and answered it. This is usually done with cookie-based sessions or with special parameters / image tokens hidden in the captcha form.

The scraper code must be careful not to destroy this link; otherwise it will answer some captcha, but not the one it was actually shown.
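
To check whether a form carries such a hidden token, you can list its hidden inputs. The following is a minimal standard-library sketch for inspection only; the sample markup and the field name `captcha_token` are made up for illustration and are not taken from the site in the question. Note that FormRequest.from_response already copies the form's pre-filled fields (hidden ones included) into the request, which is one reason to build the request from the stored form response.

```python
from html.parser import HTMLParser


class HiddenFieldCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def hidden_fields(html):
    collector = HiddenFieldCollector()
    collector.feed(html)
    return collector.fields


# Made-up form markup; the field name "captcha_token" is an assumption.
SAMPLE_FORM = """
<form method="post" action="/send">
  <input type="hidden" name="captcha_token" value="a1b2c3">
  <input type="text" name="security">
  <img src="/captcha.png?token=a1b2c3">
</form>
"""

print(hidden_fields(SAMPLE_FORM))  # {'captcha_token': 'a1b2c3'}
```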

Why does it not work with the two examples Verz1Lka posted?

The urllib.urlretrieve approach works completely outside of Scrapy. While this is generally a bad idea (it does not get the benefits of Scrapy's scheduling etc.), the major problem here is that the request is made without any of the session cookies, URL parameters etc. that the target site uses to track which captcha was sent to a particular browser.

The approach using the images pipeline, on the other hand, plays nicely by Scrapy's rules, but these image downloads are scheduled to run at a later time, so the captcha image is not yet available when it is needed.
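
The effect can be reproduced with a toy example that uses only the standard library. The sketch below is an illustration under assumptions, not a model of the real site: `ToyCaptchaSite` is a hypothetical local server that ties its "captcha" text to a `sid` session cookie. A client that keeps its cookies sees the same captcha on every request, while a cookieless client (which is how urlretrieve behaves here) gets a new one each time.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from itertools import count

_session_ids = count(1)


class ToyCaptchaSite(BaseHTTPRequestHandler):
    """Hypothetical site: the captcha text depends on the session cookie."""

    def do_GET(self):
        cookie = self.headers.get("Cookie", "")
        new_session = not cookie.startswith("sid=")
        sid = str(next(_session_ids)) if new_session else cookie[len("sid="):]
        body = ("captcha-for-session-" + sid).encode()
        self.send_response(200)
        if new_session:
            self.send_header("Set-Cookie", "sid=" + sid)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo output quiet
        pass


def fetch(opener, url):
    with opener.open(url) as resp:
        return resp.read().decode()


def demo():
    server = HTTPServer(("127.0.0.1", 0), ToyCaptchaSite)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/captcha" % server.server_port
    no_proxy = urllib.request.ProxyHandler({})  # ignore any system proxy
    try:
        # An opener with a cookie jar behaves like one browser session.
        jar = http.cookiejar.CookieJar()
        session = urllib.request.build_opener(
            no_proxy, urllib.request.HTTPCookieProcessor(jar))
        with_session = [fetch(session, url), fetch(session, url)]
        # A cookieless opener is treated as a new visitor on every request.
        bare = urllib.request.build_opener(no_proxy)
        without_session = [fetch(bare, url), fetch(bare, url)]
    finally:
        server.shutdown()
        server.server_close()
    return with_session, without_session


if __name__ == "__main__":
    print(demo())
```

Inside a spider you get the "with session" behaviour by requesting the image through a normal Scrapy Request, since Scrapy attaches the session cookies it has already received.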
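
The ordering problem can be pictured with a deliberately simplified model. The `toy_engine` below is hypothetical and is not how Scrapy's engine is implemented; it only mimics the one relevant property, namely that a yielded item is merely queued and the callback keeps running (and asks for input) before any download has happened.

```python
from collections import deque


def callback(log):
    """Mimics the question's callback: yield an image item, then ask for input."""
    yield {"image_urls": ["captcha.jpg"]}
    log.append("asking for captcha text")  # stands in for raw_input()
    yield "form request"


def toy_engine(callback):
    """Hypothetical, heavily simplified stand-in for a crawler engine."""
    log = []
    download_queue = deque()
    for result in callback(log):
        if isinstance(result, dict):
            # The item is only queued here; nothing is downloaded yet.
            download_queue.extend(result["image_urls"])
            log.append("queued download")
        else:
            log.append("got " + result)
    # In this model, downloads run only after the callback has finished.
    while download_queue:
        log.append("downloaded " + download_queue.popleft())
    return log


print(toy_engine(callback))
# ['queued download', 'asking for captcha text', 'got form request', 'downloaded captcha.jpg']
```

Requesting the image with its own Request and callback, as in Step 1 and Step 2, avoids this because the form request is only built once the image response has arrived.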

Answer 2:

You are downloading a different captcha image because you are not using the same cookie you received when you opened the form URL. Scrapy manages cookies by itself, so it is better to download the image with Scrapy as well. https://doc.scrapy.org/en/latest/topics/media-pipeline.html

Source: https://stackoverflow.com/questions/27948326/scrapy-captcha

Tags

python

scrapy

captcha