Question
I am trying to use Scrapy to scrape a website which, for some reason, encodes its POST requests as "multipart/form-data".
Is there a way to override Scrapy's default behavior of posting with "application/x-www-form-urlencoded"?
It looks like the site is not responding to the spider because it expects its requests posted as "multipart/form-data".
I have tried multipart-encoding the form variables, but I have seen with Wireshark that Scrapy still sets the Content-Type header incorrectly regardless of this encoding.
Answer 1:
Just use scrapy.http.FormRequest instead of scrapy.Request, passing the parameters in the formdata argument.
Sample code:
import scrapy
from scrapy.http import FormRequest

class MySpider(scrapy.Spider):
    # ...
    def start_requests(self):
        yield FormRequest(some_post_url,
                          formdata=dict(param1='value1', param2='value2'))
Read more:
- Request usage examples
- FormRequest objects
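For reference, the default encoding that FormRequest applies to formdata is plain URL encoding, which can be sketched with the standard library alone (the dict values here are just placeholders):

```python
# FormRequest encodes formdata as application/x-www-form-urlencoded,
# i.e. the same body that urllib.parse.urlencode builds.
from urllib.parse import urlencode

body = urlencode({'param1': 'value1', 'param2': 'value2'})
print(body)  # param1=value1&param2=value2
```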
Answer 2:
You could use this MultipartRequest:
Code:
import mimetypes
import random
from io import StringIO

from scrapy import Request


class MultipartRequest(Request):
    def __init__(self, *args, **kwargs):
        formdata = kwargs.pop('formdata', None)
        files = kwargs.pop('files', None)
        kwargs['method'] = 'POST'
        super().__init__(*args, **kwargs)
        self._boundary = '-----------------------------{0}'.format(int(random.random() * 1e10))
        if formdata or files:
            buffer = StringIO()
            if formdata:
                self._write_formdata(formdata, buffer)
            if files:
                self._write_files(files, buffer)
            # the terminating boundary closes the whole multipart body
            buffer.write('--{0}--\r\n'.format(self._boundary))
            self.headers['Content-Type'] = 'multipart/form-data; boundary={0}'.format(self._boundary)
            self._set_body(buffer.getvalue())

    def _write_formdata(self, formdata, buffer):
        for key, value in formdata.items():
            buffer.write('--{0}\r\n'.format(self._boundary))
            buffer.write('Content-Disposition: form-data; name="{0}"\r\n'.format(key))
            buffer.write('\r\n')  # blank line separates part headers from the value
            buffer.write('{0}\r\n'.format(value))

    def _write_files(self, files, buffer):
        # each entry in files is a (field name, filename, file-like or str) tuple
        for key, filename, fd in files:
            buffer.write('--{0}\r\n'.format(self._boundary))
            buffer.write('Content-Disposition: form-data; name="{0}"; filename="{1}"\r\n'.format(key, filename))
            buffer.write('Content-Type: {0}\r\n'.format(self.get_content_type(filename)))
            buffer.write('\r\n')
            if isinstance(fd, str):
                buffer.write(fd)
            else:
                buffer.write(fd.getvalue())
            buffer.write('\r\n')

    def get_content_type(self, filepath):
        return mimetypes.guess_type(filepath)[0] or 'application/octet-stream'
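The wire format that the class above emits for text fields can be sketched with the standard library only; this is a minimal stand-alone illustration (the helper name is hypothetical, not part of Scrapy):

```python
import random

def build_multipart_body(formdata):
    """Sketch: build a multipart/form-data body for text fields only."""
    boundary = '-----------------------------{0}'.format(int(random.random() * 1e10))
    lines = []
    for key, value in formdata.items():
        lines.append('--' + boundary)
        lines.append('Content-Disposition: form-data; name="{0}"'.format(key))
        lines.append('')  # blank line separates part headers from the value
        lines.append(str(value))
    lines.append('--' + boundary + '--')  # terminating boundary
    return boundary, '\r\n'.join(lines) + '\r\n'

boundary, body = build_multipart_body({'username': 'example'})
# the request would then carry the header:
# Content-Type: multipart/form-data; boundary=<boundary>
```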
Answer 3:
I have spent way more time on this than I would have liked to, so here is my rundown of the situation in Scrapy.
Background
multipart/form-data content types follow a particular encoding that you need to reproduce. You can see an example by inspecting the Network tab in the Developer Tools of any major browser while sending this type of request. Here is an example of a multipart/form-data request body/payload:
-----------------------------9128252932315252835063017
Content-Disposition: form-data; name="username"

tibor.udvari
-----------------------------9128252932315252835063017
Content-Disposition: form-data; name="passwd"

secret
-----------------------------9128252932315252835063017
Content-Disposition: form-data; name="_mode"

edit
-----------------------------9128252932315252835063017--
You will also have to set the appropriate Content-Type and Content-Length headers.
Implementation
At the time of writing, Scrapy 1.4 does not have a built-in way to send multipart/form-data requests, so you have to construct the POST request yourself.
First, build the request body out of your data; here I am using the MultipartEncoder class from requests-toolbelt to do so.
from requests_toolbelt import MultipartEncoder

formdata = {'username': 'example', 'password': 'example'}
me = MultipartEncoder(fields=formdata)
me_boundary = me.boundary[2:]  # needed in headers (strip the leading '--')
me_length = me.len             # needed in headers
me_body = me.to_string()       # contains the request body
The next step is to create the request with valid headers:
headers = {
    'Content-Type': 'multipart/form-data; charset=utf-8; boundary=' + me_boundary,
    'Content-Length': str(me_length)  # header values should be strings
}
r = scrapy.Request(url='https://example.com', method='POST', body=me_body, headers=headers)
Sending this request should yield a valid response; if it is somehow malformed, you should get a server response saying something like "upload error".
Assuming you are using the scrapy shell, you can now send out the request:
fetch(r)
Limitations
I have only tested this with text inputs, handling files might require more steps.
Source: https://stackoverflow.com/questions/26947131/python-scrapy-override-content-type-to-be-multipart-form-data-on-post-request