Question
I am trying to use Scrapy to scrape a website which, for some reason, encodes its POST requests as "multipart/form-data".
Is there a way to override Scrapy's default behavior of posting with "application/x-www-form-urlencoded"?
It looks like the site is not responding to the spider because it expects its requests posted as "multipart/form-data".
I have tried multipart-encoding the form variables, but I have seen with Wireshark that Scrapy still sets the Content-Type header incorrectly regardless of this encoding.
Answer 1:
Just use scrapy.http.FormRequest instead of scrapy.Request, passing the parameters in the formdata argument.
Sample code:
import scrapy
from scrapy.http import FormRequest

class MySpider(scrapy.Spider):
    # ...
    def start_requests(self):
        yield FormRequest(some_post_url,
                          formdata=dict(param1='value1', param2='value2'))
Read more:
- Request usage examples
- FormRequest objects
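For reference, the default encoding that FormRequest applies to formdata is plain URL encoding, which can be sketched with the standard library alone (the dict values here are just placeholders):

```python
# FormRequest encodes formdata as application/x-www-form-urlencoded,
# i.e. the same body that urllib.parse.urlencode builds.
from urllib.parse import urlencode

body = urlencode({'param1': 'value1', 'param2': 'value2'})
print(body)  # param1=value1&param2=value2
```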
Answer 2:
You could use this MultipartRequest:
Code:
import mimetypes
import random
from io import StringIO

from scrapy import Request


class MultipartRequest(Request):
    def __init__(self, *args, **kwargs):
        formdata = kwargs.pop('formdata', None)
        files = kwargs.pop('files', None)
        kwargs['method'] = 'POST'
        super().__init__(*args, **kwargs)
        self._boundary = '-----------------------------{0}'.format(int(random.random() * 1e10))
        if formdata or files:
            buffer = StringIO()
            if formdata:
                self._write_formdata(formdata, buffer)
            if files:
                self._write_files(files, buffer)
            # the terminating boundary closes the whole multipart body
            buffer.write('--{0}--\r\n'.format(self._boundary))
            self.headers['Content-Type'] = 'multipart/form-data; boundary={0}'.format(self._boundary)
            self._set_body(buffer.getvalue())

    def _write_formdata(self, formdata, buffer):
        for key, value in formdata.items():
            buffer.write('--{0}\r\n'.format(self._boundary))
            buffer.write('Content-Disposition: form-data; name="{0}"\r\n'.format(key))
            buffer.write('\r\n')  # blank line separates part headers from the value
            buffer.write('{0}\r\n'.format(value))

    def _write_files(self, files, buffer):
        # each entry in files is a (field name, filename, file-like or str) tuple
        for key, filename, fd in files:
            buffer.write('--{0}\r\n'.format(self._boundary))
            buffer.write('Content-Disposition: form-data; name="{0}"; filename="{1}"\r\n'.format(key, filename))
            buffer.write('Content-Type: {0}\r\n'.format(self.get_content_type(filename)))
            buffer.write('\r\n')
            if isinstance(fd, str):
                buffer.write(fd)
            else:
                buffer.write(fd.getvalue())
            buffer.write('\r\n')

    def get_content_type(self, filepath):
        return mimetypes.guess_type(filepath)[0] or 'application/octet-stream'
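The wire format that the class above emits for text fields can be sketched with the standard library only; this is a minimal stand-alone illustration (the helper name is hypothetical, not part of Scrapy):

```python
import random

def build_multipart_body(formdata):
    """Sketch: build a multipart/form-data body for text fields only."""
    boundary = '-----------------------------{0}'.format(int(random.random() * 1e10))
    lines = []
    for key, value in formdata.items():
        lines.append('--' + boundary)
        lines.append('Content-Disposition: form-data; name="{0}"'.format(key))
        lines.append('')  # blank line separates part headers from the value
        lines.append(str(value))
    lines.append('--' + boundary + '--')  # terminating boundary
    return boundary, '\r\n'.join(lines) + '\r\n'

boundary, body = build_multipart_body({'username': 'example'})
# the request would then carry the header:
# Content-Type: multipart/form-data; boundary=<boundary>
```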
Answer 3:
I have spent way more time on this than I would have liked to, so here is my rundown of the situation in Scrapy.
Background
multipart/form-data content types follow a particular encoding that you need to reproduce. You can see an example by inspecting the Network tab in the Developer Tools of any major browser while sending this type of request. Here is an example of a multipart/form-data request body/payload:
-----------------------------9128252932315252835063017
Content-Disposition: form-data; name="username"

tibor.udvari
-----------------------------9128252932315252835063017
Content-Disposition: form-data; name="passwd"

secret
-----------------------------9128252932315252835063017
Content-Disposition: form-data; name="_mode"

edit
-----------------------------9128252932315252835063017--
You will also have to set the appropriate Content-Type and Content-Length headers.
Implementation
At the time of writing, Scrapy 1.4 does not have a built-in way to send multipart/form-data requests, so you have to construct the POST request yourself.
First, build the request body out of your data; here I am using the MultipartEncoder class from requests-toolbelt to do so.
from requests_toolbelt import MultipartEncoder

formdata = {'username': 'example', 'password': 'example'}
me = MultipartEncoder(fields=formdata)
me_boundary = me.boundary[2:]  # needed in headers (strip the leading '--')
me_length = me.len             # needed in headers
me_body = me.to_string()       # contains the request body
The next step is to create the request with valid headers:
headers = {
    'Content-Type': 'multipart/form-data; charset=utf-8; boundary=' + me_boundary,
    'Content-Length': str(me_length)  # header values should be strings
}
r = scrapy.Request(url='https://example.com', method='POST', body=me_body, headers=headers)
Sending this request should yield a valid response; if it is somehow malformed, you should get a server response saying something like "upload error".
Assuming you are using the scrapy shell, you can now send out the request:
fetch(r)
Limitations
I have only tested this with text inputs, handling files might require more steps.
Source: https://stackoverflow.com/questions/26947131/python-scrapy-override-content-type-to-be-multipart-form-data-on-post-request