We've been using the scrapy-splash middleware to pass the scraped HTML source through the Splash
JavaScript engine running inside a Docker container.
If we
For Windows users running Docker Toolbox:
Replace the single quotes with double quotes to avoid the "invalid hostname:http" error.
Change localhost to the Docker IP address shown below the whale logo; for me it was 192.168.99.100.
Finally I got this:
scrapy shell "http://192.168.99.100:8050/render.html?url=https://samplewebsite.com/category/banking-insurance-financial-services/"
You can run scrapy shell without arguments inside a configured Scrapy project, then create req = scrapy_splash.SplashRequest(url, ...) and call fetch(req).
Just wrap the URL you want to open in the shell in the Splash HTTP API.
So you would want something like:
scrapy shell 'http://localhost:8050/render.html?url=http://domain.com/page-with-javascript.html&timeout=10&wait=0.5'
where:
localhost:port is where your Splash service is running
url is the URL you want to crawl; don't forget to URL-quote it!
render.html is one of the possible HTTP API endpoints; in this case it returns the rendered HTML page
timeout is the timeout in seconds
wait is the time in seconds to wait for JavaScript to execute before reading/saving the HTML.
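
Since the target URL has to be URL-quoted, it can be easier to build the render.html address in Python than by hand. Below is a minimal sketch using only the standard library; the helper name splash_render_url and its default values are made up for illustration, and it assumes Splash is listening on localhost:8050 as in the command above.

```python
from urllib.parse import urlencode


def splash_render_url(target_url, splash_base="http://localhost:8050",
                      timeout=10, wait=0.5):
    """Return a Splash render.html URL for target_url.

    Hypothetical helper, not part of Scrapy or Splash.
    urlencode percent-encodes the target URL, so characters such as
    ':', '/', '?' and '&' inside it cannot be mistaken for parts of
    the Splash query string.
    """
    query = urlencode({"url": target_url, "timeout": timeout, "wait": wait})
    return f"{splash_base}/render.html?{query}"


if __name__ == "__main__":
    # Print the address to paste (inside quotes) after "scrapy shell".
    print(splash_render_url("http://domain.com/page-with-javascript.html"))
```

For Docker Toolbox, pass the Docker IP instead, for example splash_base="http://192.168.99.100:8050". The printed address still contains & characters, so keep it inside quotes when passing it to scrapy shell.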