scrapy

scrape multiple addresses from multiple files in scrapy

独自空忆成欢 submitted on 2021-01-28 08:32:11
Question: I have some JSON files in a directory. Each of these files contains information I need; the first property I need is the list of links to use as "start_urls" in Scrapy. Every file is for a different process, so its output must be kept separate. That means I can't put the links from all the JSON files into start_urls and run them together; I have to run the spider once per file. How can I do this? Here is my code so far:

import scrapy
from os import listdir
from os.path import isfile, join
import json
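A minimal sketch of one way to do this, assuming each JSON file keeps its URL list under a hypothetical "links" key and that each input file should get its own output file: schedule one crawl per file on a single CrawlerProcess, handing each spider instance its own start_urls and output path. The spider name, input directory and item fields are placeholders.

import json
from os import listdir
from os.path import isfile, join

import scrapy
from scrapy.crawler import CrawlerProcess


class FileSpider(scrapy.Spider):
    name = "file_spider"

    def __init__(self, start_urls=None, outfile=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = start_urls or []
        self.outfile = outfile

    def parse(self, response):
        # Collect whatever you need; the page title is used as a placeholder.
        item = {"url": response.url, "title": response.css("title::text").get()}
        # Append each item to this spider's own output file.
        with open(self.outfile, "a", encoding="utf-8") as f:
            f.write(json.dumps(item) + "\n")
        yield item


if __name__ == "__main__":
    directory = "json_inputs"  # hypothetical input directory
    process = CrawlerProcess()
    for name in listdir(directory):
        path = join(directory, name)
        if not isfile(path) or not name.endswith(".json"):
            continue
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        # Schedule one crawl per file, each with its own URLs and output file.
        process.crawl(FileSpider, start_urls=data["links"], outfile=name + ".out.jl")
    process.start()  # run all scheduled crawls, block until done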

Scraping a website that contains _dopostback method written with URL hidden

雨燕双飞 submitted on 2021-01-28 07:27:45
Question: I am new to Scrapy. I am trying to scrape an ASP website that contains various profiles. It has a total of 259 pages. To navigate the pages, there are several links at the bottom like 1, 2, 3, and so on. These links use __doPostBack: href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$RepeaterPaging$ctl00$Pagingbtn','')". For each page only the bolded part of the control id changes. How do I use Scrapy to iterate over the pages and extract the information? The form data is as follows: _
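As a sketch of how this kind of __doPostBack pagination is often handled: submit the page's form with FormRequest.from_response, setting __EVENTTARGET to the paging control named in the href above. The URL, the item selector and the exact control-id pattern are assumptions, not the real site's values.

import scrapy


class ProfilesSpider(scrapy.Spider):
    name = "profiles"
    start_urls = ["https://example.com/profiles.aspx"]  # hypothetical URL
    max_page = 259

    def parse(self, response, page=1):
        # Extract the profile data on the current page here.
        for row in response.css("div.profile"):  # hypothetical selector
            yield {"name": row.css("::text").get()}

        if page < self.max_page:
            next_page = page + 1
            # FormRequest.from_response copies the hidden ASP.NET fields
            # (__VIEWSTATE, __EVENTVALIDATION, ...) from the current page.
            yield scrapy.FormRequest.from_response(
                response,
                formdata={
                    # Control-id pattern taken from the question's href; assumed.
                    "__EVENTTARGET": f"ctl00$ContentPlaceHolder1$RepeaterPaging$ctl{next_page:02d}$Pagingbtn",
                    "__EVENTARGUMENT": "",
                },
                dont_click=True,
                callback=self.parse,
                cb_kwargs={"page": next_page},
            )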

Running Scrapy multiple times in the same process

大兔子大兔子 submitted on 2021-01-28 07:04:24
Question: I have a list of URLs and I want to crawl each of them. Note that adding this array as start_urls is not the behavior I'm looking for; I would like the URLs to be crawled one by one, in separate crawl sessions. I want to run Scrapy multiple times in the same process, as a script (as covered in Common Practices) and not from the CLI. The following code is a full, broken, copy-pastable example. It basically tries to loop through a list of URLs and start the crawler on each of them. This
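A minimal sketch of the pattern the Common Practices docs describe for this: use CrawlerRunner and chain the crawls with inlineCallbacks so each crawl finishes before the next one starts. MySpider and the URL list are placeholders.

from twisted.internet import defer, reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging


class MySpider(scrapy.Spider):
    name = "single_url"

    def parse(self, response):
        yield {"url": response.url, "status": response.status}


urls = ["https://example.com/a", "https://example.com/b"]  # hypothetical list

configure_logging()
runner = CrawlerRunner()


@defer.inlineCallbacks
def crawl_sequentially():
    for url in urls:
        # Each yield waits for the previous crawl to finish before starting the next.
        yield runner.crawl(MySpider, start_urls=[url])
    reactor.stop()


crawl_sequentially()
reactor.run()  # blocks until crawl_sequentially() stops the reactor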

Problem with __VIEWSTATE, __EVENTVALIDATION, __EVENTTARGET and scrapy & splash

别说谁变了你拦得住时间么 submitted on 2021-01-28 06:04:35
Question: How do I handle __VIEWSTATE, __EVENTVALIDATION and __EVENTTARGET with Scrapy/Splash? I tried:

return FormRequest.from_response(response, [...]
    '__VIEWSTATE': response.css('input#__VIEWSTATE::attr(value)').extract_first(),

but this does not work.

Answer 1: You'll need to pass a dict as the formdata keyword argument (I'd also recommend extracting the values into variables first, for readability):

def parse(self, response):
    vs = response.css('input#__VIEWSTATE::attr(value)').extract_first()
    ev = # another extraction
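A sketch of how the answer's approach could be completed, extracting both hidden fields into variables and passing them through formdata; the page URL and the __EVENTTARGET control name are hypothetical.

import scrapy
from scrapy import FormRequest


class AspSpider(scrapy.Spider):
    name = "asp_form"
    start_urls = ["https://example.com/page.aspx"]  # hypothetical URL

    def parse(self, response):
        vs = response.css("input#__VIEWSTATE::attr(value)").extract_first()
        ev = response.css("input#__EVENTVALIDATION::attr(value)").extract_first()
        yield FormRequest.from_response(
            response,
            formdata={
                "__VIEWSTATE": vs,
                "__EVENTVALIDATION": ev,
                "__EVENTTARGET": "ctl00$SomeButton",  # hypothetical control name
            },
            callback=self.after_post,
        )

    def after_post(self, response):
        yield {"url": response.url}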

Scrapy: is there a way to print a JSON file without using the -o -t parameters

情到浓时终转凉″ submitted on 2021-01-28 05:55:45
Question: I usually call my spider like this:

scrapy crawl Spider -o fileName -t json

and I get the correct data written to the fileName file, formatted as JSON. Now I want to call my spider like this:

scrapy crawl Spider

My question: is there a way to write the output to a file without using the -o -t parameters?

Answer 1: Yes, it can be done. Add this to your settings:

FEED_EXPORTERS = {
    'jsonlines': 'scrapy.contrib.exporter.JsonLinesItemExporter',
}
FEED_FORMAT = 'jsonlines'
FEED_URI = "NAME_OF_FILE"
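As a side note, recent Scrapy versions replace FEED_URI / FEED_FORMAT with the single FEEDS setting; a sketch of the equivalent settings.py entry, with a placeholder output path:

# settings.py (Scrapy >= 2.1); the output path is a placeholder
FEEDS = {
    "output.jl": {"format": "jsonlines", "encoding": "utf8"},
}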

500 Internal server error scrapy

你。 submitted on 2021-01-28 05:26:33
Question: I am using Scrapy to crawl a product website with over 4 million products. However, after crawling around 50k products it starts throwing HTTP 500 errors. I have disabled AutoThrottle because, with it enabled, the crawl is very slow and would take around 20-25 days to complete. I think the server starts temporarily blocking the crawler after some time. Are there any solutions? I am using a sitemap crawler - I want to extract some information from the URL itself if the server is not
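No answer is included above, but as a sketch of the settings that are usually tuned when a site starts returning 500s part-way through a large crawl (the concrete numbers are placeholders to experiment with):

# settings.py
DOWNLOAD_DELAY = 0.5                    # small fixed delay between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 4
RETRY_ENABLED = True
RETRY_TIMES = 5                         # retry temporarily blocked requests
RETRY_HTTP_CODES = [500, 502, 503, 504, 429]
AUTOTHROTTLE_ENABLED = True             # re-enable, but with a tighter ceiling
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0
AUTOTHROTTLE_MAX_DELAY = 10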

Scrapy: saving cookies between invocations

久未见 submitted on 2021-01-28 03:13:49
Question: Is there a way to preserve cookies between invocations of a Scrapy crawler? The purpose: the site requires logging in and then maintains the session via cookies, and I'd rather reuse the session than log in again every time.

Answer 1: Please refer to the docs about cookies: the FAQ entry and CookiesMiddleware. Alternatively, you can send Request objects with cookies you manage yourself (you can read cookies from the headers of Response objects). See the documentation on Request and Response objects.

Source: https://stackoverflow.com/questions
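A minimal sketch of the "manage cookies yourself" option the answer mentions: persist cookies to a file when responses arrive and re-send them through the Request cookies argument on the next run. The file name, site and cookie-parsing details are assumptions.

import json
import os
import scrapy


class SessionSpider(scrapy.Spider):
    name = "session"
    cookie_file = "cookies.json"  # hypothetical persistence location
    start_urls = ["https://example.com/account"]

    def start_requests(self):
        cookies = {}
        if os.path.exists(self.cookie_file):
            with open(self.cookie_file) as f:
                cookies = json.load(f)
        for url in self.start_urls:
            # Cookies passed here are merged into Scrapy's cookie jar.
            yield scrapy.Request(url, cookies=cookies, callback=self.parse)

    def parse(self, response):
        # Read any Set-Cookie headers the server returned and persist them.
        saved = {}
        for header in response.headers.getlist("Set-Cookie"):
            name, _, value = header.decode("utf-8").split(";", 1)[0].partition("=")
            saved[name] = value
        if saved:
            with open(self.cookie_file, "w") as f:
                json.dump(saved, f)
        yield {"url": response.url}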

Scrapy gets stuck crawling a long list of urls

≡放荡痞女 submitted on 2021-01-28 02:06:51
Question: I am scraping a large list of URLs (1000-ish) and after a set time the crawler gets stuck, crawling 0 pages/min. The problem always occurs at the same spot in the crawl. The list of URLs is retrieved from a MySQL database. I am fairly new to Python and Scrapy, so I don't know where to start debugging, and I fear that due to my inexperience the code itself is also a bit of a mess. Any pointers to where the issue lies are appreciated. I used to retrieve the entire list of urls in one go,
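There is no accepted fix shown above, but as a sketch of one common restructuring for this situation: stream the URLs from MySQL lazily in start_requests instead of loading them all at once, set a download timeout, and attach an errback so a failing request is logged rather than silently stalling the crawl. Table, column and credential values are placeholders.

import pymysql
import scrapy


class UrlListSpider(scrapy.Spider):
    name = "url_list"
    custom_settings = {"DOWNLOAD_TIMEOUT": 30}  # fail fast on hanging responses

    def start_requests(self):
        db = pymysql.connect(host="localhost", user="root",
                             password="***", database="crawl")
        with db.cursor() as cursor:
            cursor.execute("SELECT url FROM urls")  # hypothetical table/column
            for (url,) in cursor:
                yield scrapy.Request(url, callback=self.parse,
                                     errback=self.on_error)
        db.close()

    def parse(self, response):
        yield {"url": response.url, "status": response.status}

    def on_error(self, failure):
        # Log failures instead of letting them vanish; this helps locate the
        # "spot" where the crawl stalls.
        self.logger.warning("Request failed: %r", failure.request.url)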

How to check if a website supports http, https and the www prefix with Scrapy

假装没事ソ submitted on 2021-01-28 01:14:02
Question: I am using Scrapy to check whether some website works fine when I use http://example.com, https://example.com or http://www.example.com. When I create a Scrapy request, it works fine; for example, my page1.com is always redirected to https://. I need to get this information as a return value, or is there a better way to get this information using Scrapy?

class myspider(scrapy.Spider):
    name = 'superspider'
    start_urls = [
        "https://page1.com/"
    ]

    def start_requests(self):
        for url in self
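A sketch of one way to get this back as crawl output: request all three URL variants and report the final URL plus the redirect chain that Scrapy's RedirectMiddleware records in response.meta['redirect_urls']. The domain is a placeholder.

import scrapy


class PrefixCheckSpider(scrapy.Spider):
    name = "prefix_check"
    domain = "page1.com"  # hypothetical domain to probe

    def start_requests(self):
        variants = [
            f"http://{self.domain}/",
            f"https://{self.domain}/",
            f"http://www.{self.domain}/",
        ]
        for url in variants:
            # dont_filter keeps the dupefilter from dropping redirects that
            # land on an already-requested variant.
            yield scrapy.Request(url, callback=self.parse,
                                 meta={"requested_url": url},
                                 dont_filter=True)

    def parse(self, response):
        yield {
            "requested_url": response.meta["requested_url"],
            "final_url": response.url,
            "redirects": response.meta.get("redirect_urls", []),
            "status": response.status,
        }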

Scrapy Pipeline doesn't insert into MySQL

巧了我就是萌 submitted on 2021-01-27 17:55:39
Question: I'm trying to build a small app for a university project with Scrapy. The spider is scraping the items, but my pipeline is not inserting the data into the MySQL database. To test whether it is the pipeline or the pymysql implementation that is not working, I wrote a test script:

Code Start

#!/usr/bin/python3
import pymysql

str1 = "hey"
str2 = "there"
str3 = "little"
str4 = "script"

db = pymysql.connect("localhost", "root", "**********", "stromtarife")
cursor = db.cursor()
cursor.execute(
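For the pipeline side, a minimal sketch of a pymysql-backed item pipeline; the table name and columns are hypothetical. The most frequent cause of "the spider scrapes but nothing lands in MySQL" is a missing commit, so it is called explicitly after each insert.

import pymysql


class MySQLPipeline:
    # Remember to enable this class in ITEM_PIPELINES in settings.py.

    def open_spider(self, spider):
        self.db = pymysql.connect(host="localhost", user="root",
                                  password="**********", database="stromtarife")
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        self.cursor.execute(
            "INSERT INTO tariffs (name, price) VALUES (%s, %s)",  # hypothetical table
            (item.get("name"), item.get("price")),
        )
        self.db.commit()  # without this, the INSERT is never persisted
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.db.close()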