web-crawler

The scrapy-redis program does not close automatically

元气小坏坏 submitted on 2019-12-11 17:31:52
Question: In the Scrapy-redis framework, all the requests stored in redis under xxx:requests have been crawled, but the program is still running. How do I stop the program automatically instead of letting it run forever? The running log:

2017-08-07 09:17:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-07 09:18:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

I use scrapy-redis to crawl a site, and scrapy-redis will…
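A common workaround (not taken from the truncated question itself) is a small Scrapy extension that listens for the spider_idle signal and closes the spider after it has been idle for several ticks, since scrapy-redis otherwise keeps the spider alive waiting for new keys in redis. A minimal sketch, assuming the default scrapy-redis scheduler; the setting name IDLE_NUMBER is hypothetical:

# extensions.py -- hedged sketch, not the asker's code
from scrapy import signals
from scrapy.exceptions import NotConfigured


class RedisIdleCloseExtension:
    """Close the spider once the spider_idle signal has fired `idle_number` times."""

    def __init__(self, idle_number, crawler):
        self.crawler = crawler
        self.idle_number = idle_number
        self.idle_count = 0

    @classmethod
    def from_crawler(cls, crawler):
        # IDLE_NUMBER is a hypothetical setting name; pick whatever fits your project
        idle_number = crawler.settings.getint('IDLE_NUMBER', 6)
        if not idle_number:
            raise NotConfigured
        ext = cls(idle_number, crawler)
        crawler.signals.connect(ext.spider_idle, signal=signals.spider_idle)
        return ext

    def spider_idle(self, spider):
        # scrapy-redis raises DontCloseSpider on idle, so the engine never stops on its own;
        # count the idle ticks here and close once the threshold is reached.
        # (A long-lived crawl would also reset this counter whenever items resume.)
        self.idle_count += 1
        if self.idle_count >= self.idle_number:
            self.crawler.engine.close_spider(spider, 'closespider_idle')

Enable it in settings.py via EXTENSIONS = {'myproject.extensions.RedisIdleCloseExtension': 500} and set IDLE_NUMBER to the number of idle ticks (roughly five seconds apart) you are willing to wait before shutting down.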

python ieee keywords extra

妖精的绣舞 submitted on 2019-12-11 17:23:11
Question: I want s to be a dictionary so I can choose the information I need. I tried your code. I have 7862 links for which I need to find the keyword and DOI, but for some of these 7862 links I found that I do not get the information I want, and an error is reported. How should I handle this situation? Here is my complete code. I think I found the cause: it is a regular expression error. It refers to the matching part; there is no exact match. For example, this can match https://ieeexplore.ieee.org…
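The excerpt suggests the crash comes from calling .group() on a regex match that does not exist for some of the 7862 pages. A hedged sketch of one defensive pattern that skips pages where the expected metadata is missing instead of raising; the regex and the assumption that the page embeds the DOI in a JSON blob are illustrative, not taken from the question:

import re
import requests

# Hypothetical pattern: assumes the page source contains a "doi":"..." entry.
DOI_RE = re.compile(r'"doi"\s*:\s*"([^"]+)"')

def extract_doi(url):
    html = requests.get(url, timeout=30).text
    match = DOI_RE.search(html)
    if match is None:
        # No exact match on this page -- return None instead of crashing on match.group(1)
        return None
    return match.group(1)

urls = [
    "https://ieeexplore.ieee.org/document/0000000",  # placeholder URL
]
for url in urls:
    doi = extract_doi(url)
    if doi is None:
        print(f"skipped (no match): {url}")
    else:
        print(url, doi)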

Automatically crawling a website

百般思念 submitted on 2019-12-11 17:18:37
Question: I got help from here to crawl law.go.kr with the code below. I'm trying to crawl other websites like http://lawbot.org, http://law.go.kr and https://casenote.kr, but the problem is that I have no understanding of HTML... I understood all the code and how to get the HTML address for the code below, but it's different on other websites... I want to know how to use the code below to crawl other web pages.

import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    # Using request get 50…
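The general recipe behind the truncated code is: fetch a page with requests, parse it with BeautifulSoup, and pick out elements with a selector you read off the target site's HTML in the browser's developer tools. A hedged sketch follows; the URL and selector are placeholders, since each of lawbot.org, law.go.kr and casenote.kr uses different markup:

import requests
from bs4 import BeautifulSoup

# Placeholder URL and selector -- inspect the real page with the browser's
# developer tools and substitute the element/class that wraps each case entry.
URL = "http://lawbot.org/"
SELECTOR = "div.case-title"   # hypothetical class name

response = requests.get(URL, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
for node in soup.select(SELECTOR):
    # get_text() strips the tags; adjust per site once you know its structure
    print(node.get_text(strip=True))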

Selenium & Scrapy: Last URL overwrites other URLs

送分小仙女□ submitted on 2019-12-11 17:08:43
Question: I am currently trying to crawl data from three websites (three different URLs). Therefore, I am using a text file to load the different URLs into start_urls. At the moment there are three URLs in my file, but the script only saves the data of the last URL, overwriting that of the two URLs before it. This is my code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Spider
from selenium import webdriver
from scrapy.selector import Selector
from scrapy.http import Request
from time import sleep
from…
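The truncated code is not enough to pin down the exact bug, but a frequent cause of "only the last URL survives" is sharing one item object or one output file across callbacks. A hedged sketch of reading start URLs from a text file and yielding one Request, with a fresh item dict per response; the file name, spider name and fields are hypothetical:

# -*- coding: utf-8 -*-
import scrapy


class MultiUrlSpider(scrapy.Spider):
    name = "multi_url"   # hypothetical spider name

    def start_requests(self):
        # 'urls.txt' is a placeholder path: one URL per line
        with open("urls.txt") as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Build a fresh dict per response instead of reusing one shared object,
        # otherwise later responses overwrite earlier data.
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }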

How to add (integrate) crawljax with crawler4j?

a 夏天 submitted on 2019-12-11 16:38:59
Question: I am working on a web crawler that fetches data from websites using crawler4j, and everything goes well, but the main problem is with ajax-based events. I found that the crawljax library handles this, but I couldn't figure out where and when to use it. At which point in the work sequence should I use it: before fetching the page with crawler4j, after fetching the page with crawler4j, or should I take the URLs coming from crawler4j and use them to fetch the Ajax data (page) with crawljax?

Answer 1: The crawljax library is basically a…

HttpWebRequest with multiple Set-Cookie

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-11 16:32:12
Question: I am trying to log in to WordPress using HttpWebRequest, but I am unable to do so. There are multiple Set-Cookie headers in the response, a few cookies are lost, and so the dashboard cannot be shown. I am able to log in using sockets, but as all my code is built on HttpWebRequest I cannot switch to sockets.

Response headers:
(Status-Line) HTTP/1.1 302 Moved Temporarily
Date: Wed, 23 Mar 2011 07:52:24 GMT
Server: Apache
X-Powered-By: PHP/5.2.17
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control…

StormCrawler: Timeout waiting for connection from pool

三世轮回 submitted on 2019-12-11 16:14:00
Question: We consistently get the following error when we increase either the number of threads or the number of executors for the Fetcher bolt:

org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:286) ~[stormjar.jar:?]
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:263) ~…

INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

偶尔善良 submitted on 2019-12-11 16:03:40
Question: I have just begun to learn Python and Scrapy. My first project is to crawl information on a website containing web security information. But when I run it from cmd, it says "Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)" and nothing seems to come out. I'd be grateful if someone could solve my problem. My code:

import scrapy

class SapoSpider(scrapy.Spider):
    name = "imo"
    allowed_domains = ["imovirtual.com"]
    start_urls = ["https://www.imovirtual.com/arrendar…
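Common causes of "Crawled 0 pages" are a start URL that redirects off allowed_domains (the redirect is then filtered silently) or a parse callback that never yields anything. A hedged sketch of the minimal shape such a spider needs; the full start URL path and the CSS selectors are placeholders, since the real item fields are cut off in the excerpt:

import scrapy


class SapoSpider(scrapy.Spider):
    name = "imo"
    allowed_domains = ["imovirtual.com"]
    # Placeholder path -- make sure the final URL after redirects stays on the allowed domain
    start_urls = ["https://www.imovirtual.com/arrendar/apartamento/lisboa/"]

    def parse(self, response):
        # parse() must yield items or follow-up Requests, otherwise the log
        # will keep showing "scraped 0 items" even when pages do load.
        for listing in response.css("article"):            # placeholder selector
            yield {
                "title": listing.css("span::text").get(),  # placeholder selector
                "url": response.url,
            }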

Website blocks Python crawler. Searching for ideas to avoid it

泄露秘密 submitted on 2019-12-11 15:37:16
Question: I want to crawl data from object pages on https://www.fewo-direkt.de (in the US: https://www.homeaway.com/), like this one: https://www.fewo-direkt.de/ferienwohnung-ferienhaus/p8735326. But when the crawler tries to load the page, it only gets a page with the code below. I think fewo-direkt blocks crawlers, but I don't know how, or whether there is a possible way to avoid it. Does anyone have an idea? Python, requests, BeautifulSoup - it works fine with other websites.

<html style="height:100%"> <head> <meta content=…
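The snippet in the question looks like an anti-bot interstitial rather than the real listing, which usually means the site rejects the default python-requests fingerprint. A hedged first step, with no guarantee it is enough for fewo-direkt (which may also require JavaScript or cookies), is to send browser-like headers:

import requests
from bs4 import BeautifulSoup

URL = "https://www.fewo-direkt.de/ferienwohnung-ferienhaus/p8735326"

# Browser-like headers; the values are illustrative, not a guaranteed bypass.
headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/96.0.4664.110 Safari/537.36"),
    "Accept-Language": "de-DE,de;q=0.9,en;q=0.8",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

session = requests.Session()
response = session.get(URL, headers=headers, timeout=30)
print(response.status_code)

soup = BeautifulSoup(response.text, "html.parser")
print(soup.title)  # if this still shows the block page, a headless browser may be needed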

How to scrape all possible results from a search bar of a website

ぃ、小莉子 submitted on 2019-12-11 15:29:48
Question: This is my first web scraping task. I have been tasked with scraping this website. It is a site that contains the names of lawyers in Denmark. My difficulty is that I can only retrieve names for the particular name query I put in the search bar. Is there an online web tool I can use to scrape all the names that the website contains? I have used tools like Import.io with no success so far. I am super confused about how all of this works.

Answer 1: Please scroll down to UPDATE 2. The website…
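Rather than a point-and-click tool, one workable approach (a hedged sketch only, since the site's real search endpoint, parameter names and result markup are not given in the excerpt) is to call the search endpoint directly and enumerate short queries, for example every two-letter prefix, deduplicating the names that come back:

import string
import time
import requests
from bs4 import BeautifulSoup

# Hypothetical endpoint and parameter name -- find the real ones in the browser's
# network tab when you submit the search form on the site.
SEARCH_URL = "https://example.dk/search"
PARAM = "name"

names = set()
for a in string.ascii_lowercase:
    for b in string.ascii_lowercase:
        query = a + b
        resp = requests.get(SEARCH_URL, params={PARAM: query}, timeout=30)
        if resp.status_code != 200:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        for node in soup.select("td.lawyer-name"):   # placeholder selector
            names.add(node.get_text(strip=True))
        time.sleep(1)  # be polite to the server

print(len(names), "unique names collected")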