web-crawler

What are the key considerations when creating a web crawler?

Submitted by 主宰稳场 on 2019-11-29 00:53:20
Question: I just started thinking about creating/customizing a web crawler today and know very little about web crawler/robot etiquette. Most of the writing on etiquette I've found seems old and awkward, so I'd like to get some current (and practical) insights from the web developer community. I want to use a crawler to walk over "the web" for a super simple purpose: "does the markup of site XYZ meet condition ABC?". This raises a lot of questions for me, but I think the two main questions I…
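A practical baseline for crawler etiquette is to identify the bot with a clear User-Agent, honor robots.txt, and rate-limit requests. Below is a minimal sketch of the robots.txt check using Python's standard-library urllib.robotparser; the bot name and URLs are placeholders, not anything from the question:

    import time
    from urllib.robotparser import RobotFileParser  # Python 3 standard library

    USER_AGENT = "markup-checker-bot"  # hypothetical bot name

    rp = RobotFileParser()
    rp.set_url("http://example.com/robots.txt")
    rp.read()

    url = "http://example.com/some/page.html"
    if rp.can_fetch(USER_AGENT, url):
        # fetch the page and check its markup here, then pause
        # between requests; fall back to 1 s if no Crawl-delay is set
        time.sleep(rp.crawl_delay(USER_AGENT) or 1)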

Python Scrapy on offline (local) data

Submitted by ≯℡__Kan透↙ on 2019-11-28 23:31:52
Question: I have a 270 MB dataset (10,000 HTML files) on my computer. Can I use Scrapy to crawl this dataset locally? How?

Answer: SimpleHTTPServer hosting — if you truly want to host the files locally and use Scrapy, you can serve them by navigating to the directory they're stored in and running Python's SimpleHTTPServer (port 8000 shown below):

    python -m SimpleHTTPServer 8000

Then just point Scrapy at http://127.0.0.1:8000 (for example via your spider's start_urls). An alternative is to have Scrapy read the set of files directly with file:// URLs:

    file:///home/sagi/html_files  # assuming you're on a *nix system

Wrapping up: once you've set…
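Building on the second option, here is a minimal sketch of a spider that reads local files through file:// URLs; the spider name, the exact file path, and the extracted fields are illustrative, not from the original answer:

    import scrapy

    class LocalHtmlSpider(scrapy.Spider):
        name = "local_html"
        # file:// URLs make Scrapy read straight from disk, no server needed
        start_urls = ["file:///home/sagi/html_files/page1.html"]

        def parse(self, response):
            # example check: pull the title out of each local file
            yield {"url": response.url,
                   "title": response.css("title::text").get()}

Run it with scrapy runspider local_html_spider.py (no project needed), or with scrapy crawl local_html inside a project.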

Running multiple spiders in Scrapy for one website in parallel?

Submitted by 此生再无相见时 on 2019-11-28 22:05:12
Question: I want to crawl a website with two parts, and my script is not as fast as I need it to be. Is it possible to launch two spiders, one for scraping the first part and the other for the second part? I tried having two different classes and running them:

    scrapy crawl firstSpider
    scrapy crawl secondSpider

but I think this is not smart. I read the documentation of scrapyd, but I don't know if it's good for my case.

Answer: I think what you are looking for is something like this:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...

    class MySpider2…
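A complete version of the pattern above, with the spider bodies left as stubs and hypothetical spider names, might look like this:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider1(scrapy.Spider):
        name = "spider1"
        # first spider definition (start_urls, parse, ...) goes here

    class MySpider2(scrapy.Spider):
        name = "spider2"
        # second spider definition goes here

    process = CrawlerProcess()
    process.crawl(MySpider1)
    process.crawl(MySpider2)
    process.start()  # blocks until both crawls are finished

Both spiders run in the same Twisted reactor, so they crawl concurrently within a single process; scrapyd is only needed if you want to schedule and manage crawl jobs as a service.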

How to stop page loading in Firefox programmatically?

Submitted by 醉酒当歌 on 2019-11-28 21:24:10
Question: I am running several tests with WebDriver and Firefox. I'm running into a problem with the following command:

    driver.get("http://www.google.com");

With this command, WebDriver blocks until the onload event is fired. While this normally takes seconds, it can take hours on websites that never finish loading. What I'd like to do is stop loading the page after a certain timeout, somehow simulating Firefox's Stop button. I first tried executing the following JS code every time I tried loading a page:

    var loadTimeout = setTimeout("window.stop();", 10000);

Unfortunately this doesn't work, probably…
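One approach that avoids injecting your own timer is Selenium's built-in page-load timeout, with window.stop() as a best-effort fallback. This is sketched below with Selenium's Python bindings rather than the asker's Java; the 10-second value and the target URL are just examples:

    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException

    driver = webdriver.Firefox()
    driver.set_page_load_timeout(10)  # stop waiting for onload after 10 s

    try:
        driver.get("http://www.google.com")
    except TimeoutException:
        # best-effort: ask the browser to abandon whatever is still loading
        driver.execute_script("window.stop();")

Whether the window.stop() call can run mid-load depends on the driver and Firefox versions, so treat it as a fallback rather than a guarantee.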

How to force Scrapy to crawl duplicate URLs?

Submitted by 最后都变了- on 2019-11-28 21:19:10
Question: I am learning Scrapy, a web-crawling framework. By default it does not crawl duplicate URLs, i.e., URLs Scrapy has already crawled. How do I make Scrapy crawl duplicate URLs, or URLs that have already been crawled? I tried to find help on the internet but could not find anything relevant. I found DUPEFILTER_CLASS = RFPDupeFilter and SgmlLinkExtractor from "Scrapy - Spider crawls duplicate urls", but that question is the opposite of what I am looking for.

Answer: You're probably looking for the dont_filter=True argument on Request(). See http://doc.scrapy.org/en/latest/topics/request-response.html#request-objects A more…
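A minimal illustration of dont_filter in use; the spider name, URL, and callbacks are placeholders:

    import scrapy

    class RecrawlSpider(scrapy.Spider):
        name = "recrawl"
        start_urls = ["http://example.com"]

        def parse(self, response):
            # dont_filter=True lets this request past the duplicates
            # filter even though the URL has already been seen
            yield scrapy.Request(response.url, callback=self.parse_again,
                                 dont_filter=True)

        def parse_again(self, response):
            self.logger.info("Crawled %s a second time", response.url)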

How do I lock read/write to MySQL tables so that I can select and then insert without other programs reading/writing to the database?

Submitted by |▌冷眼眸甩不掉的悲伤 on 2019-11-28 21:12:39
Question: I am running many instances of a web crawler in parallel. Each crawler selects a domain from a table, inserts that URL and a start time into a log table, and then starts crawling the domain. Other parallel crawlers check the log table to see which domains are already being crawled before selecting their own domain to crawl. I need to prevent other crawlers from selecting a domain that has just been selected by another crawler but doesn't have a log entry yet. My best guess at how to do this is to lock the database against all other reads/writes while one crawler selects a domain and inserts a row in…
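Locking the entire database is heavier than needed here: with InnoDB tables, a transaction plus SELECT ... FOR UPDATE locks just the candidate row, so competing crawlers running the same statement block until the claim is committed. A sketch under those assumptions, using the mysql-connector-python package and hypothetical table, column, and credential names:

    import mysql.connector  # assumes the mysql-connector-python package

    conn = mysql.connector.connect(user="crawler", password="...",
                                   database="crawldb")
    cur = conn.cursor()
    try:
        conn.start_transaction()
        # FOR UPDATE locks the selected row; other crawlers issuing the
        # same SELECT block here until this transaction commits
        cur.execute(
            "SELECT id, domain FROM domains "
            "WHERE id NOT IN (SELECT domain_id FROM crawl_log) "
            "LIMIT 1 FOR UPDATE"
        )
        row = cur.fetchone()
        if row:
            cur.execute(
                "INSERT INTO crawl_log (domain_id, started_at) "
                "VALUES (%s, NOW())",
                (row[0],),
            )
        conn.commit()  # releases the row lock
    except Exception:
        conn.rollback()
        raise

This keeps the select-then-insert atomic without blocking unrelated reads and writes elsewhere in the database.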

Is there any JavaScript web crawler framework? [closed]

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-11-28 20:36:13
Question (closed 6 years ago as not a good fit for the Q&A format, being likely to solicit debate or extended discussion rather than answers supported by facts): Is there any JavaScript web crawler framework?

Answer 1: There's a new framework that was just released for Node.js called spider. It uses…

Java Web Crawler Libraries

Submitted by ぐ巨炮叔叔 on 2019-11-28 20:34:19
Question: I wanted to make a Java-based web crawler for an experiment. I heard that making a web crawler in Java was the way to go if this is your first time. However, I have two important questions:

1. How will my program 'visit' or 'connect' to web pages? Please give a brief explanation. (I understand the basics of the layers of abstraction from the hardware up to the software; here I am interested in the Java abstractions.)
2. What libraries should I use? I would assume I need a library for connecting to…

Passing arguments to process.crawl in Scrapy python

Submitted by 不问归期 on 2019-11-28 20:25:27
Question: I would like to get the same result as this command line:

    scrapy crawl linkedin_anonymous -a first=James -a last=Bond -o output.json

My script is as follows:

    import scrapy
    from linkedin_anonymous_spider import LinkedInAnonymousSpider
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    spider = LinkedInAnonymousSpider(None, "James", "Bond")
    process = CrawlerProcess(get_project_settings())
    process.crawl(spider)  ## <-------------- (1)
    process.start()

I found out that process.crawl() in (1) is creating another LinkedInAnonymousSpider where first and…
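The usual fix is to pass the spider class (not an instance) to process.crawl() and let it forward the constructor arguments, which mirrors the -a options. A sketch based on the script above:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    from linkedin_anonymous_spider import LinkedInAnonymousSpider

    process = CrawlerProcess(get_project_settings())
    # pass the class; extra positional/keyword arguments are forwarded
    # to the spider's __init__, like -a first=James -a last=Bond
    process.crawl(LinkedInAnonymousSpider, None, "James", "Bond")
    process.start()

The -o output.json part has no direct equivalent in this script; it corresponds to the feed-export settings (e.g. FEED_FORMAT/FEED_URI in older Scrapy versions) supplied through the project settings.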

Can Scrapy be replaced by pyspider?

Submitted by 最后都变了- on 2019-11-28 20:15:09
Question: I've been using the Scrapy web-scraping framework pretty extensively, but recently I've discovered that there is another framework/system called pyspider, which, according to its GitHub page, is fresh, actively developed, and popular. pyspider's home page lists several things supported out of the box:

- Powerful WebUI with script editor, task monitor, project manager, and result viewer
- JavaScript pages supported!
- Task priority, retry, periodical and recrawl by age or marks in index page…