web-crawler

Typical politeness factor for a web crawler?

Submitted by 那年仲夏 on 2019-12-02 18:37:37
What is a typical politeness factor for a web crawler, apart from always obeying robots.txt (both the "Disallow:" and the non-standard "Crawl-delay:" directives)? If a site does not specify an explicit crawl-delay, what should the default value be set at? The algorithm we use is:

// If we are blocked by robots.txt
// Make sure it is obeyed.
// Our bot's user-agent string contains a link to an HTML page explaining this,
// and an email address to write to so that we never even consider their domain in the future.
// If we receive more than 5 consecutive responses with an HTTP response code of 500+ (or timeouts)
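A minimal sketch of that delay logic in Python: read Crawl-delay from robots.txt via urllib.robotparser, fall back to a default, and back off after consecutive errors. The 2-second default, the backoff factor, and the user-agent string are assumptions for illustration, not a quoted standard.

import time
import urllib.robotparser

DEFAULT_DELAY = 2.0        # assumed default when no Crawl-delay is given
USER_AGENT = "MyCrawler (+https://example.com/bot.html)"  # hypothetical UA

def polite_delay(robots_url: str, consecutive_errors: int) -> float:
    """Return how long to sleep before the next request to this host."""
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()

    # Use the site's Crawl-delay if given, otherwise the assumed default.
    delay = rp.crawl_delay(USER_AGENT) or DEFAULT_DELAY

    # Back off exponentially after consecutive 5xx responses or timeouts.
    if consecutive_errors:
        delay *= 2 ** min(consecutive_errors, 5)
    return delay

# Example: sleep before fetching the next page from example.com
time.sleep(polite_delay("https://example.com/robots.txt", consecutive_errors=0))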

Alternative to HtmlUnit

Submitted by 此生再无相见时 on 2019-12-02 18:14:49
I have been researching the headless browsers available to date and found that HtmlUnit is used pretty extensively. Is there any alternative to HtmlUnit with possible advantages compared to it? Thanks. Nayn: As far as I know, HtmlUnit is the most powerful headless browser. What are your issues with it? Sajid Hussain: There are many other libraries that you can use for this. If you need to scrape XML-based data, use JTidy. If you need to scrape specific data from HTML, you can use Jsoup. Well, I use jsoup - it's pretty much faster than any other API. WebDriver with a virtual
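Since the rest of this page is Python, a minimal Python sketch of the "WebDriver" route mentioned at the end of the answer: Selenium driving a headless Chrome instead of a Java headless browser. The URL is a placeholder, and the exact headless flag depends on the Chrome and Selenium versions installed.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")      # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
try:
    driver.get("http://example.com")    # placeholder URL
    html = driver.page_source           # fully rendered HTML, JavaScript included
finally:
    driver.quit()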

Identifying large bodies of text via BeautifulSoup or other python based extractors

Submitted by 我是研究僧i on 2019-12-02 17:48:16
Given some random news article, I want to write a web crawler to find the largest body of text present and extract it. The intention is to extract the physical news article on the page. The original plan was to use a BeautifulSoup findAll(True) command (which means extract all HTML tags) and sort each tag by its .getText() value. (EDIT: don't use BeautifulSoup for this HTML work; use the lxml library, it's Python-based and much faster.) But this won't work for most pages, like the one I listed as an example, because the large body of text is split into many smaller tags, like paragraph
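A minimal sketch of the lxml approach suggested in the edit, under the assumption that the article is the container holding the most paragraph text. Real extractors add heuristics such as link density, but this illustrates the scoring idea.

import lxml.html

def largest_text_block(html: str) -> str:
    doc = lxml.html.fromstring(html)
    best_text, best_len = "", 0
    for el in doc.iter("div", "article", "section", "td"):
        # Join the text of all paragraphs under this container element.
        text = " ".join(p.text_content() for p in el.iter("p"))
        if len(text) > best_len:
            best_text, best_len = text, len(text)
    return best_text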

How do Scrapy rules work with crawl spider

Submitted by 感情迁移 on 2019-12-02 17:36:38
I have a hard time understanding Scrapy crawl spider rules. I have an example that doesn't work as I would like it to, so it could be one of two things: I don't understand how rules work, or I formed an incorrect regex that prevents me from getting the results I need. OK, here is what I want to do: I want to write a crawl spider that will get all available statistics information from the http://www.euroleague.net website. The website page that hosts all the information that I need for the start is here. Step 1: The first step I am thinking of is to extract the "Seasons" link(s) and follow them. Here is the HTML/href that I am
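For reference, a minimal sketch of how CrawlSpider rules chain together: one Rule follows the season links without a callback, the next parses the statistics pages they lead to. The allow= patterns below are placeholders, not the real euroleague.net URL scheme.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class EuroleagueSpider(CrawlSpider):
    name = "euroleague"
    allowed_domains = ["euroleague.net"]
    start_urls = ["http://www.euroleague.net"]

    rules = (
        # Follow "Seasons" links but don't parse them (follow=True, no callback).
        Rule(LinkExtractor(allow=r"seasoncode="), follow=True),
        # Parse the statistics pages that the season pages link to.
        Rule(LinkExtractor(allow=r"stats"), callback="parse_stats"),
    )

    def parse_stats(self, response):
        # CrawlSpider reserves parse(); use a custom callback name instead.
        yield {"url": response.url}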

dynamic start_urls in scrapy

Submitted by 别等时光非礼了梦想. on 2019-12-02 17:19:04
I'm using Scrapy to crawl multiple pages on a site. The variable start_urls is used to define the pages to be crawled. I would initially start with the 1st page, thus defining start_urls = [1st page] in the file example_spider.py. Upon getting more info from the 1st page, I would determine the next pages to be crawled and then assign start_urls accordingly. Hence, I have to overwrite the above example_spider.py with changes to start_urls = [1st page, 2nd page, ..., Kth page], then run scrapy crawl again. Is that the best approach, or is there a better way to dynamically assign start_urls using scrapy
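A minimal sketch of the usual alternative to rewriting start_urls by hand: override start_requests() for the first page and yield new Requests from the callback as the next pages are discovered, all within one crawl. The URLs and the CSS selector are hypothetical.

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        # Only the 1st page is hard-coded; the rest are found while crawling.
        yield scrapy.Request("http://example.com/page/1", callback=self.parse)

    def parse(self, response):
        # ... extract items from this page ...
        for href in response.css("a.next-page::attr(href)").getall():
            # Scrapy's scheduler de-duplicates requests, so yielding liberally is safe.
            yield response.follow(href, callback=self.parse)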

How to 'Grab' content from another website

Submitted by 痴心易碎 on 2019-12-02 17:01:01
Question: A friend has asked me this, and I couldn't answer. He asked: I am making this site where you can archive your site... It works like this: you enter your site, like something.com, and then our site grabs the content on that website, like images and all that, and uploads it to our site. Then people can view an exact copy of the site at oursite.com/something.com even if the server that is hosting something.com is down. How could he do this? (PHP?) And what would be some requirements? Answer 1: It
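The question asks about PHP, but the idea is language-independent; for consistency with the rest of this page, a minimal Python sketch: fetch the page, save its images next to it, and rewrite the img tags to point at the local copies. The URLs and file layout are made up for illustration, and a real archiver would also handle CSS, scripts, and relative links.

import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def archive(url: str, out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    for img in soup.find_all("img", src=True):
        img_url = urljoin(url, img["src"])
        name = os.path.basename(urlparse(img_url).path) or "image"
        with open(os.path.join(out_dir, name), "wb") as f:
            f.write(requests.get(img_url, timeout=30).content)
        img["src"] = name  # point the archived page at the local copy

    with open(os.path.join(out_dir, "index.html"), "w", encoding="utf-8") as f:
        f.write(str(soup))

# archive("http://something.com", "archive/something.com")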

Rotating Proxies for web scraping

Submitted by 不问归期 on 2019-12-02 16:26:03
I've got a Python web crawler and I want to distribute the download requests among many different proxy servers, probably running Squid (though I'm open to alternatives). For example, it could work in a round-robin fashion, where request1 goes to proxy1, request2 to proxy2, and eventually it loops back around. Any idea how to set this up? To make it harder, I'd also like to be able to dynamically change the list of available proxies, bring some down, and add others. If it matters, IP addresses are assigned dynamically. Thanks :) Make your crawler keep a list of proxies and with each HTTP
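A minimal round-robin sketch with the requests library; the proxy addresses are placeholders. Dynamically changing the pool amounts to rebuilding the cycle from the updated list.

import itertools
import requests

proxies = [
    "http://proxy1.example.com:3128",
    "http://proxy2.example.com:3128",
]
proxy_pool = itertools.cycle(proxies)   # round-robin iterator over the list

def fetch(url: str) -> requests.Response:
    proxy = next(proxy_pool)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)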

How to crawl a website/extract data into database with python?

Submitted by 北慕城南 on 2019-12-02 16:25:36
I'd like to build a webapp to help other students at my university create their schedules. To do that I need to crawl the master schedules (one huge HTML page) as well as a link to a detailed description for each course into a database, preferably in Python. Also, I need to log in to access the data. How would that work? What tools/libraries can/should I use? Are there good tutorials on that? How do I best deal with binary data (e.g. pretty PDFs)? Are there already good solutions for that? Acorn: Use requests for downloading the pages. Here's an example of how to log in to a website and download
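A minimal sketch of the log-in-then-scrape-then-store flow with requests, BeautifulSoup, and sqlite3. The login URL, form field names, CSS selector, and table schema are placeholders that depend entirely on the university site.

import sqlite3

import requests
from bs4 import BeautifulSoup

session = requests.Session()
# Placeholder login form; a real site may need hidden fields or CSRF tokens.
session.post("https://university.example/login",
             data={"username": "me", "password": "secret"})

html = session.get("https://university.example/master-schedule").text
soup = BeautifulSoup(html, "html.parser")

conn = sqlite3.connect("schedule.db")
conn.execute("CREATE TABLE IF NOT EXISTS courses (title TEXT, detail_url TEXT)")
for row in soup.select("table.schedule tr"):
    link = row.find("a", href=True)
    if link:
        conn.execute("INSERT INTO courses VALUES (?, ?)",
                     (link.get_text(strip=True), link["href"]))
conn.commit()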

Scrapy - logging to file and stdout simultaneously, with spider names

Submitted by 柔情痞子 on 2019-12-02 15:58:29
I've decided to use the Python logging module because the messages generated by Twisted on stderr are too long, and I want INFO-level meaningful messages, such as those generated by the StatsCollector, to be written to a separate log file while maintaining the on-screen messages.

from twisted.python import log
import logging

logging.basicConfig(level=logging.INFO, filemode='w', filename='buyerlog.txt')
observer = log.PythonLoggingObserver()
observer.start()

Well, this is fine, I've got my messages, but the downside is that I do not know which spider generated the messages! This is my
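One possible way to get the spider name into each message and still log to both the file and the screen, assuming a Scrapy version where each spider's logger is named after the spider (newer releases expose this as spider.logger), so %(name)s carries the name:

import logging
import sys

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(name)s] %(levelname)s: %(message)s",
    handlers=[
        logging.FileHandler("buyerlog.txt", mode="w"),   # separate log file
        logging.StreamHandler(sys.stdout),               # keep on-screen output
    ],
)

# Inside a spider callback:
#     self.logger.info("parsed %s", response.url)
# which logs as: 2019-12-02 ... [myspider] INFO: parsed http://...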

Strategy for how to crawl/index frequently updated webpages?

Submitted by 亡梦爱人 on 2019-12-02 14:57:18
I'm trying to build a very small, niche search engine, using Nutch to crawl specific sites. Some of the sites are news/blog sites. If I crawl, say, techcrunch.com, and store and index their front page or any of their main pages, then within hours my index for that page will be out of date. Does a large search engine such as Google have an algorithm to re-crawl frequently updated pages very frequently, hourly even? Or does it just score frequently updated pages very low so they don't get returned? How can I handle this in my own application? Good question. This is actually an active topic in WWW
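A minimal sketch of one common heuristic for this (not Google's actual algorithm): re-fetch a page, compare a content hash, and shrink the revisit interval when the page changed or grow it when it did not, within fixed bounds. The interval limits are arbitrary examples.

import hashlib

MIN_INTERVAL = 3600          # revisit at most hourly
MAX_INTERVAL = 7 * 86400     # revisit at least weekly

def next_interval(old_hash: str, new_content: bytes, interval: float):
    """Return the new content hash and the next revisit interval in seconds."""
    new_hash = hashlib.sha1(new_content).hexdigest()
    if new_hash != old_hash:
        interval = max(MIN_INTERVAL, interval / 2)   # page changed: check sooner
    else:
        interval = min(MAX_INTERVAL, interval * 2)   # unchanged: back off
    return new_hash, interval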