web-crawler

How to limit number of followed pages per site in Python Scrapy

◇◆丶佛笑我妖孽 submitted on 2019-12-03 07:42:07
I am trying to build a spider that could efficiently scrape text information from many websites. Since I am a Python user I was referred to Scrapy. However, in order to avoid scraping huge websites, I want to limit the spider to scrape no more than 20 pages of a certain "depth" per website. Here is my spider:

    class DownloadSpider(CrawlSpider):
        name = 'downloader'
        download_path = '/home/MyProjects/crawler'
        rules = (Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),)

        def __init__(self, *args, **kwargs):
            super(DownloadSpider, self).__init__(*args, **kwargs)
            self.urls_file_path =
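One way to get a per-site limit (a minimal sketch, not the original poster's code; the spider name, start URL, and the 20-page figure are placeholders) is to combine Scrapy's DEPTH_LIMIT setting with a per-domain page counter in the callback:

    # Sketch: cap crawl depth globally and cap pages per domain with a counter.
    # The spider name, start URL and the limit of 20 are placeholder values.
    from collections import defaultdict
    from urllib.parse import urlparse

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class LimitedSpider(CrawlSpider):
        name = 'limited'
        start_urls = ['http://example.com']          # placeholder seed
        custom_settings = {'DEPTH_LIMIT': 2}         # do not follow links deeper than this
        rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

        max_pages_per_site = 20

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.pages_seen = defaultdict(int)       # pages scraped per domain

        def parse_item(self, response):
            domain = urlparse(response.url).netloc
            self.pages_seen[domain] += 1
            if self.pages_seen[domain] > self.max_pages_per_site:
                # Drop items over the limit; links from this domain may still be followed,
                # so a stricter cut-off would also filter requests in a middleware.
                return
            yield {'url': response.url,
                   'text': ' '.join(response.css('body ::text').getall())}

Scrapy also has a CLOSESPIDER_PAGECOUNT setting, but it applies to the crawl as a whole rather than per site.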

Apache Nutch 2.1 different batch id (null)

孤街浪徒 submitted on 2019-12-03 06:59:26
I crawl a few sites with Apache Nutch 2.1. While crawling, I see the following message on a lot of pages, e.g. Skipping http://www.domainname.com/news/subcategory/111111/index.html ; different batch id (null). What causes this error, and how can I resolve it? The pages with a different batch id (null) are not stored in the database. The site I crawled is based on Drupal, but I have tried many other non-Drupal sites as well. I think the message is not a problem in itself: a batch_id is not assigned to every URL, so any URL whose batch_id is null is skipped; such URLs are fetched later, once the generate step assigns them a batch_id. Source: https:/

Can I use WGET to generate a sitemap of a website given its URL?

一世执手 submitted on 2019-12-03 06:49:13
I need a script that can spider a website and return the list of all crawled pages in plain text or a similar format, which I will submit to search engines as a sitemap. Can I use wget to generate a sitemap of a website? Or is there a PHP script that can do the same?

    wget --spider --recursive --no-verbose --output-file=wgetlog.txt http://somewebsite.com
    sed -n "s@.\+ URL:\([^ ]\+\) .\+@\1@p" wgetlog.txt | sed "s@&@\&amp;@" > sedlog.txt

This creates a file called sedlog.txt that contains all links found on the specified website. You can use PHP or a shell script to convert the text file sitemap into an
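If you would rather build the XML yourself, here is a minimal Python sketch that converts the URL list produced above into a sitemap (the input name sedlog.txt matches the commands above; the output name sitemap.xml is an arbitrary choice):

    # Sketch: turn the plain-text URL list into a minimal XML sitemap.
    # Assumes one URL per line in sedlog.txt.
    from xml.sax.saxutils import escape

    with open('sedlog.txt') as f:
        urls = [line.strip() for line in f if line.strip()]

    with open('sitemap.xml', 'w') as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in urls:
            out.write('  <url><loc>%s</loc></url>\n' % escape(url))
        out.write('</urlset>\n')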

Web Crawler - Ignore Robots.txt file?

梦想的初衷 submitted on 2019-12-03 06:42:32
Some servers have a robots.txt file in order to stop web crawlers from crawling through their websites. Is there a way to make a web crawler ignore the robots.txt file? I am using Mechanize for Python.

The documentation for mechanize has this sample code:

    br = mechanize.Browser()
    ....
    # Ignore robots.txt. Do not do this without thought and consideration.
    br.set_handle_robots(False)

That does exactly what you want. This also looks like what you need:

    from mechanize import Browser
    br = Browser()
    # Ignore robots.txt
    br.set_handle_robots( False )

but make sure you know what you're doing… Source: https:/
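Putting it together, a self-contained sketch (the URL is a placeholder) looks like this:

    # Sketch: fetch a page with mechanize while ignoring robots.txt.
    # http://example.com is a placeholder URL.
    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)                      # ignore robots.txt
    br.addheaders = [('User-agent', 'Mozilla/5.0')]  # some sites also block the default UA
    response = br.open('http://example.com')
    html = response.read()
    print(len(html))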

How to crawl Facebook based on friendship information?

折月煮酒 submitted on 2019-12-03 06:28:10
Question: I'm a graduate student whose research area is complex networks. I am working on a project that involves analyzing connections between Facebook users. Is it possible to write a crawler for Facebook based on friendship information? I looked around but couldn't find anything useful so far; it seems Facebook isn't fond of such activity. Can I rely on the Facebook API? Update (Jan-08-2010): Thank you very much for the responses. I guess I probably need to contact Facebook directly then. Cheers. Update

Crawling Google Search with PHP

最后都变了- submitted on 2019-12-03 06:25:50
Question: I am trying to get my head around how to fetch Google search results with PHP or JavaScript. I know it has been possible before, but now I can't find a way. I am trying to duplicate (somewhat) the functionality of http://www.getupdated.se/sokmotoroptimering/seo-verktyg/kolla-ranking/ but really the core issue I want to solve is just getting the search results via PHP or JavaScript; the rest I can figure out. Fetching the results using file_get_contents() or cURL doesn't seem to work. Example:

    $ch =
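One commonly suggested alternative to scraping the result HTML is Google's Custom Search JSON API. A minimal Python sketch of that request follows (the API key and search-engine ID are placeholders you create in the Google developer console); the same HTTP call can be made from PHP with cURL:

    # Sketch: query Google's Custom Search JSON API instead of scraping result pages.
    # YOUR_API_KEY and YOUR_SEARCH_ENGINE_ID are placeholders.
    import requests

    params = {
        'key': 'YOUR_API_KEY',
        'cx': 'YOUR_SEARCH_ENGINE_ID',
        'q': 'site ranking check',
    }
    resp = requests.get('https://www.googleapis.com/customsearch/v1', params=params)
    resp.raise_for_status()
    for item in resp.json().get('items', []):
        print(item['title'], item['link'])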

Is it possible to develop a powerful web search engine using Erlang, Mnesia & Yaws?

岁酱吖の submitted on 2019-12-03 06:25:28
Question: I am thinking of developing a web search engine using Erlang, Mnesia & Yaws. Is it possible to make a powerful and fast web search engine using this software? What would it take to accomplish this, and what do I start with?

Answer 1: Erlang can make the most powerful web crawler today. Let me take you through my simple crawler.

Step 1. I create a simple parallelism module, which I call mapreduce:

    -module(mapreduce).
    -export([compute/2]).
    %%=================================================
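The Erlang module above is cut off. Purely as an illustration of the parallel-map idea it describes, here is a minimal Python sketch (the fetch function, seed URLs, and worker count are placeholders, not part of the original answer):

    # Sketch of the parallel-map idea the answer describes, in Python rather than Erlang.
    from multiprocessing import Pool
    from urllib.request import urlopen

    def fetch(url):
        # Download one page and return (url, size); errors are reported, not raised.
        try:
            with urlopen(url, timeout=10) as resp:
                return url, len(resp.read())
        except Exception as exc:
            return url, 'error: %s' % exc

    if __name__ == '__main__':
        urls = ['http://example.com', 'http://example.org']  # placeholder seed URLs
        with Pool(processes=4) as pool:                       # fetch pages in parallel
            for url, result in pool.map(fetch, urls):
                print(url, result)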

How to use Goutte

萝らか妹 submitted on 2019-12-03 06:11:59
Issue: I cannot fully understand the Goutte web scraper. Request: Can someone please help me understand, or provide code to help me better understand, how to use the Goutte web scraper? I have read over the README.md. I am looking for more information than it provides, such as what options are available in Goutte and how to write those options, or, when you are looking at forms, whether you search for the name= or the id= of the form. Webpage layout being scraped: Step 1: The webpage has a form with a radio button to choose what kind of form to fill out (i.e. Name or License). It is

unknown command: crawl error

大兔子大兔子 submitted on 2019-12-03 05:42:17
Question: I am a newbie to Python. I am running the 32-bit build of Python 2.7.3 on a 64-bit OS (I tried 64-bit but it didn't work out). I followed the tutorial and installed Scrapy on my machine. I have created one project, demoz, but when I enter scrapy crawl demoz it shows an error. I came across this when I ran the scrapy command under C:\python27\scripts; it shows:

    C:\Python27\Scripts>scrapy
    Scrapy 0.14.2 - no active project
    Usage: scrapy <command> [options] [args]
    Available commands:
      fetch    Fetch a

Is there a list of known web crawlers? [closed]

℡╲_俬逩灬. submitted on 2019-12-03 05:27:45
Closed. This question is off-topic and is not currently accepting answers.

I'm trying to get accurate download numbers for some files on a web server. I look at the user agents: some are clearly bots or web crawlers, but for many I'm not sure. They may or may not be web crawlers, and they are causing many downloads, so it's important for me to know. Is there somewhere a list of known web crawlers with some documentation like user agent, IPs, behavior, etc.? I'm not interested in the