web-crawler

Scrapy - Understanding CrawlSpider and LinkExtractor

泄露秘密 submitted on 2019-12-03 15:03:38
Question: I'm trying to use CrawlSpider and understand the following example in the Scrapy docs: import scrapy from scrapy.spiders import CrawlSpider, Rule from scrapy.linkextractors import LinkExtractor class MySpider(CrawlSpider): name = 'example.com' allowed_domains = ['example.com'] start_urls = ['http://www.example.com'] rules = ( # Extract links matching 'category.php' (but not matching 'subsection.php') # and follow links from them (since no callback means follow=True by default). Rule
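The excerpt cuts off partway through the rules tuple. For context, here is a completed version of that CrawlSpider example, lightly adapted from the Scrapy documentation (the parse_item body and its selectors are illustrative, not verbatim from the docs):

```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not 'subsection.php')
        # and follow them; no callback means follow=True by default.
        Rule(LinkExtractor(allow=(r'category\.php',), deny=(r'subsection\.php',))),

        # Extract links matching 'item.php' and hand them to parse_item.
        Rule(LinkExtractor(allow=(r'item\.php',)), callback='parse_item'),
    )

    def parse_item(self, response):
        # Callback for item pages matched by the second rule.
        self.logger.info('Item page: %s', response.url)
        yield {
            'id': response.xpath('//td[@id="item_id"]/text()').re_first(r'ID: (\d+)'),
            'name': response.xpath('//td[@id="item_name"]/text()').get(),
        }
```

The point the question is circling around: a Rule with no callback is only used to discover and follow links, while a Rule with a callback sends matching pages to that method and does not follow further links from them unless follow=True is set explicitly.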

scrapyd-client command not found

拥有回忆 submitted on 2019-12-03 14:53:16
I had just installed scrapyd-client (1.1.0) in a virtualenv and ran the 'scrapyd-deploy' command successfully, but when I run 'scrapyd-client' the terminal says: command not found: scrapyd-client. According to the readme file ( https://github.com/scrapy/scrapyd-client ), there should be a 'scrapyd-client' command. I checked the path '/lib/python2.7/site-packages/scrapyd-client'; only 'scrapyd-deploy' is in the folder. Has the 'scrapyd-client' command been removed for now? Create a fresh environment and install scrapyd-client first using the below: pip install git+https://github.com/scrapy/scrapyd

How to properly use Rules, restrict_xpaths to crawl and parse URLs with scrapy?

雨燕双飞 submitted on 2019-12-03 14:48:33
Question: I am trying to program a crawl spider to crawl the RSS feeds of a website and then parse the meta tags of the articles. The first RSS page is a page that displays the RSS categories. I managed to extract the link because the <a> tag is inside a <td> tag. It looks like this: <tr> <td class="xmlLink"> <a href="http://feeds.example.com/subject1">subject1</a> </td> </tr> <tr> <td class="xmlLink"> <a href="http://feeds.example.com/subject2">subject2</a> </td> </tr> Once you click that link it brings you to the
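One way to wire this up, sketched under the assumptions that the category page is plain HTML like the snippet above and that each feed is an RSS 2.0-style document where every <item> has a <link> element; the spider name, domains, start URL, and the meta-tag names in parse_article are all placeholders:

```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class RssMetaSpider(CrawlSpider):
    name = 'rss_meta'
    allowed_domains = ['example.com', 'feeds.example.com']
    start_urls = ['http://www.example.com/rss-categories']

    rules = (
        # restrict_xpaths limits extraction to links found inside the
        # <td class="xmlLink"> cells, i.e. the per-subject feed URLs.
        Rule(
            LinkExtractor(restrict_xpaths='//td[@class="xmlLink"]'),
            callback='parse_feed',
        ),
    )

    def parse_feed(self, response):
        # The feed itself is XML, so pull article URLs out of <item><link>
        # elements and request each article page to read its meta tags.
        response.selector.remove_namespaces()
        for url in response.xpath('//item/link/text()').getall():
            yield scrapy.Request(url, callback=self.parse_article)

    def parse_article(self, response):
        # Hypothetical meta tags; adjust to whatever the real site exposes.
        yield {
            'url': response.url,
            'title': response.xpath('//meta[@property="og:title"]/@content').get(),
            'description': response.xpath('//meta[@name="description"]/@content').get(),
        }
```

The rule only handles the HTML category listing; the feed XML is parsed by hand in parse_feed, since LinkExtractor is built for HTML links rather than RSS elements.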

Malicious crawler blocker for ASP.NET

ⅰ亾dé卋堺 submitted on 2019-12-03 14:31:30
I have just stumbled upon Bad Behavior - a PHP plugin that promises to detect spam and malicious crawlers and prevent them from accessing the site at all. Does something similar exist for ASP.NET and ASP.NET MVC? I am interested in blocking access to the site altogether, not in detecting spam after it has been posted. EDIT: I am interested specifically in solutions that detect access patterns to the site - these would prevent screen scraping the site as a whole, or at least make it a very slow process for the offender, because the bot would have to act as a regular user in frequency of

Scrapy Spider for JSON Response

北慕城南 submitted on 2019-12-03 14:14:50
Question: I am trying to write a spider which crawls through the following JSON response: http://gdata.youtube.com/feeds/api/standardfeeds/UK/most_popular?v=2&alt=json How would the spider look if I wanted to crawl all the titles of the videos? None of my spiders work. from scrapy.spider import BaseSpider import json from youtube.items import YoutubeItem class MySpider(BaseSpider): name = "youtubecrawler" allowed_domains = ["gdata.youtube.com"] start_urls = ['http://www.gdata.youtube.com/feeds/api
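The excerpt cuts off before the parse method, which is where the JSON handling would go. A minimal sketch of one way to do it, written against modern Scrapy (scrapy.Spider instead of the deprecated BaseSpider) and assuming a YoutubeItem with a single title field; note that the gdata v2 API has long since been retired, so the URL itself no longer returns data:

```python
import json

import scrapy


class YoutubeItem(scrapy.Item):
    title = scrapy.Field()


class MySpider(scrapy.Spider):
    name = 'youtubecrawler'
    allowed_domains = ['gdata.youtube.com']
    start_urls = [
        'http://gdata.youtube.com/feeds/api/standardfeeds/UK/most_popular?v=2&alt=json'
    ]

    def parse(self, response):
        # The gdata "alt=json" format nests text values under a '$t' key,
        # so each video title lives at entry['title']['$t'].
        data = json.loads(response.text)
        for entry in data['feed']['entry']:
            item = YoutubeItem()
            item['title'] = entry['title']['$t']
            yield item
```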

Linking together >100K pages without getting SEO penalized

谁说我不能喝 submitted on 2019-12-03 13:51:07
I'm making a site which will have reviews of the privacy policies of hundreds of thousands of other sites on the internet. Its initial content is based on running through the CommonCrawl 5-billion-page web dump and analyzing all the privacy policies with a script to identify certain characteristics (e.g. "Sells your personal info"). According to the SEOmoz Beginner's Guide to SEO: Search engines tend to only crawl about 100 links on any given page. This loose restriction is necessary to keep down on spam and conserve rankings. I was wondering what would be a smart way to create a web of
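A quick back-of-the-envelope check of what that roughly-100-links-per-page guideline implies (my own arithmetic, not part of the question): if every index page links out to about 100 child pages, then N pages are reachable within about ceil(log100 N) clicks of the home page, so hundreds of thousands of review pages fit comfortably within three levels of category/index pages.

```python
# Rough arithmetic only, not SEO advice: with ~100 outgoing links per page,
# how many clicks from the home page are needed before every one of N pages
# is reachable?
import math


def clicks_needed(total_pages: int, links_per_page: int = 100) -> int:
    return math.ceil(math.log(total_pages) / math.log(links_per_page))


for n in (100_000, 500_000):
    print(f"{n:,} pages -> reachable within {clicks_needed(n)} clicks")
```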

How to collect data from multiple pages into single data structure with scrapy

回眸只為那壹抹淺笑 submitted on 2019-12-03 12:59:59
Question: I am trying to scrape data from a site. The data is structured as multiple objects, each with a set of data. For example, people with names, ages, and occupations. My problem is that this data is split across two levels of the website. The first page is, say, a list of names and ages with a link to each person's profile page. Their profile page lists their occupation. I already have a spider written with Scrapy in Python which can collect the data from the top layer and crawl through multiple
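The usual Scrapy pattern for this is to build the partial record in the listing callback, pass it along with the request for the profile page, and finish it there, so each person comes out as a single item. A sketch with made-up selectors, URLs, and field names:

```python
import scrapy


class PeopleSpider(scrapy.Spider):
    name = 'people'
    start_urls = ['http://www.example.com/people']

    def parse(self, response):
        for row in response.css('div.person'):
            # Collect the fields available on the listing page.
            person = {
                'name': row.css('span.name::text').get(),
                'age': row.css('span.age::text').get(),
            }
            profile_url = row.css('a::attr(href)').get()
            # Hand the half-built dict to the next callback via cb_kwargs.
            yield response.follow(profile_url, callback=self.parse_profile,
                                  cb_kwargs={'person': person})

    def parse_profile(self, response, person):
        # Fill in the field that only exists on the second level, then emit
        # the completed record as a single item.
        person['occupation'] = response.css('span.occupation::text').get()
        yield person
```

On Scrapy versions older than 1.7, request.meta plays the role of cb_kwargs.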

crawl links of sitemap.xml through wget command

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-03 12:56:31
I am trying to crawl all the links in a sitemap.xml to re-cache a website. But the recursive option of wget does not work; I only get the response: Remote file exists but does not contain any link -- not retrieving. But the sitemap.xml is certainly full of "http://..." links. I tried almost every option of wget, but nothing worked for me: wget -r --mirror http://mysite.com/sitemap.xml Does anyone know how to open all the links inside a website's sitemap.xml? Thanks, Dominic It seems that wget can't parse XML, so you'll have to extract the links manually. You could do something like this: wget --quiet http:/
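Since wget will not follow links inside XML, an alternative sketch in Python: pull the <loc> entries out of the sitemap and request each one to warm the cache. The sitemap URL is a placeholder, and a standard sitemaps.org-style sitemap is assumed.

```python
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = 'http://mysite.com/sitemap.xml'
NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

# Download and parse the sitemap.
with urllib.request.urlopen(SITEMAP_URL) as resp:
    tree = ET.parse(resp)

# Fetch every <url><loc> entry; a plain GET is enough to re-populate
# most page caches.
for loc in tree.findall('.//sm:url/sm:loc', NS):
    url = loc.text.strip()
    try:
        with urllib.request.urlopen(url) as page:
            print(page.status, url)
    except Exception as exc:
        print('failed', url, exc)
```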

Can I use WGET to generate a sitemap of a website given its URL?

强颜欢笑 submitted on 2019-12-03 12:55:51
Question: I need a script that can spider a website and return the list of all crawled pages in plain text or a similar format, which I will submit to search engines as a sitemap. Can I use WGET to generate a sitemap of a website? Or is there a PHP script that can do the same? Answer 1: wget --spider --recursive --no-verbose --output-file=wgetlog.txt http://somewebsite.com
sed -n "s@.\+ URL:\([^ ]\+\) .\+@\1@p" wgetlog.txt | sed "s@&@\&amp;@" > sedlog.txt
This creates a file called sedlog.txt that contains all
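If wget is not available, the same kind of plain-text URL list can be produced with a short Python crawler. A rough sketch only: it stays on one host, ignores robots.txt, follows only <a href> links, and somewebsite.com is a placeholder.

```python
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse

START = 'http://somewebsite.com/'


class LinkParser(HTMLParser):
    """Collects the href values of all <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


seen = {START}
queue = deque([START])
while queue:
    url = queue.popleft()
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            if 'text/html' not in resp.headers.get('Content-Type', ''):
                continue
            html = resp.read().decode('utf-8', errors='replace')
    except Exception:
        continue
    print(url)  # one crawled page per line
    parser = LinkParser()
    parser.feed(html)
    for href in parser.links:
        absolute = urldefrag(urljoin(url, href))[0]
        if urlparse(absolute).netloc == urlparse(START).netloc and absolute not in seen:
            seen.add(absolute)
            queue.append(absolute)
```

Redirect the output to a file and you have the same sort of plain-text page list as the sedlog.txt produced by the wget/sed pipeline above.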

Best solution to host a crawler? [closed]

南楼画角 submitted on 2019-12-03 12:43:11
Question: (Closed as off-topic; no longer accepting answers.) I have a crawler that crawls a few different domains for new posts/content. The total amount of content is hundreds of thousands of pages, and a lot of new content is added each day. So to be able to crawl through all this content, I need my crawler to be crawling 24/7. Currently I host the crawler script on