web-crawler

Scrapy - Understanding CrawlSpider and LinkExtractor

泄露秘密 submitted on 2019-12-03 15:03:38
Question: I'm trying to use CrawlSpider and understand the following example in the Scrapy docs: import scrapy from scrapy.spiders import CrawlSpider, Rule from scrapy.linkextractors import LinkExtractor class MySpider(CrawlSpider): name = 'example.com' allowed_domains = ['example.com'] start_urls = ['http://www.example.com'] rules = ( # Extract links matching 'category.php' (but not matching 'subsection.php') # and follow links from them (since no callback means follow=True by default). Rule
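The excerpt cuts off partway through the rules tuple. For context, here is a completed version of that CrawlSpider example, lightly adapted from the Scrapy documentation (the parse_item body and its selectors are illustrative, not verbatim from the docs):

```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not 'subsection.php')
        # and follow them; no callback means follow=True by default.
        Rule(LinkExtractor(allow=(r'category\.php',), deny=(r'subsection\.php',))),

        # Extract links matching 'item.php' and hand them to parse_item.
        Rule(LinkExtractor(allow=(r'item\.php',)), callback='parse_item'),
    )

    def parse_item(self, response):
        # Callback for item pages matched by the second rule.
        self.logger.info('Item page: %s', response.url)
        yield {
            'id': response.xpath('//td[@id="item_id"]/text()').re_first(r'ID: (\d+)'),
            'name': response.xpath('//td[@id="item_name"]/text()').get(),
        }
```

The point the question is circling around: a Rule with no callback is only used to discover and follow links, while a Rule with a callback sends matching pages to that method and does not follow further links from them unless follow=True is set explicitly.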

scrapyd-client command not found

拥有回忆 submitted on 2019-12-03 14:53:16
I had just installed scrapyd-client (1.1.0) in a virtualenv and ran the 'scrapyd-deploy' command successfully, but when I run 'scrapyd-client' the terminal says: command not found: scrapyd-client. According to the readme file ( https://github.com/scrapy/scrapyd-client ), there should be a 'scrapyd-client' command. I checked the path '/lib/python2.7/site-packages/scrapyd-client'; only 'scrapyd-deploy' is in the folder. Has the 'scrapyd-client' command been removed for now? Create a fresh environment and install scrapyd-client first using the below: pip install git+https://github.com/scrapy/scrapyd

How to properly use Rules, restrict_xpaths to crawl and parse URLs with scrapy?

雨燕双飞 submitted on 2019-12-03 14:48:33
Question: I am trying to program a crawl spider to crawl the RSS feeds of a website and then parse the meta tags of the articles. The first RSS page is a page that displays the RSS categories. I managed to extract the link because the <a> tag is inside a <td> tag. It looks like this: <tr> <td class="xmlLink"> <a href="http://feeds.example.com/subject1">subject1</a> </td> </tr> <tr> <td class="xmlLink"> <a href="http://feeds.example.com/subject2">subject2</a> </td> </tr> Once you click that link it brings you to the
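One way to wire this up, sketched under the assumptions that the category page is plain HTML like the snippet above and that each feed is an RSS 2.0-style document where every <item> has a <link> element; the spider name, domains, start URL, and the meta-tag names in parse_article are all placeholders:

```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class RssMetaSpider(CrawlSpider):
    name = 'rss_meta'
    allowed_domains = ['example.com', 'feeds.example.com']
    start_urls = ['http://www.example.com/rss-categories']

    rules = (
        # restrict_xpaths limits extraction to links found inside the
        # <td class="xmlLink"> cells, i.e. the per-subject feed URLs.
        Rule(
            LinkExtractor(restrict_xpaths='//td[@class="xmlLink"]'),
            callback='parse_feed',
        ),
    )

    def parse_feed(self, response):
        # The feed itself is XML, so pull article URLs out of <item><link>
        # elements and request each article page to read its meta tags.
        response.selector.remove_namespaces()
        for url in response.xpath('//item/link/text()').getall():
            yield scrapy.Request(url, callback=self.parse_article)

    def parse_article(self, response):
        # Hypothetical meta tags; adjust to whatever the real site exposes.
        yield {
            'url': response.url,
            'title': response.xpath('//meta[@property="og:title"]/@content').get(),
            'description': response.xpath('//meta[@name="description"]/@content').get(),
        }
```

The rule only handles the HTML category listing; the feed XML is parsed by hand in parse_feed, since LinkExtractor is built for HTML links rather than RSS elements.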

Malicious crawler blocker for ASP.NET

ⅰ亾dé卋堺 submitted on 2019-12-03 14:31:30
I have just stumbled upon Bad Behavior - a PHP plugin that promises to detect spam and malicious crawlers and prevent them from accessing the site at all. Does something similar exist for ASP.NET and ASP.NET MVC? I am interested in blocking access to the site altogether, not in detecting spam after it has been posted. EDIT: I am interested specifically in solutions that detect access patterns to the site - these would prevent screen scraping the site as a whole, or at least make it a very slow process for the offender, because the bot would have to act as a regular user in frequency of

Scrapy Spider for JSON Response

北慕城南 submitted on 2019-12-03 14:14:50
Question: I am trying to write a spider which crawls through the following JSON response: http://gdata.youtube.com/feeds/api/standardfeeds/UK/most_popular?v=2&alt=json How would the spider look if I wanted to crawl all the titles of the videos? None of my spiders work. from scrapy.spider import BaseSpider import json from youtube.items import YoutubeItem class MySpider(BaseSpider): name = "youtubecrawler" allowed_domains = ["gdata.youtube.com"] start_urls = ['http://www.gdata.youtube.com/feeds/api
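The excerpt cuts off before the parse method, which is where the JSON handling would go. A minimal sketch of one way to do it, written against modern Scrapy (scrapy.Spider instead of the deprecated BaseSpider) and assuming a YoutubeItem with a single title field; note that the gdata v2 API has long since been retired, so the URL itself no longer returns data:

```python
import json

import scrapy


class YoutubeItem(scrapy.Item):
    title = scrapy.Field()


class MySpider(scrapy.Spider):
    name = 'youtubecrawler'
    allowed_domains = ['gdata.youtube.com']
    start_urls = [
        'http://gdata.youtube.com/feeds/api/standardfeeds/UK/most_popular?v=2&alt=json'
    ]

    def parse(self, response):
        # The gdata "alt=json" format nests text values under a '$t' key,
        # so each video title lives at entry['title']['$t'].
        data = json.loads(response.text)
        for entry in data['feed']['entry']:
            item = YoutubeItem()
            item['title'] = entry['title']['$t']
            yield item
```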

Linking together >100K pages without getting SEO penalized

谁说我不能喝 submitted on 2019-12-03 13:51:07
I'm making a site which will have reviews of the privacy policies of hundreds of thousands of other sites on the internet. Its initial content is based on running through the CommonCrawl 5-billion-page web dump and analyzing all the privacy policies with a script to identify certain characteristics (e.g. "Sells your personal info"). According to the SEOmoz Beginner's Guide to SEO: Search engines tend to only crawl about 100 links on any given page. This loose restriction is necessary to keep down on spam and conserve rankings. I was wondering what would be a smart way to create a web of
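A quick back-of-the-envelope check of what that roughly-100-links-per-page guideline implies (my own arithmetic, not part of the question): if every index page links out to about 100 child pages, then N pages are reachable within about ceil(log100 N) clicks of the home page, so hundreds of thousands of review pages fit comfortably within three levels of category/index pages.

```python
# Rough arithmetic only, not SEO advice: with ~100 outgoing links per page,
# how many clicks from the home page are needed before every one of N pages
# is reachable?
import math


def clicks_needed(total_pages: int, links_per_page: int = 100) -> int:
    return math.ceil(math.log(total_pages) / math.log(links_per_page))


for n in (100_000, 500_000):
    print(f"{n:,} pages -> reachable within {clicks_needed(n)} clicks")
```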

How to collect data from multiple pages into single data structure with scrapy

回眸只為那壹抹淺笑 submitted on 2019-12-03 12:59:59
Question: I am trying to scrape data from a site. The data is structured as multiple objects, each with a set of data. For example, people with names, ages, and occupations. My problem is that this data is split across two levels of the website. The first page is, say, a list of names and ages with a link to each person's profile page. Their profile page lists their occupation. I already have a spider written with Scrapy in Python which can collect the data from the top layer and crawl through multiple
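The usual Scrapy pattern for this is to build the partial record in the listing callback, pass it along with the request for the profile page, and finish it there, so each person comes out as a single item. A sketch with made-up selectors, URLs, and field names:

```python
import scrapy


class PeopleSpider(scrapy.Spider):
    name = 'people'
    start_urls = ['http://www.example.com/people']

    def parse(self, response):
        for row in response.css('div.person'):
            # Collect the fields available on the listing page.
            person = {
                'name': row.css('span.name::text').get(),
                'age': row.css('span.age::text').get(),
            }
            profile_url = row.css('a::attr(href)').get()
            # Hand the half-built dict to the next callback via cb_kwargs.
            yield response.follow(profile_url, callback=self.parse_profile,
                                  cb_kwargs={'person': person})

    def parse_profile(self, response, person):
        # Fill in the field that only exists on the second level, then emit
        # the completed record as a single item.
        person['occupation'] = response.css('span.occupation::text').get()
        yield person
```

On Scrapy versions older than 1.7, request.meta plays the role of cb_kwargs.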

crawl links of sitemap.xml through wget command

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-03 12:56:31
I am trying to crawl all the links in a sitemap.xml to re-cache a website. But the recursive option of wget does not work; I only get the response: Remote file exists but does not contain any link -- not retrieving. But the sitemap.xml is certainly full of "http://..." links. I tried almost every option of wget, but nothing worked for me: wget -r --mirror http://mysite.com/sitemap.xml Does anyone know how to open all the links inside a website's sitemap.xml? Thanks, Dominic It seems that wget can't parse XML, so you'll have to extract the links manually. You could do something like this: wget --quiet http:/
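Since wget will not follow links inside XML, an alternative sketch in Python: pull the <loc> entries out of the sitemap and request each one to warm the cache. The sitemap URL is a placeholder, and a standard sitemaps.org-style sitemap is assumed.

```python
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = 'http://mysite.com/sitemap.xml'
NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

# Download and parse the sitemap.
with urllib.request.urlopen(SITEMAP_URL) as resp:
    tree = ET.parse(resp)

# Fetch every <url><loc> entry; a plain GET is enough to re-populate
# most page caches.
for loc in tree.findall('.//sm:url/sm:loc', NS):
    url = loc.text.strip()
    try:
        with urllib.request.urlopen(url) as page:
            print(page.status, url)
    except Exception as exc:
        print('failed', url, exc)
```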

Can I use WGET to generate a sitemap of a website given its URL?

强颜欢笑 submitted on 2019-12-03 12:55:51
Question: I need a script that can spider a website and return the list of all crawled pages in plain text or a similar format, which I will submit to search engines as a sitemap. Can I use WGET to generate a sitemap of a website? Or is there a PHP script that can do the same? Answer 1: wget --spider --recursive --no-verbose --output-file=wgetlog.txt http://somewebsite.com
sed -n "s@.\+ URL:\([^ ]\+\) .\+@\1@p" wgetlog.txt | sed "s@&@\&amp;@" > sedlog.txt
This creates a file called sedlog.txt that contains all
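If wget is not available, the same kind of plain-text URL list can be produced with a short Python crawler. A rough sketch only: it stays on one host, ignores robots.txt, follows only <a href> links, and somewebsite.com is a placeholder.

```python
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse

START = 'http://somewebsite.com/'


class LinkParser(HTMLParser):
    """Collects the href values of all <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


seen = {START}
queue = deque([START])
while queue:
    url = queue.popleft()
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            if 'text/html' not in resp.headers.get('Content-Type', ''):
                continue
            html = resp.read().decode('utf-8', errors='replace')
    except Exception:
        continue
    print(url)  # one crawled page per line
    parser = LinkParser()
    parser.feed(html)
    for href in parser.links:
        absolute = urldefrag(urljoin(url, href))[0]
        if urlparse(absolute).netloc == urlparse(START).netloc and absolute not in seen:
            seen.add(absolute)
            queue.append(absolute)
```

Redirect the output to a file and you have the same sort of plain-text page list as the sedlog.txt produced by the wget/sed pipeline above.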

Best solution to host a crawler? [closed]

南楼画角 submitted on 2019-12-03 12:43:11
Question: (Closed as off-topic; no longer accepting answers.) I have a crawler that crawls a few different domains for new posts/content. The total amount of content is hundreds of thousands of pages, and a lot of new content is added each day. So to be able to crawl through all this content, I need my crawler to be crawling 24/7. Currently I host the crawler script on