web-crawler

Crawlable AJAX with _escaped_fragment_ in htaccess

徘徊边缘 submitted on 2019-11-30 14:10:38
Hello fellow developers! We are almost finished developing the first phase of our AJAX web app. In our app we use hash fragments like http://ourdomain.com/#!list=last_ads&order=date and I understand Google will fetch this URL and make a request to the server in this form: http://ourdomain.com/?_escaped_fragment_=list=last_ads?order=date&direction=desc Everything is perfect, except... I would like to route this kind of request to another script, like so: RewriteCond %{QUERY_STRING} ^_escaped_fragment_=(.*)$ RewriteRule ^$ /webroot/crawler.php$1 [L] The problem is that when I try to print_r( …
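For context, the transformation the AJAX crawling scheme applies can be reproduced in a few lines. The sketch below (Python, with hypothetical example URLs) shows how a #! URL maps to the _escaped_fragment_ request a crawler sends, and how a server-side handler such as the crawler.php above would recover the original fragment; it illustrates the scheme, not the poster's .htaccess fix.

    # Illustration of the AJAX crawling scheme's URL mapping (hypothetical URLs),
    # not the poster's crawler.php or rewrite rules.
    from urllib.parse import quote, urlsplit, parse_qs

    def to_escaped_fragment(pretty_url):
        # http://example.com/#!list=last_ads&order=date
        #   -> http://example.com/?_escaped_fragment_=list%3Dlast_ads%26order%3Ddate
        base, _, fragment = pretty_url.partition('#!')
        return base + '?_escaped_fragment_=' + quote(fragment, safe='')

    def from_escaped_fragment(crawler_url):
        # Reverse mapping: rebuild the #! URL that the crawler request stands for.
        parts = urlsplit(crawler_url)
        fragment = parse_qs(parts.query).get('_escaped_fragment_', [''])[0]
        return '{}://{}{}#!{}'.format(parts.scheme, parts.netloc, parts.path, fragment)

    print(to_escaped_fragment('http://example.com/#!list=last_ads&order=date'))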

Does html5mode(true) affect google search crawlers

元气小坏坏 submitted on 2019-11-30 14:10:25
I'm reading this specification, which is an agreement between web servers and search engine crawlers that allows dynamically created content to be visible to crawlers. It states that in order for a crawler to index an HTML5 application, one must implement routing using #! in URLs. With Angular's html5mode(true) we get rid of this hashed part of the URL, so I'm wondering whether that is going to prevent crawlers from indexing my website. Short answer: no, html5mode will not mess up your indexing, but read on. Important note: both Google and Bing can crawl AJAX-based content without HTML …

Recrawl URL with Nutch just for updated sites

让人想犯罪 __ submitted on 2019-11-30 13:43:25
I crawled one URL with Nutch 2.1 and now I want to re-crawl pages after they get updated. How can I do this? How can I know that a page has been updated? İsmet Alkan: Simply put, you can't. You need to recrawl the page to check whether it has been updated, so, according to your needs, prioritize the pages/domains and recrawl them on a schedule; for that you need a job scheduler such as Quartz. You would also need to write a function that compares the pages. However, Nutch saves pages as index files; in other words, it generates new binary files to store the HTML, so I don't think it's possible to compare …
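Nutch keeps fetched pages in its own segment format, but the general approach the answer describes (refetch on a schedule, compare with the previous copy, reprocess only on change) can be sketched outside Nutch. The snippet below is a generic Python illustration with placeholder URLs and in-memory storage, not Nutch or Quartz code.

    # Generic "recrawl and compare" sketch (not Nutch-specific): keep a
    # fingerprint per URL and refetch periodically to detect changes.
    import hashlib
    import urllib.request

    seen = {}  # url -> SHA-1 of the last fetched body (placeholder storage)

    def page_changed(url):
        body = urllib.request.urlopen(url, timeout=10).read()
        digest = hashlib.sha1(body).hexdigest()
        changed = seen.get(url) != digest
        seen[url] = digest
        return changed

    for url in ['http://example.com/', 'http://example.org/about']:
        print(url, 'changed' if page_changed(url) else 'unchanged')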

Scrapy set depth limit per allowed_domains

和自甴很熟 submitted on 2019-11-30 12:19:45
Question: I am crawling 6 different allowed_domains and would like to limit the depth of one of them. How would I go about limiting the depth of that one domain in Scrapy? Or would it be possible to crawl only one level deep into offsite domains? Answer 1: Scrapy doesn't provide anything like this out of the box. You can set DEPTH_LIMIT per spider, but not per domain. What can we do? Read the code, drink coffee and solve it (order is important). The idea is to disable Scrapy's built-in DepthMiddleware and provide our custom one …
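The excerpt cuts off before the answer's code, so here is a rough sketch of the idea it describes: disable the stock DepthMiddleware and plug in a spider middleware that tracks depth itself and applies a per-domain limit. The DOMAIN_DEPTH_LIMITS setting name and the domains are assumptions for illustration, not Scrapy built-ins.

    # Sketch of a per-domain depth limit as a Scrapy spider middleware.
    # DOMAIN_DEPTH_LIMITS is a made-up setting, e.g. {'example.com': 1}.
    from urllib.parse import urlparse
    from scrapy import Request

    class PerDomainDepthMiddleware:
        def __init__(self, domain_limits):
            self.domain_limits = domain_limits

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler.settings.getdict('DOMAIN_DEPTH_LIMITS'))

        def process_spider_output(self, response, result, spider):
            parent_depth = response.meta.get('depth', 0)
            for entry in result:
                if isinstance(entry, Request):
                    depth = parent_depth + 1
                    entry.meta['depth'] = depth
                    limit = self.domain_limits.get(urlparse(entry.url).netloc)
                    if limit is not None and depth > limit:
                        continue  # too deep for this domain: drop the request
                yield entry

It would be enabled through SPIDER_MIDDLEWARES in settings.py, with the built-in scrapy.spidermiddlewares.depth.DepthMiddleware set to None so the two don't both rewrite the depth value.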

Python Web Crawlers and “getting” html source code

99封情书 submitted on 2019-11-30 12:00:28
Question: My brother wanted me to write a web crawler in Python (self-taught); I know C++, Java, and a bit of HTML. I'm using version 2.7 and reading the Python library documentation, but I have a few problems. 1. httplib.HTTPConnection and request are new concepts to me, and I don't understand whether they download an HTML script, something like a cookie, or an instance. If you do both of those, do you get the source for a website's page? And what are some terms I would need to know to modify the page and return the modified page …
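For the first question, a minimal Python 2.7 example of the two calls mentioned (host and path are placeholders): the request asks for a page, and the response object hands back the status line, the headers, and the raw HTML source as a string.

    # Python 2.7: fetch a page's HTML source with httplib (placeholder host).
    import httplib

    conn = httplib.HTTPConnection("www.example.com")
    conn.request("GET", "/")           # send the GET request
    response = conn.getresponse()      # status and headers are available here
    print response.status, response.reason
    html = response.read()             # the page's HTML source, as a string
    conn.close()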

Web crawler in ruby [closed]

别等时光非礼了梦想. submitted on 2019-11-30 10:41:29
Question: (Closed 6 years ago as not a good fit for the Q&A format.) What is your recommendation for writing a web crawler in Ruby? Any lib better than mechanize? Answer 1: If you want just to get pages' …

Web mining or scraping or crawling? What tool/library should I use? [closed]

别说谁变了你拦得住时间么 submitted on 2019-11-30 10:38:30
I want to crawl and save some webpages as HTML; say, crawl hundreds of popular websites and simply save their front pages and their "About" pages. I've looked at many questions but didn't find an answer to this among either the web-crawling or the web-scraping questions. What library or tool should I use to build the solution? Or is there an existing tool that can handle this? There really is no good solution here. You are right to suspect that Python is probably the best way to start because of its incredibly strong support for regular expressions. In order to implement something like …
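A small standard-library sketch of the "just save the front pages" part (Python 3, made-up site list and output directory; a real crawler should also honour robots.txt and rate limits):

    # Fetch a handful of pages and save each one as an HTML file on disk.
    import os
    import urllib.request
    from urllib.parse import quote

    sites = ['http://example.com/', 'http://example.org/about']
    os.makedirs('pages', exist_ok=True)

    for url in sites:
        try:
            html = urllib.request.urlopen(url, timeout=15).read()
        except Exception as err:
            print('skipping', url, err)
            continue
        filename = quote(url, safe='') + '.html'   # crude but filesystem-safe name
        with open(os.path.join('pages', filename), 'wb') as fh:
            fh.write(html)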

selenium implicitly wait doesn't work

孤人 submitted on 2019-11-30 09:39:08
Question: This is the first time I've used Selenium and a headless browser, as I want to crawl some web pages that load content with AJAX. It works well, but in some cases it takes too much time to load the whole page (especially when some resource is unavailable), so I have to set a timeout for Selenium. First I tried set_page_load_timeout() and set_script_timeout(), but with these timeouts set I don't get any page source if the page doesn't load completely, as in the code below: driver = webdriver …
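The excerpt ends before the poster's code, so the sketch below shows the commonly suggested pattern rather than the original snippet: set a page-load timeout, and when it fires, stop the page and read whatever has rendered so far (the driver choice and URL are placeholders).

    # Grab a partially loaded page after a load timeout (placeholder URL/driver).
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException

    driver = webdriver.Firefox()         # any driver; headless setup omitted
    driver.set_page_load_timeout(10)     # give up on the full load after 10 s

    try:
        driver.get('http://example.com/slow-page')
    except TimeoutException:
        driver.execute_script('window.stop();')  # stop pending requests

    html = driver.page_source            # whatever rendered before the cutoff
    driver.quit()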

Scrapy delay request

爱⌒轻易说出口 submitted on 2019-11-30 09:07:20
Every time I run my code my IP gets banned. I need help delaying each request for 10 seconds; I've tried to place DOWNLOAD_DELAY in the code but it gives no results. Any help is appreciated. # item class included here class DmozItem(scrapy.Item): # define the fields for your item here like: link = scrapy.Field() attr = scrapy.Field() class DmozSpider(scrapy.Spider): name = "dmoz" allowed_domains = ["craigslist.org"] start_urls = [ "https://washingtondc.craigslist.org/search/fua" ] BASE_URL = 'https://washingtondc.craigslist.org/' def parse(self, response): links = response.xpath('//a[@class= …
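DOWNLOAD_DELAY is a Scrapy setting, so it takes effect from settings.py or the spider's custom_settings rather than from a variable inside parse(). A minimal sketch reusing the spider name and start URL from the question (the parse body is simplified, not the original):

    # Per-spider delay via custom_settings (requires a reasonably recent Scrapy).
    import scrapy

    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = ["craigslist.org"]
        start_urls = ["https://washingtondc.craigslist.org/search/fua"]

        custom_settings = {
            "DOWNLOAD_DELAY": 10,               # pause between requests
            "RANDOMIZE_DOWNLOAD_DELAY": False,  # keep the delay fixed at 10 s
        }

        def parse(self, response):
            for href in response.xpath("//a/@href").extract():
                yield response.follow(href, callback=self.parse)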

JSoup parsing invalid HTML with unclosed tags

一笑奈何 submitted on 2019-11-30 08:24:02
Question: Using JSoup, up to and including the latest release, 1.7.2, there is a bug when parsing invalid HTML with unclosed tags. Example: String tmp = "<a href='www.google.com'>Link<p>Error link</a>"; Jsoup.parse(tmp); The Document it generates is: <html> <head></head> <body> <a href="www.google.com">Link</a> <p><a>Error link</a></p> </body> </html> Browsers would generate something like: <html> <head></head> <body> <a href="www.google.com">Link</a> <p><a href="www.google.com">Error link</a></p> </body> </html> …