web-crawler

Scrapy rules not working when process_request and callback parameter are set

Submitted by 帅比萌擦擦* on 2020-01-03 15:54:27
Question: I have this rule for a Scrapy CrawlSpider:

    rules = [
        Rule(LinkExtractor(
                allow='/topic/\d+/organize$',
                restrict_xpaths='//div[@id="zh-topic-organize-child-editor"]'),
            process_request='request_tagPage',
            callback='parse_tagPage',
            follow=True)
    ]

request_tagPage() refers to a function that adds a cookie to the requests, and parse_tagPage() refers to a function that parses the target pages. According to the documentation, CrawlSpider should use request_tagPage to make the requests and, once responses are returned, pass them to parse_tagPage() for parsing.
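The question is truncated before it describes what actually goes wrong, but a common cause of this symptom, offered here as an assumption, is a process_request hook that builds a brand-new Request and thereby drops the callback the Rule attached. A minimal sketch that only adds cookies while keeping everything else the Rule set up (the cookie name and value are placeholders):

    def request_tagPage(self, request):
        # request.replace() returns a copy of the rule-generated request, so the
        # callback (parse_tagPage) and the rest of its settings are preserved;
        # only the cookies are added. 'sessionid' is a placeholder cookie.
        return request.replace(cookies={'sessionid': 'YOUR_SESSION_VALUE'})

In recent Scrapy releases the Rule's process_request is called with both the request and the response it was extracted from, so the signature may need a second parameter.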

Will Googlebot crawl changes to the DOM made with JavaScript?

Submitted by 本小妞迷上赌 on 2020-01-03 10:46:45
Question: For SEO, I have been tasked with adding rel="nofollow" to all external links*. The simplest and least obtrusive way to add rel="nofollow" to each external link is with some jQuery. I've done this fine, but I'm now wondering: does Google see changes made to the DOM during jQuery's document-ready handler (such as this one), or does it only see the original source code? I don't want to discuss whether this is a bad idea. This is an SEO consultant's decision, and I've learnt that unless implementation…

Extract text from a DIV that occurs on multiple pages on a website, then output to .txt?

Submitted by 自作多情 on 2020-01-03 06:14:20
Question: Just to note from the start, the content is uncopyrighted and I would like to automate the process of acquiring the text for a project. I'd like to extract the text from a particular, recurring DIV (which has its own 'class' attribute, in case that makes it easier) on each page of a simply designed website. There is a single archive page on the site with a list of all of the pages containing the content I would like. The site is www.zenhabits.net. I imagine…
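The question is cut off, but the task it describes maps naturally onto requests plus BeautifulSoup: read the archive page, follow each linked post, pull out the one DIV, and append its text to a file. A minimal sketch under stated assumptions: the archive URL and the DIV's class name below are placeholders, not values taken from zenhabits.net.

    import requests
    from bs4 import BeautifulSoup

    ARCHIVE_URL = "https://zenhabits.net/archives/"   # assumed location of the archive page
    TARGET_CLASS = "entry"                            # placeholder for the DIV's class

    def archive_links(archive_url):
        # Collect every absolute link listed on the archive page.
        soup = BeautifulSoup(requests.get(archive_url, timeout=30).text, "html.parser")
        return [a["href"] for a in soup.find_all("a", href=True)
                if a["href"].startswith("http")]

    with open("output.txt", "w", encoding="utf-8") as out:
        for url in archive_links(ARCHIVE_URL):
            soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
            div = soup.find("div", class_=TARGET_CLASS)
            if div:                                   # skip pages without the target DIV
                out.write(div.get_text(separator="\n").strip() + "\n\n")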

Use python to crawl a website

Submitted by 痞子三分冷 on 2020-01-03 04:45:49
Question: So I am looking for a dynamic way to crawl a website and grab links from each page. I decided to experiment with BeautifulSoup. Two questions: How do I do this more dynamically than with nested while statements searching for links? I want to get all the links from this site, but I don't want to keep adding nested while loops:

    topLevelLinks = self.getAllUniqueLinks(baseUrl)
    listOfLinks = list(topLevelLinks)
    length = len(listOfLinks)
    count = 0
    while(count < length):
        twoLevelLinks = self…
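Rather than one nested while loop per level of depth, the usual pattern is a breadth-first crawl: a queue of URLs to visit plus a set of URLs already seen handles any depth with a single loop. A sketch along those lines (getAllUniqueLinks is replaced by inline link extraction with requests and BeautifulSoup; the same-domain check and page limit are assumptions):

    from collections import deque
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    def crawl(base_url, max_pages=100):
        """Breadth-first crawl: a queue plus a seen-set replaces nested while loops."""
        seen = {base_url}
        queue = deque([base_url])
        while queue and len(seen) < max_pages:
            url = queue.popleft()
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue                          # skip pages that fail to load
            for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                link = urljoin(url, a["href"])
                # Stay on the starting site and never revisit a page.
                if urlparse(link).netloc == urlparse(base_url).netloc and link not in seen:
                    seen.add(link)
                    queue.append(link)
        return seen

    # e.g. print(crawl("http://example.com"))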

Using PHP and RegEx to fetch all option values from a site's source code

Submitted by 南楼画角 on 2020-01-03 03:38:08
Question: I'm learning RegEx and site crawling, and have the following question which, if answered, should speed up my learning process significantly. I have fetched the form element from a web site in HTML-encoded format. That is to say, I have the $content string with all the tags intact, like so:

    $content = '<form name="sth" action="">
        <select name="city">
            <option value="one">One town</option>
            <option value="two">Another town</option>
            <option value="three">Yet Another town</option>
            ...
        </select>
        <…
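The question is about PHP, but the regex itself is what carries over; to stay with one language for the examples on this page, here is the same extraction sketched in Python (in PHP, preg_match_all with the same pattern would do the job). For anything beyond a quick scrape, a real HTML parser (DOMDocument in PHP, BeautifulSoup in Python) is more robust than a regex.

    import re

    content = '''<select name="city">
        <option value="one">One town</option>
        <option value="two">Another town</option>
        <option value="three">Yet Another town</option>
    </select>'''

    # Capture each option's value attribute together with its label text.
    pairs = re.findall(r'<option\s+value="([^"]*)"[^>]*>(.*?)</option>', content, re.S)
    print(dict(pairs))
    # {'one': 'One town', 'two': 'Another town', 'three': 'Yet Another town'}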

Can't run Scrapy program

Submitted by 为君一笑 on 2020-01-03 03:22:13
Question: I have been learning how to work with Scrapy from the following link: http://doc.scrapy.org/en/master/intro/tutorial.html When I try to run the code written in the Crawling section (scrapy crawl dmoz), I get the following error: AttributeError: 'module' object has no attribute 'Spider'. I then changed "Spider" to "spider" and got nothing but a new error: TypeError: Error when calling the metaclass bases module.__init__() takes at most 2 arguments (3 given). I'm so confused, what is the problem?
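The second error is the giveaway: after the lowercase edit the class is inheriting from scrapy.spider, which is a module, not a class. The usual cause, offered as an assumption since the installed version isn't shown, is either an old Scrapy release that predates scrapy.Spider or a local file named scrapy.py shadowing the library. With a current Scrapy installed (pip install --upgrade scrapy), the tutorial spider looks like this and is run with scrapy crawl dmoz from the project directory:

    import scrapy

    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        ]

        def parse(self, response):
            # Save each visited page to disk, as in the tutorial.
            filename = response.url.split("/")[-2] + ".html"
            with open(filename, "wb") as f:
                f.write(response.body)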

Restricting URLs to seed URL domain only crawler4j

Submitted by 倖福魔咒の on 2020-01-03 02:55:52
Question: I want crawler4j to visit pages only if they belong to a domain in the seed list. There are multiple domains in the seed. How can I do it? Suppose I am adding these seed URLs: www.google.com, www.yahoo.com, www.wikipedia.com. Now I am starting the crawl, but I want my crawler to visit pages (via something like shouldVisit()) only in the above three domains. Obviously there are external links, but I want my crawler to restrict itself to these domains only. Sub-domains and sub-folders are okay, but nothing outside these domains.
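crawler4j is a Java library, so the actual check belongs in the WebCrawler subclass's shouldVisit() override; the logic itself is small enough to sketch here in Python, to stay consistent with the other examples on this page (the domain list mirrors the seeds in the question):

    from urllib.parse import urlparse

    SEED_DOMAINS = {"google.com", "yahoo.com", "wikipedia.com"}

    def should_visit(url):
        """Accept a URL only if its host is a seed domain or a sub-domain of one."""
        host = urlparse(url).netloc.lower().split(":")[0]   # strip any port
        return any(host == d or host.endswith("." + d) for d in SEED_DOMAINS)

    # should_visit("http://news.google.com/world")  -> True  (sub-domain is fine)
    # should_visit("http://example.org/")           -> False (outside the seeds)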

Scrapy-Splash Session Handling

Submitted by 不羁岁月 on 2020-01-03 01:23:10
Question: I have been trying to log in to a website and then crawl some URLs that are only accessible after signing in.

    def start_requests(self):
        script = """
        function main(splash)
            splash:init_cookies(splash.args.cookies)
            assert(splash:go(splash.args.url))
            splash:set_viewport_full()
            local search_input = splash:select('input[name=username]')
            search_input:send_text("MY_USERNAME")
            splash:evaljs("document.getElementById('password').value = 'MY_PASSWORD';")
            local submit_button = splash:select('input[name=signin]')
            …
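The script is cut off above, but the way this is usually wired up with scrapy-splash is sketched below. Assumptions: the scrapy-splash middlewares (including SplashCookiesMiddleware) are enabled in settings.py, the Lua script ends by returning splash:get_cookies() and splash:html() so the session cookies flow back to Scrapy, and the URLs and spider name are placeholders.

    import scrapy
    from scrapy_splash import SplashRequest

    class LoginSpider(scrapy.Spider):
        name = "login_example"                        # hypothetical spider name

        # The Lua script from the question goes here; it should end with
        #   return {cookies = splash:get_cookies(), html = splash:html()}
        # so SplashCookiesMiddleware can pick up the logged-in session.
        script = """..."""

        def start_requests(self):
            yield SplashRequest(
                "https://example.com/login",          # placeholder login page
                callback=self.after_login,
                endpoint="execute",
                cache_args=["lua_source"],
                args={"lua_source": self.script},
            )

        def after_login(self, response):
            # The middleware re-sends the returned cookies on later Splash
            # requests, so pages behind the login can now be fetched normally.
            yield SplashRequest("https://example.com/protected",  # placeholder
                                callback=self.parse_page,
                                args={"wait": 1})

        def parse_page(self, response):
            self.logger.info("fetched %s", response.url)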

screen scraping using Ghost.py

Submitted by 北城余情 on 2020-01-02 23:14:07
Question: Here is a simple program which does not work:

    from ghost import Ghost

    ghost = Ghost(wait_timeout=40)
    page, extra_resources = ghost.open("http://samsung.com/in/consumer/mobile-phone/mobile-phone/smartphone/")
    ghost.wait_page_loaded()
    n = 2
    links = ghost.evaluate("alist=document.getElementsByTagName('a');alist")
    print links

The error is:

    raise Exception(timeout_message)
    Exception: Unable to load requested page

Is there some problem with the program?

Answer 1: Seems like people are reporting similar issues…
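The answer is truncated, but the "Unable to load requested page" timeout generally means the page did not finish loading within wait_timeout (this particular page pulls in a lot of resources, and some sites also refuse default headless user agents). If the anchors are present in the raw HTML and JavaScript rendering is not actually required, the headless browser can be skipped entirely; a sketch of that fallback, offered as an alternative approach rather than a fix for Ghost.py itself:

    import requests
    from bs4 import BeautifulSoup

    url = "http://samsung.com/in/consumer/mobile-phone/mobile-phone/smartphone/"
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=60)
    soup = BeautifulSoup(resp.text, "html.parser")

    # Collect the href of every anchor found in the static HTML.
    links = [a["href"] for a in soup.find_all("a", href=True)]
    print(links)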
