web-crawler

Scrapy rules not working when process_request and callback parameter are set

Submitted by 帅比萌擦擦* on 2020-01-03 15:54:27
Question: I have this rule for a Scrapy CrawlSpider:

    rules = [
        Rule(LinkExtractor(
                allow='/topic/\d+/organize$',
                restrict_xpaths='//div[@id="zh-topic-organize-child-editor"]'),
            process_request='request_tagPage',
            callback='parse_tagPage',
            follow=True)
    ]

request_tagPage() refers to a function that adds a cookie to the requests, and parse_tagPage() refers to a function that parses the target pages. According to the documentation, CrawlSpider should use request_tagPage to make the requests and, once responses are returned, pass them to parse_tagPage() for parsing.
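The question is truncated before it describes what actually goes wrong, but a common cause of this symptom, offered here as an assumption, is a process_request hook that builds a brand-new Request and thereby drops the callback the Rule attached. A minimal sketch that only adds cookies while keeping everything else the Rule set up (the cookie name and value are placeholders):

    def request_tagPage(self, request):
        # request.replace() returns a copy of the rule-generated request, so the
        # callback (parse_tagPage) and the rest of its settings are preserved;
        # only the cookies are added. 'sessionid' is a placeholder cookie.
        return request.replace(cookies={'sessionid': 'YOUR_SESSION_VALUE'})

In recent Scrapy releases the Rule's process_request is called with both the request and the response it was extracted from, so the signature may need a second parameter.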

Will Googlebot crawl changes to the DOM made with JavaScript?

Submitted by 本小妞迷上赌 on 2020-01-03 10:46:45
Question: For SEO, I have been tasked with adding rel="nofollow" to all external links*. The simplest and least obtrusive way to add rel="nofollow" to each external link is with some jQuery. I've done this fine, but I'm now wondering: does Google see changes made to the DOM during jQuery's document-ready handler (such as this one), or does it only see the original source code? I don't want to discuss whether this is a bad idea. This is an SEO consultant's decision, and I've learnt that unless implementation…

Extract text from a DIV that occurs on multiple pages on a website, then output to .txt?

Submitted by 自作多情 on 2020-01-03 06:14:20
Question: Just to note from the start, the content is uncopyrighted and I would like to automate the process of acquiring the text for a project. I'd like to extract the text from a particular, recurring DIV (which has its own 'class' attribute, in case that makes it easier) on each page of a simply designed website. There is a single archive page on the site with a list of all of the pages containing the content I would like. The site is www.zenhabits.net. I imagine…
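The question is cut off, but the task it describes maps naturally onto requests plus BeautifulSoup: read the archive page, follow each linked post, pull out the one DIV, and append its text to a file. A minimal sketch under stated assumptions: the archive URL and the DIV's class name below are placeholders, not values taken from zenhabits.net.

    import requests
    from bs4 import BeautifulSoup

    ARCHIVE_URL = "https://zenhabits.net/archives/"   # assumed location of the archive page
    TARGET_CLASS = "entry"                            # placeholder for the DIV's class

    def archive_links(archive_url):
        # Collect every absolute link listed on the archive page.
        soup = BeautifulSoup(requests.get(archive_url, timeout=30).text, "html.parser")
        return [a["href"] for a in soup.find_all("a", href=True)
                if a["href"].startswith("http")]

    with open("output.txt", "w", encoding="utf-8") as out:
        for url in archive_links(ARCHIVE_URL):
            soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
            div = soup.find("div", class_=TARGET_CLASS)
            if div:                                   # skip pages without the target DIV
                out.write(div.get_text(separator="\n").strip() + "\n\n")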

Use python to crawl a website

Submitted by 痞子三分冷 on 2020-01-03 04:45:49
Question: So I am looking for a dynamic way to crawl a website and grab links from each page. I decided to experiment with BeautifulSoup. Two questions: How do I do this more dynamically than with nested while statements searching for links? I want to get all the links from this site, but I don't want to keep adding nested while loops:

    topLevelLinks = self.getAllUniqueLinks(baseUrl)
    listOfLinks = list(topLevelLinks)
    length = len(listOfLinks)
    count = 0
    while(count < length):
        twoLevelLinks = self…
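Rather than one nested while loop per level of depth, the usual pattern is a breadth-first crawl: a queue of URLs to visit plus a set of URLs already seen handles any depth with a single loop. A sketch along those lines (getAllUniqueLinks is replaced by inline link extraction with requests and BeautifulSoup; the same-domain check and page limit are assumptions):

    from collections import deque
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    def crawl(base_url, max_pages=100):
        """Breadth-first crawl: a queue plus a seen-set replaces nested while loops."""
        seen = {base_url}
        queue = deque([base_url])
        while queue and len(seen) < max_pages:
            url = queue.popleft()
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue                          # skip pages that fail to load
            for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                link = urljoin(url, a["href"])
                # Stay on the starting site and never revisit a page.
                if urlparse(link).netloc == urlparse(base_url).netloc and link not in seen:
                    seen.add(link)
                    queue.append(link)
        return seen

    # e.g. print(crawl("http://example.com"))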

Using PHP and RegEx to fetch all option values from a site's source code

Submitted by 南楼画角 on 2020-01-03 03:38:08
Question: I'm learning RegEx and site crawling, and have the following question which, if answered, should speed up my learning process significantly. I have fetched the form element from a web site in HTML-encoded format. That is to say, I have the $content string with all the tags intact, like so:

    $content = '<form name="sth" action="">
        <select name="city">
            <option value="one">One town</option>
            <option value="two">Another town</option>
            <option value="three">Yet Another town</option>
            ...
        </select>
        <…
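The question is about PHP, but the regex itself is what carries over; to stay with one language for the examples on this page, here is the same extraction sketched in Python (in PHP, preg_match_all with the same pattern would do the job). For anything beyond a quick scrape, a real HTML parser (DOMDocument in PHP, BeautifulSoup in Python) is more robust than a regex.

    import re

    content = '''<select name="city">
        <option value="one">One town</option>
        <option value="two">Another town</option>
        <option value="three">Yet Another town</option>
    </select>'''

    # Capture each option's value attribute together with its label text.
    pairs = re.findall(r'<option\s+value="([^"]*)"[^>]*>(.*?)</option>', content, re.S)
    print(dict(pairs))
    # {'one': 'One town', 'two': 'Another town', 'three': 'Yet Another town'}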

Can't run Scrapy program

Submitted by 为君一笑 on 2020-01-03 03:22:13
Question: I have been learning how to work with Scrapy from the following link: http://doc.scrapy.org/en/master/intro/tutorial.html When I try to run the code written in the Crawling section (scrapy crawl dmoz), I get the following error: AttributeError: 'module' object has no attribute 'Spider'. I then changed "Spider" to "spider" and got nothing but a new error: TypeError: Error when calling the metaclass bases module.__init__() takes at most 2 arguments (3 given). I'm so confused, what is the problem?
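The second error is the giveaway: after the lowercase edit the class is inheriting from scrapy.spider, which is a module, not a class. The usual cause, offered as an assumption since the installed version isn't shown, is either an old Scrapy release that predates scrapy.Spider or a local file named scrapy.py shadowing the library. With a current Scrapy installed (pip install --upgrade scrapy), the tutorial spider looks like this and is run with scrapy crawl dmoz from the project directory:

    import scrapy

    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        ]

        def parse(self, response):
            # Save each visited page to disk, as in the tutorial.
            filename = response.url.split("/")[-2] + ".html"
            with open(filename, "wb") as f:
                f.write(response.body)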

Restricting URLs to seed URL domain only crawler4j

Submitted by 倖福魔咒の on 2020-01-03 02:55:52
Question: I want crawler4j to visit pages only if they belong to a domain in the seed list. There are multiple domains in the seed. How can I do it? Suppose I am adding these seed URLs: www.google.com, www.yahoo.com, www.wikipedia.com. Now I am starting the crawl, but I want my crawler to visit pages (via something like shouldVisit()) only in the above three domains. Obviously there are external links, but I want my crawler to restrict itself to these domains only. Sub-domains and sub-folders are okay, but nothing outside these domains.
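crawler4j is a Java library, so the actual check belongs in the WebCrawler subclass's shouldVisit() override; the logic itself is small enough to sketch here in Python, to stay consistent with the other examples on this page (the domain list mirrors the seeds in the question):

    from urllib.parse import urlparse

    SEED_DOMAINS = {"google.com", "yahoo.com", "wikipedia.com"}

    def should_visit(url):
        """Accept a URL only if its host is a seed domain or a sub-domain of one."""
        host = urlparse(url).netloc.lower().split(":")[0]   # strip any port
        return any(host == d or host.endswith("." + d) for d in SEED_DOMAINS)

    # should_visit("http://news.google.com/world")  -> True  (sub-domain is fine)
    # should_visit("http://example.org/")           -> False (outside the seeds)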

Scrapy-Splash Session Handling

Submitted by 不羁岁月 on 2020-01-03 01:23:10
Question: I have been trying to log in to a website and then crawl some URLs that are only accessible after signing in.

    def start_requests(self):
        script = """
        function main(splash)
            splash:init_cookies(splash.args.cookies)
            assert(splash:go(splash.args.url))
            splash:set_viewport_full()
            local search_input = splash:select('input[name=username]')
            search_input:send_text("MY_USERNAME")
            splash:evaljs("document.getElementById('password').value = 'MY_PASSWORD';")
            local submit_button = splash:select('input[name=signin]')
            …
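The script is cut off above, but the way this is usually wired up with scrapy-splash is sketched below. Assumptions: the scrapy-splash middlewares (including SplashCookiesMiddleware) are enabled in settings.py, the Lua script ends by returning splash:get_cookies() and splash:html() so the session cookies flow back to Scrapy, and the URLs and spider name are placeholders.

    import scrapy
    from scrapy_splash import SplashRequest

    class LoginSpider(scrapy.Spider):
        name = "login_example"                        # hypothetical spider name

        # The Lua script from the question goes here; it should end with
        #   return {cookies = splash:get_cookies(), html = splash:html()}
        # so SplashCookiesMiddleware can pick up the logged-in session.
        script = """..."""

        def start_requests(self):
            yield SplashRequest(
                "https://example.com/login",          # placeholder login page
                callback=self.after_login,
                endpoint="execute",
                cache_args=["lua_source"],
                args={"lua_source": self.script},
            )

        def after_login(self, response):
            # The middleware re-sends the returned cookies on later Splash
            # requests, so pages behind the login can now be fetched normally.
            yield SplashRequest("https://example.com/protected",  # placeholder
                                callback=self.parse_page,
                                args={"wait": 1})

        def parse_page(self, response):
            self.logger.info("fetched %s", response.url)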

screen scraping using Ghost.py

Submitted by 北城余情 on 2020-01-02 23:14:07
Question: Here is a simple program which does not work:

    from ghost import Ghost

    ghost = Ghost(wait_timeout=40)
    page, extra_resources = ghost.open("http://samsung.com/in/consumer/mobile-phone/mobile-phone/smartphone/")
    ghost.wait_page_loaded()
    n = 2
    links = ghost.evaluate("alist=document.getElementsByTagName('a');alist")
    print links

The error is:

    raise Exception(timeout_message)
    Exception: Unable to load requested page

Is there some problem with the program?

Answer 1: Seems like people are reporting similar issues…
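The answer is truncated, but the "Unable to load requested page" timeout generally means the page did not finish loading within wait_timeout (this particular page pulls in a lot of resources, and some sites also refuse default headless user agents). If the anchors are present in the raw HTML and JavaScript rendering is not actually required, the headless browser can be skipped entirely; a sketch of that fallback, offered as an alternative approach rather than a fix for Ghost.py itself:

    import requests
    from bs4 import BeautifulSoup

    url = "http://samsung.com/in/consumer/mobile-phone/mobile-phone/smartphone/"
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=60)
    soup = BeautifulSoup(resp.text, "html.parser")

    # Collect the href of every anchor found in the static HTML.
    links = [a["href"] for a in soup.find_all("a", href=True)]
    print(links)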
