web-crawler

Errors regarding Web Crawler in PHP

会有一股神秘感。 Submitted on 2019-12-01 07:35:06
Question: I am trying to create a simple web crawler in PHP that is capable of crawling .edu domains, given the seed URLs of the parent pages. I have used Simple HTML DOM for implementing the crawler, while some of the core logic is implemented by me. I am posting the code below and will try to explain the problems. private function initiateChildCrawler($parent_Url_Html) { global $CFG; static $foundLink; static $parentID; static $urlToCrawl_InstanceOfChildren; $forEachCount = 0; foreach($parent_Url_Html …
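The asker's code is PHP with Simple HTML DOM; purely as an illustration of the same breadth-first "follow only .edu links from seed URLs" idea, here is a minimal Python sketch (requests and BeautifulSoup assumed; the seed URL is hypothetical):

    import urllib.parse
    from collections import deque

    import requests
    from bs4 import BeautifulSoup

    def crawl_edu(seed_urls, max_pages=100):
        """Breadth-first crawl that only queues links on .edu hosts."""
        queue = deque(seed_urls)
        seen = set(seed_urls)
        while queue and len(seen) <= max_pages:
            url = queue.popleft()
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue  # skip unreachable pages
            for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                link = urllib.parse.urljoin(url, a["href"])
                host = urllib.parse.urlparse(link).netloc
                if host.endswith(".edu") and link not in seen:
                    seen.add(link)
                    queue.append(link)
        return seen

    # crawl_edu(["https://www.example.edu/"])  # hypothetical seed URL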

Web crawler Parsing PHP/Javascript links?

徘徊边缘 Submitted on 2019-12-01 07:14:27
I'm currently using the HTML Agility Pack in C# for a web crawler. I've managed to avoid many issues so far (invalid URIs, such as "/extra/url/to/base.html" and "#" links), but I also need to process PHP, Javascript, etc. For some sites the links are generated by PHP or Javascript, and when my web crawler tries to navigate to these, it fails. One example is a PHP/Javascript accordion link page. How would I go about navigating/parsing these links? Answer 1: Let's see if I understood your question correctly. I'm aware that this answer is probably inadequate, but if you need a more specific answer I'd need more details. You …
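The underlying issue is that a plain HTML parser only sees href attributes, not links that Javascript builds at runtime. One common workaround, sketched below in Python rather than the asker's C# (selectors and the regex are assumptions), is to also pull candidate URLs out of onclick handlers, or to render the page with a headless browser first:

    import re
    import urllib.parse

    from bs4 import BeautifulSoup

    URL_RE = re.compile(r"""['"]((?:https?:)?/[^'"\s]+)['"]""")

    def extract_links(html, base_url):
        """Collect ordinary hrefs plus URLs hidden inside onclick handlers."""
        soup = BeautifulSoup(html, "html.parser")
        links = set()
        for a in soup.find_all("a", href=True):
            href = a["href"]
            if href.startswith("#") or href.lower().startswith("javascript:"):
                continue  # skip fragment-only and javascript: pseudo-links
            links.add(urllib.parse.urljoin(base_url, href))
        for tag in soup.find_all(attrs={"onclick": True}):
            for hit in URL_RE.findall(tag["onclick"]):
                links.add(urllib.parse.urljoin(base_url, hit))
        return links

    # extract_links(page_html, "https://example.com/")  # hypothetical page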

.htaccess for SEO bots crawling single page applications without hashbangs

岁酱吖の Submitted on 2019-12-01 04:35:53
Using a pushState-enabled page, normally you redirect SEO bots using the escaped_fragment convention. You can read more about that here. The convention assumes that you will be using a hashbang ( #! ) prefix before all of the URIs on a single-page application. SEO bots will escape these fragments by replacing the hashbang with their own recognizable convention, _escaped_fragment_, when making a page request. //Your page http://example.com/#!home //Requested by bots as http://example.com/?_escaped_fragment_=home This allows the site administrator to detect bots, and redirect them to a cached …
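As a rough illustration of the detect-and-redirect idea, here is a sketch in Python/Flask rather than .htaccess (directory names and snapshot paths are assumptions): the server checks for the _escaped_fragment_ query parameter and hands bots a pre-rendered snapshot while browsers get the SPA shell.

    from flask import Flask, request, send_from_directory

    app = Flask(__name__)

    @app.route("/")
    def index():
        fragment = request.args.get("_escaped_fragment_")
        if fragment is not None:
            # Bot request: serve a pre-rendered HTML snapshot (hypothetical path).
            page = fragment or "home"
            return send_from_directory("snapshots", f"{page}.html")
        # Normal browsers get the single-page application shell.
        return send_from_directory("static", "index.html")

    # For pushState URLs (no hashbang), the page must opt in with
    # <meta name="fragment" content="!"> so bots request ?_escaped_fragment_= .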

How to totally ignore 'debugger' statement in Chrome?

五迷三道 Submitted on 2019-12-01 04:12:58
'Never pause here' does not work; after I continue, execution is still paused. To ignore all breakpoints in Chrome entirely, do as follows: open the Chrome browser; press F12 (Inspect) or right-click the page and select Inspect; in the Sources panel, press Ctrl+F8 to deactivate all breakpoints (alternatively, select Deactivate breakpoints at the top-right corner). All breakpoints and debugger statements will then be deactivated. Source: https://stackoverflow.com/questions/45767855/how-to-totally-ignore-debugger-statement-in-chrome

Scrapy SgmlLinkExtractor question

。_饼干妹妹 Submitted on 2019-12-01 03:37:52
Question: I am trying to make the SgmlLinkExtractor work. This is the signature: SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(), tags=('a', 'area'), attrs=('href'), canonicalize=True, unique=True, process_value=None). I am just using allow=(). So I enter rules = (Rule(SgmlLinkExtractor(allow=("/aadler/", )), callback='parse'),). So the initial url is 'http://www.whitecase.com/jacevedo/' and I am entering allow=('/aadler',) and expect that '/aadler/' will get scanned as well. But instead, the spider scans the initial url and then closes: [wcase] INFO: Domain …
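One likely culprit in the snippet above is the callback name: CrawlSpider uses parse internally to implement its rule logic, so a rule callback should not be named parse. A minimal sketch with a renamed callback (shown with the modern LinkExtractor, which replaced SgmlLinkExtractor in later Scrapy releases; the allow pattern and start URL follow the question):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class AadlerSpider(CrawlSpider):  # hypothetical spider name
        name = "aadler"
        start_urls = ["http://www.whitecase.com/jacevedo/"]
        rules = (
            # Avoid callback='parse': CrawlSpider needs parse() for its own logic.
            Rule(LinkExtractor(allow=(r"/aadler/",)), callback="parse_item", follow=True),
        )

        def parse_item(self, response):
            yield {"url": response.url}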

Dynamic rules based on start_urls for Scrapy CrawlSpider?

烈酒焚心 Submitted on 2019-12-01 01:43:39
I'm writing a Scrapy scraper that uses CrawlSpider to crawl sites, go over their internal links, and scrape the contents of any external links (links with a domain different from the original domain). I managed to do that with 2 rules, but they are based on the domain of the site being crawled. If I want to run this on multiple websites I run into a problem, because I don't know which "start_url" I'm currently on, so I can't change the rule appropriately. Here's what I came up with so far; it works for one website, and I'm not sure how to apply it to a list of websites: class HomepagesSpider …
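One way to make the two rules work for a whole list of seed sites (a sketch only, based on the goal described above; seed URLs and callback names are illustrative) is to derive the set of internal domains from start_urls once and feed it to allow_domains / deny_domains:

    from urllib.parse import urlparse

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    START_URLS = ["https://example.com/", "https://example.org/"]  # hypothetical seeds
    SEED_DOMAINS = [urlparse(u).netloc for u in START_URLS]

    class HomepagesSpider(CrawlSpider):
        name = "homepages"
        start_urls = START_URLS
        # Note: don't set allowed_domains here, or the offsite middleware
        # will drop the requests to external domains before they are fetched.
        rules = (
            # Follow internal links on any seed domain, without scraping them.
            Rule(LinkExtractor(allow_domains=SEED_DOMAINS), follow=True),
            # Fetch links pointing off the seed domains and scrape them, but go no deeper.
            Rule(LinkExtractor(deny_domains=SEED_DOMAINS), callback="parse_external", follow=False),
        )

        def parse_external(self, response):
            yield {"external_url": response.url}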

How to scroll down in Python Selenium step by step

人盡茶涼 Submitted on 2019-12-01 01:06:32
Hi guys, I am new to Selenium and Python. I was scraping the pagalguy website. I know how to scroll down to the bottom of the page, but what I need is to scroll down step by step so that Selenium will click all the "Read More" buttons. I don't know how to scroll down step by step like that, so I hard-coded it like the following: browser.execute_script("window.scrollTo(0,300);") browser.find_element_by_link_text("Read More...").click() browser.execute_script("window.scrollTo(300,600);") browser.find_element_by_link_text("Read More...").click() browser.execute_script("window.scrollTo …
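A loop version of the same idea, as a sketch only: it assumes Selenium 4's By locators, a fixed scroll step, and that the buttons are plain "Read More..." links as in the question (the page URL is taken from the question's description).

    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.common.exceptions import WebDriverException

    browser = webdriver.Chrome()
    browser.get("https://www.pagalguy.com/")  # URL assumed from the question

    step = 400  # pixels per scroll step
    while True:
        # Click every "Read More..." link that is currently in the DOM.
        for link in browser.find_elements(By.LINK_TEXT, "Read More..."):
            try:
                link.click()
            except WebDriverException:
                pass  # link may be hidden, stale, or already expanded
        browser.execute_script(f"window.scrollBy(0, {step});")
        time.sleep(1)  # give lazy-loaded content a moment to appear
        at_bottom = browser.execute_script(
            "return window.innerHeight + window.pageYOffset >= document.body.scrollHeight;")
        if at_bottom:
            break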
