web-crawler

Node.JS: How to pass variables to asynchronous callbacks? [duplicate]

浪尽此生 submitted on 2019-11-28 16:31:22
This question already has an answer here: JavaScript closure inside loops – simple practical example (43 answers). I'm sure my problem is based on a lack of understanding of async programming in Node.js, but here goes. For example: I have a list of links I want to crawl. When each async request returns, I want to know which URL it is for. But, presumably because of race conditions, each request returns with the URL set to the last value in the list.

    var links = ['http://google.com', 'http://yahoo.com'];
    for (link in links) {
        var url = links[link];
        require('request')(url, function() {
            console.log(url);
        });
    }

Python: Disable images in Selenium Google ChromeDriver

人走茶凉 submitted on 2019-11-28 16:27:13
I spent a lot of time searching about this. At the end of the day I combined a number of answers and it works. I'm sharing my answer, and I'd appreciate it if anyone edits it or provides an easier way to do this. 1- The answer in Disable images in Selenium Google ChromeDriver works in Java, so we should do the same thing in Python:

    opt = webdriver.ChromeOptions()
    opt.add_extension("Block-image_v1.1.crx")
    browser = webdriver.Chrome(chrome_options=opt)

2- But downloading "Block-image_v1.1…
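A minimal sketch of an alternative that skips the .crx extension entirely, using Chrome's content-settings preference to block images; the preference key and value are the commonly cited ones, so treat them as an assumption and verify against your Chrome and Selenium versions:

    from selenium import webdriver

    # Assumed preference: 2 = block images (commonly cited for Chrome).
    opts = webdriver.ChromeOptions()
    opts.add_experimental_option(
        "prefs", {"profile.managed_default_content_settings.images": 2}
    )

    # Newer Selenium releases take the options via the `options` keyword.
    browser = webdriver.Chrome(options=opts)
    browser.get("https://example.com")  # pages should now load without images

This avoids having to ship a separate extension file alongside the script.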

Automated link-checker for system testing [closed]

不想你离开。 submitted on 2019-11-28 15:45:37
Closed as off-topic 5 years ago. I often have to work with fragile legacy websites that break in unexpected ways when logic or configuration is updated. I don't have the time or knowledge of the system needed to create a Selenium script. Besides, I don't want to check a specific use case; I want to verify every link and page on the site. I…
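For the simplest version of that check, a small script that walks one site and reports broken pages covers a lot of ground. A rough sketch, assuming requests and BeautifulSoup are available and using example.com as a placeholder start URL:

    from urllib.parse import urljoin, urlparse
    import requests
    from bs4 import BeautifulSoup

    START = "https://example.com/"   # placeholder: the site under test
    seen, queue = set(), [START]

    while queue:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        resp = requests.get(url, timeout=10)
        if resp.status_code >= 400:
            print(resp.status_code, url)   # broken page or link target
            continue
        # only follow links that stay on the same host
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            target = urljoin(url, a["href"])
            if urlparse(target).netloc == urlparse(START).netloc:
                queue.append(target)

Off-the-shelf options such as wget's --spider mode do a similar job with no code at all.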

Get a list of URLs from a site [closed]

一个人想着一个人 submitted on 2019-11-28 15:22:25
I'm deploying a replacement site for a client, but they don't want all their old pages to end in 404s. Keeping the old URL structure wasn't possible because it was hideous. So I'm writing a 404 handler that should look for an old page being requested and do a permanent redirect to the new page. The problem is, I need a list of all the old page URLs. I could do this manually, but I'd be interested if there are any apps that would provide me a list of relative URLs (e.g. /page/path, not http:/.../page/path) just given the home page. Like a spider, but one that doesn't care about the content other than…
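If no ready-made app turns up, a crawler that records only the relative paths is a few lines of Python. A sketch under the same assumptions as the link-checker above (requests and BeautifulSoup installed; the old site reachable at a placeholder address):

    from urllib.parse import urljoin, urlparse
    import requests
    from bs4 import BeautifulSoup

    HOME = "https://old-site.example/"   # placeholder for the old site's home page
    paths, queue = set(), [HOME]

    while queue:
        url = queue.pop()
        path = urlparse(url).path or "/"
        if path in paths:
            continue
        paths.add(path)
        html = requests.get(url, timeout=10).text
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            target = urljoin(url, a["href"])
            if urlparse(target).netloc == urlparse(HOME).netloc:
                queue.append(target)

    print("\n".join(sorted(paths)))   # the relative URLs to feed the 404 handler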

How to request Google to re-crawl my website? [closed]

丶灬走出姿态 submitted on 2019-11-28 14:56:39
Does anyone know a way to request Google to re-crawl a website? If possible, this shouldn't take months. My site is showing an old title in Google's search results. How can I get it to show the correct title and description? kevinmicke: There are two options. The first (and better) one is using the Fetch as Google option in Webmaster Tools that Mike Flynn commented about. Here are detailed instructions: Go to https://www.google.com/webmasters/tools/ and log in. If you haven't already, add and verify the site with the "Add a Site" button. Click on the site name for the one you want to manage. Click…

A very simple C++ web crawler/spider?

随声附和 submitted on 2019-11-28 14:56:38
I am trying to write a very simple web crawler/spider app in C++. I searched Google for a simple one to understand the concept, and I found this: http://www.example-code.com/vcpp/spider.asp But it's a bit too complicated/hard for me to digest. What I am trying to do is, for example: enter the URL www.example.com (I will use bash/wget to get the contents/source code), then look for, maybe, an "a href" link, and then store it in some data file. Any simple tutorial or guidelines…
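The pipeline described (fetch a page with wget, pull out the href values, append them to a data file) can be sketched in a few lines to make the concept concrete before porting it to C++; this is an illustrative Python sketch, not C++, and the regex is deliberately naive:

    import re
    import subprocess

    url = "http://www.example.com"

    # Fetch the page source by shelling out to wget, as described above.
    html = subprocess.run(
        ["wget", "-qO-", url], capture_output=True, text=True
    ).stdout

    # Naive extraction of href targets from anchor tags.
    links = re.findall(r'<a\s[^>]*href="([^"]+)"', html, re.IGNORECASE)

    # Store the discovered links in a data file, one per line.
    with open("links.txt", "a") as f:
        f.write("\n".join(links) + "\n")

The C++ version is the same three steps: fetch, scan for href, write to file.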

Are Robots.txt and metadata tags enough to stop search engines from indexing dynamic pages that depend on $_GET variables?

ぃ、小莉子 submitted on 2019-11-28 14:10:15
I created a PHP page that is only accessible by means of a token/pass received through $_GET. Therefore, if you go to the following URL you'll get a generic or blank page: http://fakepage11.com/secret_page.php However, if you use the link with the token, it shows you special content: http://fakepage11.com/secret_page.php?token=344ee833bde0d8fa008de206606769e4 Of course this is not as safe as a login page, but my only concern is to create a dynamic page that is not indexable and is only accessed through the provided link. Are dynamic pages that depend on $_GET variables indexed by Google and other…
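For reference, the two usual signals look roughly like this; note that a robots.txt Disallow only asks crawlers not to fetch the page (the URL can still show up in results if it is linked from elsewhere), while the meta tag has to be served on the page itself for crawlers to see the de-indexing request:

    # robots.txt at the site root
    User-agent: *
    Disallow: /secret_page.php

    <!-- in the <head> of the page itself -->
    <meta name="robots" content="noindex, nofollow">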

Getting text between all tags in a given html and recursively going through links

▼魔方 西西 submitted on 2019-11-28 11:55:46
I have checked a couple of posts on Stack Overflow regarding getting all the words between all the HTML tags, and all of them confused me. Some people recommend regular expressions specifically for a single tag, while others mention parsing techniques. I am basically trying to make a web crawler. For that, I have the HTML of the link I fetched in my program as a string, and I have also extracted the links from that HTML into my data string. Now I want to crawl through the depth and…
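A parser is usually the less painful route here. A small sketch with BeautifulSoup (assumed to be available) that takes the visible text of a page and then recurses one level deeper into its links, with a placeholder depth cap:

    from urllib.parse import urljoin
    import requests
    from bs4 import BeautifulSoup

    def crawl(url, depth=2, seen=None):
        seen = seen if seen is not None else set()
        if depth == 0 or url in seen:
            return
        seen.add(url)
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        words = soup.get_text(separator=" ").split()   # text between all tags
        print(url, len(words), "words")
        for a in soup.find_all("a", href=True):        # recurse into the page's links
            crawl(urljoin(url, a["href"]), depth - 1, seen)

    crawl("https://example.com", depth=2)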

Make a JavaScript-aware Crawler

蓝咒 submitted on 2019-11-28 11:47:38
I want to make a script that crawls a website and returns the locations of all the banners shown on that page. The banner locations are most of the time on known domains, but the banners are not in the HTML as a simple image or .swf file; most of the time JavaScript is used to show the banner. So if a .swf file or image file is loaded from a banner domain, the script should return that URL. Is that possible to do, and how could I do it, roughly? Best would be if it could also return the landing page of that ad. How to solve that? You could use selenium to open the pages in a real…
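Continuing that idea: once a real browser has executed the page's JavaScript, the injected banner elements can be read back out of the live DOM. A sketch with Selenium's Python bindings, where the banner-domain list is a placeholder:

    from urllib.parse import urlparse
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    BANNER_DOMAINS = {"ads.example.net"}   # placeholder list of known ad domains

    driver = webdriver.Chrome()
    driver.get("https://example.com")      # JS runs here, banners get injected

    for tag, attr in (("img", "src"), ("iframe", "src"), ("embed", "src"), ("a", "href")):
        for el in driver.find_elements(By.TAG_NAME, tag):
            value = el.get_attribute(attr)
            if value and urlparse(value).netloc in BANNER_DOMAINS:
                print(tag, value)          # banner asset or its landing-page link

    driver.quit()

This only sees what ends up in the DOM; assets loaded purely over the network would need the browser's performance/network logs instead.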

Making my own web crawler in Python which shows the main idea of PageRank

守給你的承諾、 submitted on 2019-11-28 11:29:25
I'm trying to make a web crawler which shows the basic idea of PageRank. The code seems fine to me, but it gives me back errors, e.g.:

    Traceback (most recent call last):
      File "C:/Users/Janis/Desktop/WebCrawler/Web_crawler.py", line 89, in <module>
        webpages()
      File "C:/Users/Janis/Desktop/WebCrawler/Web_crawler.py", line 17, in webpages
        get_single_item_data(href)
      File "C:/Users/Janis/Desktop/WebCrawler/Web_crawler.py", line 23, in get_single_item_data
        source_code = requests.get(item_url)
      File "C:…
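The traceback is cut off before the actual exception, but a frequent failure at exactly this call is passing requests.get a relative href (or None) scraped from the page. If that turns out to be the cause, resolving the link against the page it came from avoids it; a hypothetical sketch of that fix, with base_url as an assumed extra parameter:

    from urllib.parse import urljoin
    import requests

    def get_single_item_data(item_url, base_url="https://example.com/"):
        # Relative hrefs such as "/page2.html" make requests.get raise an
        # error about the missing scheme; join them with the page they were
        # found on before requesting.
        absolute_url = urljoin(base_url, item_url)
        source_code = requests.get(absolute_url)
        return source_code.text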