google-crawlers

Why do search engine crawlers not run JavaScript? [closed]

Submitted by 偶尔善良 on 2019-12-01 03:12:39
I have been working on some advanced JavaScript applications that use a lot of AJAX requests to render the page. To make them crawlable (by Google), I have to follow https://developers.google.com/webmasters/ajax-crawling/?hl=fr , which tells us to redesign our links, create HTML snapshots, and so on, to make the site searchable. I wonder why crawlers don't simply run the JavaScript and index the rendered page. Is there a reason behind this, or is it a missing feature that search engines may add in the future? Even though Googlebot actually does handle sites…
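For context, the AJAX crawling scheme the question links to (deprecated by Google in 2015 and since retired) let a page without #! URLs opt in via a meta tag, after which the crawler fetched a pre-rendered HTML snapshot through a special query parameter. A minimal sketch of the opt-in:

```html
<!-- Opts a hash-less AJAX page into Google's (now-deprecated) AJAX crawling
     scheme: the crawler re-requests the URL with ?_escaped_fragment_= and
     expects the server to answer with a pre-rendered HTML snapshot. -->
<meta name="fragment" content="!">
```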

Does html5Mode(true) affect Google search crawlers?

Submitted by 元气小坏坏 on 2019-11-30 14:10:25
I'm reading this specification, an agreement between web servers and search-engine crawlers that allows dynamically created content to be visible to crawlers. It states that for a crawler to index an HTML5 application, one must implement routing using #! in URLs. With Angular's html5Mode(true) we get rid of this hashed part of the URL. I'm wondering whether this will prevent crawlers from indexing my website. Short answer: no, html5Mode will not mess up your indexing, but read on. Important note: both Google and Bing can crawl AJAX-based content without HTML…

How to tell if a web request is coming from Google's crawler?

Submitted by 与世无争的帅哥 on 2019-11-29 14:15:11
From the HTTP server's perspective: I have captured a Google crawler request in my ASP.NET application, and here is what its signature looks like. Requesting IP: 66.249.71.113. Client: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html). My logs show many different IPs for the Google crawler in the 66.249.71.* range, all geo-located in Mountain View, CA, USA. A simple check is to verify that the request's User-Agent contains Googlebot and http://www.google.com/bot.html . As I said, there are many IPs…
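The User-Agent check alone is weak, since any client can spoof that header. Google's documented way to confirm a request really comes from Googlebot is a forward-confirmed reverse DNS lookup. A minimal sketch (the helper names below are my own, not from the question):

```python
import socket

def claims_to_be_googlebot(user_agent: str) -> bool:
    """Cheap first-pass check on the User-Agent header (easily spoofed)."""
    return ("Googlebot" in user_agent
            and "http://www.google.com/bot.html" in user_agent)

def is_verified_googlebot(ip: str) -> bool:
    """Forward-confirmed reverse DNS: the PTR record must end in
    googlebot.com or google.com AND resolve back to the same IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # gethostbyname_ex returns (hostname, aliases, ip_addresses)
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False

googlebot_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
```

The two checks are deliberately separate: the User-Agent test is cheap enough to run on every request, while the DNS round-trip is only worth doing for requests that already claim to be Googlebot.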

Avoid crawling part of a page with “googleoff” and “googleon”

Submitted by 耗尽温柔 on 2019-11-29 09:13:18
I am trying to tell Google and other search engines not to crawl some parts of my web page. What I do is: <!--googleoff: all--> <select name="ddlCountry" id="ddlCountry"> <option value="All">All</option> <option value="bahrain">Bahrain</option> <option value="china">China</option> </select> <!--googleon: all--> After uploading the page, I noticed that search engines are still indexing elements within the googleoff markup. Am I doing something wrong? "googleon" and "googleoff" are only supported by the Google Search Appliance (when you host your own search results, usually for your own…

Display an article rating in Google search results

Submitted by 二次信任 on 2019-11-29 01:16:37
I'm writing a review site where the community rates posts. I have noticed that Google can pick up on these ratings and display them in its search results. Does anyone know how this is achieved? An example is a review site like IGN, which in the screenshot below indicates a review rating of 9.3/10. How can I indicate my own review rating to Google? Maybe some sort of custom meta tag. Jordy: You can do that with a span class. Check Google's Structured Data guide for Review: a review is someone's evaluation of something. We support reviews and ratings for a wide…
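The "span class" in the answer refers to microdata markup; an alternative that achieves the same result is a JSON-LD block, which is the format Google currently recommends for structured data. A sketch, where every name and value is a placeholder:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Review",
  "itemReviewed": { "@type": "Product", "name": "Example Game" },
  "reviewRating": { "@type": "Rating", "ratingValue": "9.3", "bestRating": "10" },
  "author": { "@type": "Person", "name": "Example Reviewer" }
}
</script>
```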

Passing arguments to process.crawl in Scrapy python

Submitted by 不问归期 on 2019-11-28 20:25:27
I would like to get the same result as this command line: scrapy crawl linkedin_anonymous -a first=James -a last=Bond -o output.json. My script is as follows: import scrapy from linkedin_anonymous_spider import LinkedInAnonymousSpider from scrapy.crawler import CrawlerProcess from scrapy.utils.project import get_project_settings spider = LinkedInAnonymousSpider(None, "James", "Bond") process = CrawlerProcess(get_project_settings()) process.crawl(spider) ## <-------------- (1) process.start() I found out that process.crawl() in (1) is creating another LinkedInAnonymousSpider where first and…
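The fix, per Scrapy's API, is to pass the spider *class* rather than a pre-built instance: CrawlerProcess.crawl() instantiates the spider itself and forwards its extra keyword arguments to the spider's __init__, which is why building an instance first loses the arguments. A sketch under those assumptions; run_linkedin_spider is my own wrapper, and the FEED_URI/FEED_FORMAT settings are the Scrapy ≤2.0 equivalent of -o (newer versions use the FEEDS dict instead):

```python
def run_linkedin_spider(first, last, output="output.json"):
    # Imports kept inside the function so the sketch can be read
    # without Scrapy (or the spider module) installed.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    from linkedin_anonymous_spider import LinkedInAnonymousSpider

    settings = get_project_settings()
    # Equivalent of "-o output.json" on the command line.
    settings.set("FEED_URI", output)
    settings.set("FEED_FORMAT", "json")

    process = CrawlerProcess(settings)
    # Spider *class* plus kwargs: crawl() forwards these to __init__,
    # matching "-a first=James -a last=Bond".
    process.crawl(LinkedInAnonymousSpider, first=first, last=last)
    process.start()  # blocks until the crawl finishes

if __name__ == "__main__":
    run_linkedin_spider("James", "Bond")
```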

Are robots.txt and meta tags enough to stop search engines from indexing dynamic pages that depend on $_GET variables?

Submitted by ぃ、小莉子 on 2019-11-28 14:10:15
I created a PHP page that is only accessible by means of a token/pass received through $_GET. Therefore, if you go to the following URL you'll get a generic or blank page: http://fakepage11.com/secret_page.php However, if you use the link with the token, it shows you special content: http://fakepage11.com/secret_page.php?token=344ee833bde0d8fa008de206606769e4 Of course this is not as safe as a login page, but my only concern is to create a dynamic page that is not indexable and is only accessed through the provided link. Are dynamic pages that depend on $_GET variables indexed by Google and other…
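One subtlety worth illustrating (the markup below is my own example, not from the question): a robots.txt Disallow only stops *crawling*, not *indexing* — a disallowed URL can still appear in results if it is linked from elsewhere. A noindex directive is the reliable tool here, and it only works if the page stays crawlable:

```html
<!-- In the <head> of secret_page.php. The page must NOT be blocked in
     robots.txt, or crawlers will never fetch it and never see this tag. -->
<meta name="robots" content="noindex, nofollow">
```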
