google-crawlers

Why do search engine crawlers not run JavaScript? [closed]

Submitted by 偶尔善良 on 2019-12-01 03:12:39
I have been working on some advanced JavaScript applications that use a lot of AJAX requests to render the page. To make them crawlable (by Google), I have to follow https://developers.google.com/webmasters/ajax-crawling/?hl=fr , which tells us to redesign our links, create HTML snapshots, and so on, to make the site searchable. I wonder why crawlers don't simply run the JavaScript and index the rendered page. Is there a reason behind this, or is it a missing feature that search engines may add in the future? Even though Googlebot actually does handle sites…
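For context, the AJAX crawling scheme the question links to (deprecated by Google in 2015 and since retired) let a page without #! URLs opt in via a meta tag, after which the crawler fetched a pre-rendered HTML snapshot through a special query parameter. A minimal sketch of the opt-in:

```html
<!-- Opts a hash-less AJAX page into Google's (now-deprecated) AJAX crawling
     scheme: the crawler re-requests the URL with ?_escaped_fragment_= and
     expects the server to answer with a pre-rendered HTML snapshot. -->
<meta name="fragment" content="!">
```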

Does html5Mode(true) affect Google search crawlers?

Submitted by 元气小坏坏 on 2019-11-30 14:10:25
I'm reading this specification, an agreement between web servers and search-engine crawlers that allows dynamically created content to be visible to crawlers. It states that for a crawler to index an HTML5 application, one must implement routing using #! in URLs. With Angular's html5Mode(true) we get rid of this hashed part of the URL. I'm wondering whether this will prevent crawlers from indexing my website. Short answer: no, html5Mode will not mess up your indexing, but read on. Important note: both Google and Bing can crawl AJAX-based content without HTML…

How to tell if a web request is coming from Google's crawler?

Submitted by 与世无争的帅哥 on 2019-11-29 14:15:11
From the HTTP server's perspective: I have captured a Google crawler request in my ASP.NET application, and here is what its signature looks like. Requesting IP: 66.249.71.113. Client: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html). My logs show many different IPs for the Google crawler in the 66.249.71.* range, all geo-located in Mountain View, CA, USA. A simple check is to verify that the request's User-Agent contains Googlebot and http://www.google.com/bot.html . As I said, there are many IPs…
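The User-Agent check alone is weak, since any client can spoof that header. Google's documented way to confirm a request really comes from Googlebot is a forward-confirmed reverse DNS lookup. A minimal sketch (the helper names below are my own, not from the question):

```python
import socket

def claims_to_be_googlebot(user_agent: str) -> bool:
    """Cheap first-pass check on the User-Agent header (easily spoofed)."""
    return ("Googlebot" in user_agent
            and "http://www.google.com/bot.html" in user_agent)

def is_verified_googlebot(ip: str) -> bool:
    """Forward-confirmed reverse DNS: the PTR record must end in
    googlebot.com or google.com AND resolve back to the same IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # gethostbyname_ex returns (hostname, aliases, ip_addresses)
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False

googlebot_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
```

The two checks are deliberately separate: the User-Agent test is cheap enough to run on every request, while the DNS round-trip is only worth doing for requests that already claim to be Googlebot.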

Avoid crawling part of a page with “googleoff” and “googleon”

Submitted by 耗尽温柔 on 2019-11-29 09:13:18
I am trying to tell Google and other search engines not to crawl some parts of my web page. What I do is: <!--googleoff: all--> <select name="ddlCountry" id="ddlCountry"> <option value="All">All</option> <option value="bahrain">Bahrain</option> <option value="china">China</option> </select> <!--googleon: all--> After uploading the page, I noticed that search engines are still indexing elements within the googleoff markup. Am I doing something wrong? "googleon" and "googleoff" are only supported by the Google Search Appliance (when you host your own search results, usually for your own…

Display an article rating in Google search results

Submitted by 二次信任 on 2019-11-29 01:16:37
I'm writing a review site where the community rates posts. I have noticed that Google can pick up on these ratings and display them in its search results. Does anyone know how this is achieved? An example is a review site like IGN, which in the screenshot below indicates a review rating of 9.3/10. How can I indicate my own review rating to Google? Maybe some sort of custom meta tag. Jordy: You can do that with a span class. Check Google's Structured Data guide for Review: a review is someone's evaluation of something. We support reviews and ratings for a wide…
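The "span class" in the answer refers to microdata markup; an alternative that achieves the same result is a JSON-LD block, which is the format Google currently recommends for structured data. A sketch, where every name and value is a placeholder:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Review",
  "itemReviewed": { "@type": "Product", "name": "Example Game" },
  "reviewRating": { "@type": "Rating", "ratingValue": "9.3", "bestRating": "10" },
  "author": { "@type": "Person", "name": "Example Reviewer" }
}
</script>
```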

Passing arguments to process.crawl in Scrapy python

Submitted by 不问归期 on 2019-11-28 20:25:27
I would like to get the same result as this command line: scrapy crawl linkedin_anonymous -a first=James -a last=Bond -o output.json. My script is as follows: import scrapy from linkedin_anonymous_spider import LinkedInAnonymousSpider from scrapy.crawler import CrawlerProcess from scrapy.utils.project import get_project_settings spider = LinkedInAnonymousSpider(None, "James", "Bond") process = CrawlerProcess(get_project_settings()) process.crawl(spider) ## <-------------- (1) process.start() I found out that process.crawl() in (1) is creating another LinkedInAnonymousSpider where first and…
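The fix, per Scrapy's API, is to pass the spider *class* rather than a pre-built instance: CrawlerProcess.crawl() instantiates the spider itself and forwards its extra keyword arguments to the spider's __init__, which is why building an instance first loses the arguments. A sketch under those assumptions; run_linkedin_spider is my own wrapper, and the FEED_URI/FEED_FORMAT settings are the Scrapy ≤2.0 equivalent of -o (newer versions use the FEEDS dict instead):

```python
def run_linkedin_spider(first, last, output="output.json"):
    # Imports kept inside the function so the sketch can be read
    # without Scrapy (or the spider module) installed.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    from linkedin_anonymous_spider import LinkedInAnonymousSpider

    settings = get_project_settings()
    # Equivalent of "-o output.json" on the command line.
    settings.set("FEED_URI", output)
    settings.set("FEED_FORMAT", "json")

    process = CrawlerProcess(settings)
    # Spider *class* plus kwargs: crawl() forwards these to __init__,
    # matching "-a first=James -a last=Bond".
    process.crawl(LinkedInAnonymousSpider, first=first, last=last)
    process.start()  # blocks until the crawl finishes

if __name__ == "__main__":
    run_linkedin_spider("James", "Bond")
```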

Are robots.txt and meta tags enough to stop search engines from indexing dynamic pages that depend on $_GET variables?

Submitted by ぃ、小莉子 on 2019-11-28 14:10:15
I created a PHP page that is only accessible by means of a token/pass received through $_GET. Therefore, if you go to the following URL you'll get a generic or blank page: http://fakepage11.com/secret_page.php However, if you use the link with the token, it shows you special content: http://fakepage11.com/secret_page.php?token=344ee833bde0d8fa008de206606769e4 Of course this is not as safe as a login page, but my only concern is to create a dynamic page that is not indexable and is only accessed through the provided link. Are dynamic pages that depend on $_GET variables indexed by Google and other…
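One subtlety worth illustrating (the markup below is my own example, not from the question): a robots.txt Disallow only stops *crawling*, not *indexing* — a disallowed URL can still appear in results if it is linked from elsewhere. A noindex directive is the reliable tool here, and it only works if the page stays crawlable:

```html
<!-- In the <head> of secret_page.php. The page must NOT be blocked in
     robots.txt, or crawlers will never fetch it and never see this tag. -->
<meta name="robots" content="noindex, nofollow">
```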
