Should a web-crawler pick up queries? [closed]

Posted by 廉价感情 on 2019-12-11 18:47:23

Question


Over the last few days I have coded a web crawler. The only question I have left is: do "standard" web crawlers crawl links with query strings, like this one: https://www.google.se/?q=stackoverflow, or do they drop the query and pick the link up like this: https://www.google.se


Answer 1:


In case you are referring to crawling for some sort of indexing of web resources:

The full answer is very long, but in short my opinion is this: if a "page/resource" such as https://www.google.se/?q=stackoverflow is pointed to by many other pages (i.e. it has a large in-link degree), then leaving it out of your index might mean that you miss a very important node in the web graph. On the other hand, imagine how many links of the form google.com/?q="query" there are on the web. Probably a huge number, so crawling all of them would certainly be a huge overhead for your crawler/indexer system.
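One way to picture this trade-off is a frontier filter that keeps every URL without a query string but keeps a query URL only once enough distinct pages link to it. This is a minimal sketch, not an established policy: the function name `select_urls`, the `(source, target)` input shape, and the threshold `min_inlinks` are all assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def select_urls(links, min_inlinks=2):
    """Pick URLs to crawl from (source_page, target_url) pairs.

    URLs without a query string are always kept. URLs with a query
    string are kept only if at least `min_inlinks` distinct source
    pages point to them (hypothetical threshold).
    """
    inlinks = Counter()
    seen_pairs = set()
    for source, target in links:
        # Count each (source, target) pair once, so a page that repeats
        # the same link does not inflate the in-link degree.
        if (source, target) not in seen_pairs:
            seen_pairs.add((source, target))
            inlinks[target] += 1

    return {
        url
        for url, count in inlinks.items()
        if not urlsplit(url).query or count >= min_inlinks
    }
```

With `min_inlinks=2`, a search URL linked from two different pages is kept, while one linked from a single page is skipped.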




Answer 2:


If the link is visited using a GET request then yes, a web crawler should crawl it.

There are still lots of websites that use the query string to identify which content is being requested, e.g. a blog with /article.php?article_id=754. If web crawlers didn't follow links like these, lots of content on the web would not get indexed.




Answer 3:


In your particular example, many websites that offer search block crawlers from their search results pages using /robots.txt.
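A crawler can honor such rules with Python's standard `urllib.robotparser`. The sketch below parses a made-up robots.txt from a list of lines so that it runs without network access; the rules, the `example.com` host and the `mycrawler` user agent are assumptions for illustration, not any real site's policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that bans search result pages for all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
# parse() takes the file's lines; a real crawler would normally call
# set_url(...) and read() to download the site's robots.txt instead.
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="mycrawler"):
    """Return True if the parsed robots.txt permits fetching `url`."""
    return parser.can_fetch(user_agent, url)
```

Here `allowed("https://example.com/search?q=stackoverflow")` is rejected by the `Disallow: /search` rule, while `allowed("https://example.com/article.php?article_id=754")` passes because no rule matches its path.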

You do need to crawl pages with cgi args, but a robust crawler has to recognize cgi args which are either irrelevant or harmful.

Crawling using urchin cgi args (utm_campaign etc.) just means you're going to see duplicate content.

Sites that add a session cgi arg to every fetch not only have duplicate content, but some especially clever sites give an error if you show up with a stale cgi arg! This makes them nearly impossible to crawl.
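The duplicate-content problem with tracking and session args can be reduced by normalizing each URL before it enters the crawl queue. The following is a minimal sketch using only `urllib.parse` and `re`; the ignore pattern is a short, hypothetical example (real lists are much longer and site-dependent), and `canonicalize` is a name chosen here for illustration.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical ignore-list: urchin/utm_* tracking args and a few common
# session-id names. A production list would be far longer.
IGNORED_ARG = re.compile(r"^(utm_[a-z]+|sid|sessionid|phpsessid)$", re.IGNORECASE)


def canonicalize(url):
    """Drop ignored query args (and the fragment) from `url`."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not IGNORED_ARG.match(key)
    ]
    # The fragment is never sent to the server, so it is dropped as well.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

Content-identifying args such as `article_id` survive, so `/article.php?article_id=754&utm_campaign=spring` and `/article.php?article_id=754` map to the same queue entry. Stripping a session arg only avoids duplicates; it does not help with sites that reject requests lacking a fresh session value.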

Some sites have links with cgi args which are dangerous to access, e.g. "delete" buttons in a publicly-editable database.

Google Webmaster Tools has a way to tell Google which cgi args should be ignored for your site, but that's not helpful to other search engines. I don't know of anyone working on a robots.txt extension for this issue.

Over the past 4 years, blekko has accreted an awful regex of args which we delete out of URLs. It's a pretty long list!



Source: https://stackoverflow.com/questions/11379486/should-a-web-crawler-pick-up-queries
