Escaped # in URLs, sitemap and handling by Google crawler

Submitted by 陌路散爱 on 2019-12-07 14:24:50

Question


We have a large set of URLs, some of which contain a hash character. The hash does not indicate a fragment; it is part of the URL path, so we escape it as %23, e.g.

http://example.com/example%231
http://example.com/another-example%232
…

Our sitemap.xml lists these URLs as follows:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/example%231</loc>
  </url>
  <url>
    <loc>http://example.com/another-example%232</loc>
  </url>
  <!-- and so on … -->
</urlset>

Now, the Google Search Console reports 404 errors for the following URLs:

http://example.com/example
http://example.com/another-example

Note that the %23 and everything after it were stripped. I would understand this behavior if the sitemap contained, e.g., http://example.com/example#1, but we are intentionally encoding the hash (http://example.com/example%231).

Is there anything I might be misunderstanding, or are there any special rules for escaping within sitemap.xml?


Answer 1:


Google doesn't want you to use hashes in that way. It does, however, still treat them as actual fragment identifiers, e.g. when a search result links directly to several subheadings of a Wikipedia article.

So Google probably interprets your escaped hashes as fragment delimiters and strips them, together with everything after them, from your URLs, which produces the 404s.
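To make that hypothesis concrete, here is a minimal Python sketch of how a pipeline that percent-decodes a URL before parsing it would reproduce the reported 404 URLs. This is only an illustration of the suspected failure mode: `decode_then_parse` and `request_target` are hypothetical helpers, and nothing here is confirmed behavior of Google's crawler.

```python
from urllib.parse import urlsplit, unquote

encoded = "http://example.com/example%231"  # URL as listed in the sitemap
literal = "http://example.com/example#1"    # same URL with an unescaped hash

# Parsed as-is, the escaped hash stays inside the path.
parts_encoded = urlsplit(encoded)

# An unescaped '#' is the fragment delimiter, so the path ends before it.
parts_literal = urlsplit(literal)


def decode_then_parse(url):
    """Hypothetical faulty pipeline: percent-decode first, parse second."""
    return urlsplit(unquote(url))


def request_target(parts):
    """URL as it would be requested; fragments are never sent to the server."""
    return f"{parts.scheme}://{parts.netloc}{parts.path}"


parts_faulty = decode_then_parse(encoded)

print(request_target(parts_encoded))  # http://example.com/example%231
print(request_target(parts_faulty))   # http://example.com/example
```

The second printed URL matches the 404 entry from Search Console, which is consistent with the %23 being decoded somewhere before the fragment is split off.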

XML Sitemaps follow the usual URL escaping set out in RFC 3986. There's some history around Google's deprecated use of #! URLs for Ajax that may be useful background.
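As a rough sketch of that escaping, the Python below percent-encodes each path segment and then XML-escapes the result before writing it into a `<loc>` element. The helper names `loc_for` and `build_sitemap` are made up for this example, and the base URL and segments are the ones from the question. It only shows how the entries from the question can be produced; it does not change how a crawler later interprets them.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

BASE = "http://example.com/"  # base URL taken from the question


def loc_for(segment):
    """Return the text for a <loc> element for one raw path segment."""
    # safe="" percent-encodes every reserved character, including '#' and '/'.
    encoded = quote(segment, safe="")
    # XML-escape the full URL; this matters when the URL contains '&', '<' or '>'.
    return escape(BASE + encoded)


def build_sitemap(segments):
    """Build a minimal sitemap document as a string."""
    lines = ['<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for segment in segments:
        lines.append("  <url>")
        lines.append(f"    <loc>{loc_for(segment)}</loc>")
        lines.append("  </url>")
    lines.append("</urlset>")
    return "\n".join(lines)


sitemap = build_sitemap(["example#1", "another-example#2"])
print(sitemap)
```

Encoding per segment with `safe=""` is a deliberate choice here: it keeps a literal hash or slash in a name from being read as URL structure.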



Source: https://stackoverflow.com/questions/48757020/escaped-in-urls-sitemap-and-handling-by-google-crawler
