How to reverse engineer Google's entity IDs

爱⌒轻易说出口 submitted on 2020-03-05 06:29:49

Question


Google uses entities everywhere nowadays, and their IDs are usually prefixed with /m/ or /g/ (though I have also seen some /t/ lately).

I am wondering how the numbering works. For /m/ the scheme is similar to what a URL shortener would do: define an alphabet (for /m/ this is the 32 characters "0123456789bcdfghjklmnpqrstvwxyz_") and convert a number to a "short URL".

e.g. /m/04swd <-> 156524 (only "4swd" is converted; "/m/0" seems to be a kind of prefix)
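The conversion described above can be sketched as an ordinary base-32 positional encoding. This is only a minimal sketch: the function names `decode_key` and `encode_key` are made up for illustration, and the most-significant-digit-first order is an assumption that happens to reproduce the 4swd <-> 156524 example, not a documented Google or Freebase implementation.

```python
# Alphabet quoted in the question: digits, consonants, underscore (32 symbols).
ALPHABET = "0123456789bcdfghjklmnpqrstvwxyz_"
BASE = len(ALPHABET)


def decode_key(key: str) -> int:
    """Interpret key as a base-32 number, most significant digit first."""
    n = 0
    for ch in key:
        # str.index raises ValueError for characters outside the alphabet.
        n = n * BASE + ALPHABET.index(ch)
    return n


def encode_key(n: int) -> str:
    """Inverse of decode_key for non-negative integers."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, r = divmod(n, BASE)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))


# The part after the "/m/0" prefix from the example above.
print(decode_key("4swd"))   # 156524
print(encode_key(156524))   # 4swd
```

Round-tripping a handful of known pairs like this is a quick way to confirm or reject a guessed alphabet and digit order.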

I am stuck with /g/ IDs, though. I built a plausible alphabet from the IDs I have seen, "0123456789bcdfghjklmnpqrstvwxyz_", but I cannot get it to work.

Since Google does some converting itself, I have one real example: /g/11b6377dzp <-> 576462201963131861

from this: Google Search

But I still cannot figure this out.

I am mostly interested in the process of getting a handle on this reverse engineering problem (and, of course, in the result). Any ideas?


Answer 1:


You provided the same alphabet for both cases, but your question implies that they are different. That aside, here's a description of the two encoding schemes.

Quoting from the Freebase developer wiki, here's the encoding for a machine ID:

The keys of machine-generated ids are short variable-length sequences of characters consisting of digits, lower-case letters excluding vowels, and underscore. ... (By avoiding vowels, we hope to avoid accidently [sic] generating offensive identifiers.) Mids are also URL-safe, i.e. they don't require any escaping or unescaping to be used in URLs.

The Google Knowledge Graph IDs are in a separate namespace with the prefix "/g/1", as you noticed, and their format, according to the relevant Wikidata property page, is:

\/g\/1[0-9a-np-z][0-9a-np-z_]{6,8}

so the radix varies by position (no leading underscore allowed), and they chose to exclude only the confusable letter 'o', not all vowels, apparently preferring more encoding space despite the risk of "naughty words."
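As a rough illustration, the pattern can be turned into a format check with Python's `re` module. This is a sketch under stated assumptions: the names `GKG_ID`, `is_gkg_id`, `FIRST_CHARS` and `REST_CHARS` are hypothetical, and the check only tests the shape of an ID; it does not decode it to a number.

```python
import re
import string

# Pattern from the Wikidata property page, without the escaped slashes.
GKG_ID = re.compile(r"/g/1[0-9a-np-z][0-9a-np-z_]{6,8}")


def is_gkg_id(s: str) -> bool:
    """True if the whole string has the /g/1... shape described above."""
    return GKG_ID.fullmatch(s) is not None


# Per-position alphabets implied by the character classes:
# digits plus lowercase letters without 'o', and '_' only after the first position.
FIRST_CHARS = string.digits + string.ascii_lowercase.replace("o", "")
REST_CHARS = FIRST_CHARS + "_"

print(is_gkg_id("/g/11b6377dzp"))          # True
print(is_gkg_id("/g/1_b6377dzp"))          # False: leading underscore
print(len(FIRST_CHARS), len(REST_CHARS))   # 35 36
```

The two alphabet sizes make the "radix varies by position" remark concrete: 35 possible symbols in the first position and 36 in the remaining ones.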



Source: https://stackoverflow.com/questions/56008271/how-to-reverse-engineer-googles-entity-ids
