Determining an a priori ranking of what sites a user has most likely visited

Submitted by 雨燕双飞 on 2019-12-11 06:51:15

Question


This is for http://cssfingerprint.com

I have a largish database (~100M rows) of websites. It includes both main domains (2LD and 3LD) and particular URLs scraped from those domains, whether hosted there (like most blogs) or only linked from them (like Digg), each with a reference to its host domain.

I also scrape the Alexa top million, Bloglines top 1000, Google pagerank, Technorati top 100, and Quantcast top million rankings. Many domains will have no ranking though, or only a partial set; and nearly all sub-domain URLs have no ranking at all other than Google's 0-10 pagerank (some don't even have that).
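To make the "partial set of rankings" situation concrete, here is a minimal sketch of one way the heterogeneous signals could be put on a common scale. Everything in it is an assumption for illustration: the `LIST_SIZES` table, the function names, the log scaling, and the choice to average whichever signals happen to be present are not taken from the actual cssfingerprint code.

```python
import math
from typing import Dict, Optional

# Hypothetical list sizes, based on the ranking sources named above.
LIST_SIZES = {
    "alexa": 1_000_000,
    "quantcast": 1_000_000,
    "bloglines": 1_000,
    "technorati": 100,
}


def rank_to_score(rank: int, list_size: int) -> float:
    """Map a 1-based rank to a score in (0, 1]; rank 1 maps to 1.0.

    Log scaling is an assumption: it treats the gap between ranks 1 and 10
    as comparable to the gap between ranks 1,000 and 10,000.
    """
    return 1.0 - math.log(rank) / math.log(list_size + 1)


def combined_score(
    ranks: Dict[str, Optional[int]], pagerank: Optional[int] = None
) -> Optional[float]:
    """Average the signals that exist; return None when there are none."""
    scores = [
        rank_to_score(rank, LIST_SIZES[name])
        for name, rank in ranks.items()
        if rank is not None
    ]
    if pagerank is not None:
        scores.append(pagerank / 10.0)  # Google's 0-10 pagerank, rescaled to 0-1
    if not scores:
        return None  # completely unranked: the caller needs some fallback
    return sum(scores) / len(scores)
```

Averaging only the available signals is the simplest way to cope with missing values, but it lets a URL with a single strong signal outrank one with several moderate signals, so it is a starting point rather than a finished answer.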

I can add any new scrapings necessary, assuming it doesn't require a massive amount of spidering.

I also have a fair amount of information about what sites previous users have visited.

What I need is an algorithm that orders these URLs by how likely a visitor is to have visited that URL without any knowledge of the current visitor. (It can, however, use aggregated information about previous users.)
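One possible shape for such an ordering, shown only as a sketch under stated assumptions: treat each URL's aggregated history as "hits out of tests" and shrink that ratio toward a per-URL prior guess, so that URLs with little or no history fall back on the prior. The names `smoothed_visit_prob` and `rank_urls`, the `strength` parameter, and the tuple layout are all hypothetical; the prior could come from any external ranking signal.

```python
from typing import List, Tuple

# (url, hits, tests, prior): hits/tests come from aggregated data about
# previous users; prior is an assumed 0-1 guess from external rankings.
Row = Tuple[str, int, int, float]


def smoothed_visit_prob(
    hits: int, tests: int, prior: float, strength: float = 20.0
) -> float:
    """Estimate P(visited), pulled toward `prior` when data is sparse.

    `strength` acts like a number of pseudo-observations backing the prior;
    the value 20 is an arbitrary assumption that would need tuning.
    """
    return (hits + prior * strength) / (tests + strength)


def rank_urls(rows: List[Row]) -> List[Tuple[str, float]]:
    """Return (url, estimated probability) pairs, most likely first."""
    scored = [
        (url, smoothed_visit_prob(hits, tests, prior))
        for url, hits, tests, prior in rows
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For example, `rank_urls([("a.example", 0, 0, 0.1), ("b.example", 50, 100, 0.1)])` places `b.example` first, because its observed visit rate outweighs the shared prior, while `a.example` simply keeps its prior estimate.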

This question is just about the relatively fixed (or at least aggregated) a priori ranking; there's another question that deals with getting a dynamic ranking.

Given that I have limited resources (both computational and financial), what's the best way for me to rank these sites in order of a priori probability of their having been visited?

Source: https://stackoverflow.com/questions/2424701/determining-an-a-priori-ranking-of-what-sites-a-user-has-most-likely-visited
