What is the best way to compute trending topics or tags?

太阳男子 2020-12-04 04:34

Many sites offer some statistics like "The hottest topics in the last 24h". For example, Topix.com shows this in its section "News Trends". There, you can see the topics

11 Answers
  •  死守一世寂寞
    2020-12-04 04:51

    Chad Birch and Adam Davis are correct in that you will have to look backward to establish a baseline. Your question, as phrased, suggests that you only want to view data from the past 24 hours, and that won't quite fly.

    One way to give your data some memory without having to query for a large body of historical data is to use an exponential moving average. The advantage of this is that you can update this once per period and then flush all old data, so you only need to remember a single value. So if your period is a day, you have to maintain a "daily average" attribute for each topic, which you can do by:

    a_n = a_(n-1)*b + c_n*(1-b)
    

    Where a_n is the moving average as of day n, b is some constant between 0 and 1 (the closer to 1, the longer the memory) and c_n is the number of hits on day n. The beauty is if you perform this update at the end of day n, you can flush c_n and a_(n-1).

    The one caveat is that it will be initially sensitive to whatever you pick for your initial value of a.
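
    The end-of-period update described above could be sketched roughly as follows. This is only an illustration: the function name `end_of_period_update`, the dictionary layout, and the `initial` default of 0 for topics seen for the first time are assumptions, not part of the original answer.

```python
def end_of_period_update(averages, hits, decay=0.9, initial=0.0):
    """Fold one period's hit counts into per-topic moving averages.

    averages: dict mapping topic -> a_(n-1), the stored moving average
    hits:     dict mapping topic -> c_n, the hits counted this period
    decay:    the constant b, between 0 and 1
    initial:  assumed starting average for a topic with no history
    """
    updated = {}
    for topic in set(averages) | set(hits):
        previous = averages.get(topic, initial)
        count = hits.get(topic, 0)  # no hits this period counts as 0
        # a_n = a_(n-1)*b + c_n*(1-b)
        updated[topic] = previous * decay + count * (1 - decay)
    return updated


averages = {"python": 1.0}
averages = end_of_period_update(averages, {"python": 5, "rust": 2})
# "python" is now about 1.4 and "rust" about 0.2; the raw counts can be discarded.
```

    Only one number per topic survives from period to period, which is the point of the technique. A topic that gets no hits simply decays toward zero.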

    EDIT

    If it helps to visualize this approach, take n = 5, a_0 = 1, and b = .9.

    Let's say the new values are 5,0,0,1,4:

    a_0 = 1
    c_1 = 5 : a_1 = .9*1 + .1*5 = 1.4
    c_2 = 0 : a_2 = .9*1.4 + .1*0 = 1.26
    c_3 = 0 : a_3 = .9*1.26 + .1*0 = 1.134
    c_4 = 1 : a_4 = .9*1.134 + .1*1 = 1.1206
    c_5 = 4 : a_5 = .9*1.1206 + .1*4 = 1.40854
    

    Doesn't look very much like an average, does it? Note how the value stayed close to 1, even though our next input was 5. What's going on? If you expand out the math, what you get is:

    a_n = (1-b)*c_n + (1-b)*b*c_(n-1) + (1-b)*b^2*c_(n-2) + ... + (leftover weight)*a_0
    

    What do I mean by leftover weight? Well, in any average, all weights must add to 1. If n were infinity and the ... could go on forever, then all weights would sum to 1. But if n is relatively small, you get a good amount of weight left on the original input.
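
    To make the leftover weight concrete: unrolling the recurrence n times leaves b^n on a_0. The sketch below computes the weights for the worked example (n = 5, b = .9) and checks that the weighted sum reproduces a_5. The helper name `ema_weights` is made up for this illustration.

```python
def ema_weights(n, b):
    """Return the weights on c_n, c_(n-1), ..., c_1 and the leftover weight on a_0."""
    input_weights = [(1 - b) * b**k for k in range(n)]
    leftover = b**n  # what remains on the initial value after n updates
    return input_weights, leftover


weights, leftover = ema_weights(5, 0.9)

# Inputs from the worked example, newest first: c_5, c_4, c_3, c_2, c_1
inputs_newest_first = [4, 1, 0, 0, 5]
a_0 = 1
a_5 = sum(w * c for w, c in zip(weights, inputs_newest_first)) + leftover * a_0

print(round(leftover, 5))  # 0.59049
print(round(a_5, 5))       # 1.40854
```

    After five updates with b = .9, about 59% of the weight still sits on the starting value, which is why the result stayed so close to 1.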

    If you study the above formula, you should realize a few things about this usage:

    1. All data contributes something to the average forever. Practically speaking, there is a point where the contribution is really, really small.
    2. Recent values contribute more than older values.
    3. The higher b is, the less important new values are and the longer old values matter. However, the higher b is, the more data you need to water down the initial value of a.
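
    Point 3 can be quantified. Since the weight left on the initial value after n updates is b^n, the number of periods needed to push it below some threshold is the smallest n with b^n < threshold. A small sketch, where the function name and the 5% threshold are arbitrary choices for illustration:

```python
import math


def periods_until_leftover_below(b, threshold):
    """Smallest n such that b**n drops below threshold (0 < b < 1, 0 < threshold < 1)."""
    return math.ceil(math.log(threshold) / math.log(b))


print(periods_until_leftover_below(0.5, 0.05))  # 5
print(periods_until_leftover_below(0.9, 0.05))  # 29
```

    So with a daily period and b = .9, it takes roughly a month of data before the starting guess stops mattering much.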

    I think the first two characteristics are exactly what you are looking for. To give you an idea of how simple this can be to implement, here is a Python implementation (minus all the database interaction):

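    >>> class EMA(object):
    ...     """Exponential moving average that prints its value after each change."""
    ...     def __init__(self, base, decay):
    ...         self.val = base      # initial value a_0
    ...         self.decay = decay   # the constant b, between 0 and 1
    ...         print("%.12g" % self.val)
    ...     def update(self, value):
    ...         # a_n = a_(n-1)*b + c_n*(1-b)
    ...         self.val = self.val*self.decay + (1-self.decay)*value
    ...         print("%.12g" % self.val)  # 12 significant digits, as in the output below
    ... 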
    >>> a = EMA(1, .9)
    1
    >>> a.update(10)
    1.9
    >>> a.update(10)
    2.71
    >>> a.update(10)
    3.439
    >>> a.update(10)
    4.0951
    >>> a.update(10)
    4.68559
    >>> a.update(10)
    5.217031
    >>> a.update(10)
    5.6953279
    >>> a.update(10)
    6.12579511
    >>> a.update(10)
    6.513215599
    >>> a.update(10)
    6.8618940391
    >>> a.update(10)
    7.17570463519
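
    One possible way to turn the stored average into a "trending" ranking is to compare the current period's hits against each topic's baseline. The answer above does not prescribe a scoring rule, so everything here is an assumption: the ratio score, the function name `trending_scores`, and the `floor` guard against tiny baselines.

```python
def trending_scores(today_hits, baselines, floor=1.0):
    """Score each topic by today's hits relative to its moving-average baseline.

    floor is a hypothetical guard so that brand-new or very quiet topics
    do not get a huge score from dividing by a near-zero baseline.
    """
    return {
        topic: count / max(baselines.get(topic, 0.0), floor)
        for topic, count in today_hits.items()
    }


baselines = {"weather": 50.0, "eclipse": 2.0}  # EMA values kept per topic
today = {"weather": 55, "eclipse": 40}         # hits in the current period

scores = trending_scores(today, baselines)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['eclipse', 'weather']
```

    A topic that is always busy ("weather") scores near 1, while one that jumps well above its own history ("eclipse") rises to the top. Other comparisons, such as a difference instead of a ratio, would fit the same baseline just as well.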
    
