Designing an access/web statistics counter module for App Engine

情书的邮戳 2021-01-15 03:04

I need an access statistics module for App Engine that tracks a few request handlers and collects statistics to Bigtable. I have not found any ready-made solution on GitHub.

3 Answers
  • 2021-01-15 03:35

    Here is the code implementing the task-queue approach with an hourly timeframe. Interestingly, it works without transactions or other mutex magic.

    Supporting priorities for memcache entries would increase the accuracy of this solution.

    import datetime
    import logging

    from google.appengine.api import memcache, taskqueue
    from google.appengine.ext import ndb

    # Task handler URL; _add_task() appends the counter key and query string,
    # e.g. TASK_URL + counter_key + "?groupId=" + groupId + "&countableId=" + countableId
    TASK_URL = '/h/statistics/collect/'
    MEMCACHE_PREFIX = "StatisticsDB_"


    class StatisticsDB(ndb.Model):
        """
        Memcached counting, saved each hour to the DB.
        """
        # key.id() = 2016-01-31-17_groupId_countableId
        countableId = ndb.StringProperty(required=True)  # unique name of counter within group
        groupId = ndb.StringProperty()  # counter group (allows a single DB query for a group of counters)
        count = ndb.IntegerProperty(default=0)  # count per timeframe

        @classmethod
        def increment(cls, groupId, countableId):  # throws InvalidTaskNameError
            """
            Increment a counter. countableId is the unique id of the countable.
            Throws InvalidTaskNameError if the ids do not match [a-zA-Z0-9-_]{1,500}.
            """
            # Calculate the memcache key (and later the db key) for this hour.
            # The counting timeframe is 1h, determined by %H; MUST MATCH the ETA
            # calculation in _add_task().
            counter_key = datetime.datetime.utcnow().strftime("%Y-%m-%d-%H") + "_" + groupId + "_" + countableId
            client = memcache.Client()

            n = client.incr(MEMCACHE_PREFIX + counter_key)
            if n is None:
                # First increment in this timeframe: schedule the save-to-DB task,
                # then create the counter (initial_value=0, so incr sets it to 1).
                cls._add_task(counter_key, groupId, countableId)
                client.incr(MEMCACHE_PREFIX + counter_key, initial_value=0)

        @classmethod
        def _add_task(cls, counter_key, groupId, countableId):
            taskurl = TASK_URL + counter_key + "?groupId=" + groupId + "&countableId=" + countableId
            now = datetime.datetime.utcnow()  # task ETAs are interpreted as UTC; MUST MATCH counter_key above
            # The counting timeframe is 1h, determined by counter_key; MUST MATCH the key calculation.
            eta = now + datetime.timedelta(minutes=(61 - now.minute))  # at most 1h later, spread over 1 minute, throttled by queue parameters
            task = taskqueue.Task(url=taskurl, method='GET', name=MEMCACHE_PREFIX + counter_key, eta=eta)
            queue = taskqueue.Queue(name='StatisticsDB')
            try:
                queue.add(task)
            except taskqueue.TaskAlreadyExistsError:  # may also occur if 2 increments are done simultaneously
                logging.warning("StatisticsDB TaskAlreadyExistsError lost memcache for %s", counter_key)
            except taskqueue.TombstonedTaskError:  # the task name is still locked from an earlier run
                logging.warning("StatisticsDB TombstonedTaskError: someone ran this task prematurely by hand for %s", counter_key)

        @classmethod
        def save2db_task_handler(cls, counter_key, countableId, groupId):
            """
            Save a counter from memcache to the DB. Idempotent method.
            By the time this executes no more increments to this counter occur.
            """
            dbkey = ndb.Key(StatisticsDB, counter_key)

            n = memcache.get(MEMCACHE_PREFIX + counter_key)
            if n is None:
                logging.warning("StatisticsDB lost count for %s", counter_key)
                return

            stats = StatisticsDB(key=dbkey, count=n, countableId=countableId, groupId=groupId)
            stats.put()
            memcache.delete(MEMCACHE_PREFIX + counter_key)  # delete only if the put succeeded
            logging.info("StatisticsDB saved %s n = %i", counter_key, n)
    
  • 2021-01-15 03:49

    Yep, your #2 idea seems to best address your requirements.

    To implement it you need task execution with a specified delay.

    I used the deferred library for this purpose, via deferred.defer()'s _countdown argument. I learned in the meantime that the standard taskqueue library has similar support, via the countdown argument of the Task constructor (I have yet to use this approach, though).

    So whenever you create a memcache counter, also enqueue a delayed execution task (passing the counter's memcache key in its payload) which will (see the sketch after this list):

    • get the memcache counter value using the key from the task payload
    • add the value to the corresponding db counter
    • delete the memcache counter when the db update is successful
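
    A minimal sketch of that flow on the Python 2 runtime, assuming a hypothetical DbCounter model and key scheme (names are illustrative, not from the answer):

    from google.appengine.api import memcache
    from google.appengine.ext import deferred, ndb

    class DbCounter(ndb.Model):  # hypothetical persistent counter
        count = ndb.IntegerProperty(default=0)

    def _flush_counter(mc_key, counter_id):
        """Delayed task: move the memcache count into the datastore counter."""
        n = memcache.get(mc_key)
        if n is None:
            return  # memcache was evicted; the count is lost
        counter = DbCounter.get_by_id(counter_id) or DbCounter(id=counter_id)
        counter.count += n
        counter.put()
        memcache.delete(mc_key)  # only after the put succeeded

    def increment(counter_id):
        mc_key = 'counter_' + counter_id
        if memcache.incr(mc_key, initial_value=0) == 1:
            # This request created the counter: schedule its flush in one hour.
            deferred.defer(_flush_counter, mc_key, counter_id, _countdown=3600)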

    You'll probably lose increments from concurrent requests that land between the task reading the memcache counter and the counter being deleted. You could reduce such losses by deleting the memcache counter immediately after reading it, but then you'd risk losing the entire count if the DB update fails for whatever reason: a re-tried task would no longer find the memcache counter. If neither trade-off is satisfactory you could refine the solution further:

    The delayed task:

    • reads the memcache counter value
    • enqueues another (transactional) task (with no delay) for adding the value to the db counter
    • deletes the memcache counter

    The non-delayed task is now idempotent and can be safely re-tried until successful.

    The risk of loss of increments from concurrent requests still exists, but I guess it's smaller.
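
    One possible shape of that refinement, sketched below. The per-timeframe model and the idempotency-by-overwrite trick are my assumptions, and the transactional enqueueing is elided for brevity:

    from google.appengine.api import memcache
    from google.appengine.ext import deferred, ndb

    class HourlyCount(ndb.Model):  # hypothetical per-timeframe entity
        count = ndb.IntegerProperty(default=0)

    def _flush(mc_key, hour_key):
        """Delayed task: hand the value to a follow-up task, then drop memcache."""
        n = memcache.get(mc_key)
        if n is None:
            return
        deferred.defer(_save, hour_key, n)  # the non-delayed task
        memcache.delete(mc_key)

    def _save(hour_key, n):
        # Idempotent: a retry rewrites the same entity with the same value,
        # instead of incrementing a shared counter twice.
        HourlyCount(id=hour_key, count=n).put()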

    Update:

    Task Queues are preferable to the deferred library; the deferred functionality is available via the optional countdown or eta arguments to taskqueue.add() (see the example after these excerpts):

    • countdown -- Time in seconds into the future that this task should run or be leased. Defaults to zero. Do not specify this argument if you specified an eta.

    • eta -- A datetime.datetime that specifies the absolute earliest time at which the task should run. You cannot specify this argument if the countdown argument is specified. This argument can be time zone-aware or time zone-naive, or set to a time in the past. If the argument is set to None, the default value is now. For pull tasks, no worker can lease the task before the time indicated by the eta argument.
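
    For example, using the countdown form (the URL is hypothetical):

    from google.appengine.api import taskqueue

    # Enqueue a task to run roughly one hour from now; countdown is in seconds.
    taskqueue.add(url='/tasks/flush-counter', method='GET', countdown=3600)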

  • 2021-01-15 03:55

    Counting things in a distributed system is a hard problem. There's some good info on the problem from the early days of App Engine. I'd start with Sharding Counters, which, despite being written in 2008, is still relevant.
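
    For reference, a minimal sharded counter along the lines of that article (the names and shard count here are illustrative):

    import random

    from google.appengine.ext import ndb

    NUM_SHARDS = 20  # more shards = more write throughput, but slower reads

    class CounterShard(ndb.Model):
        count = ndb.IntegerProperty(default=0)

    @ndb.transactional
    def increment(name):
        # Spread writes over NUM_SHARDS entities so concurrent requests
        # rarely contend on the same entity group.
        shard_id = '%s-%d' % (name, random.randint(0, NUM_SHARDS - 1))
        shard = CounterShard.get_by_id(shard_id) or CounterShard(id=shard_id)
        shard.count += 1
        shard.put()

    def get_count(name):
        keys = [ndb.Key(CounterShard, '%s-%d' % (name, i)) for i in range(NUM_SHARDS)]
        return sum(s.count for s in ndb.get_multi(keys) if s is not None)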
