How to mitigate against long startup times in firebase workers when dataset gets large

Firebase has an interesting feature/nuisance where when you listen on a data ref, you get all the data that was ever added to that ref. So, for example, when you listen on 'child_added', you get a replay of all the children that were added to that ref from the beginning of time. We are writing a commenting system with a dataset that looks something like this:

/comments
/sites
/sites/articles
/users

Sites have many articles and articles have many comments and users have many comments.

We want to be able to track all the comments a user makes, so we feel it is wise to put comments in a separate ref rather than partition them by the articles they belong to. We have a backend listener that needs to do things on new comments as they arrive (increment their child counts, adjust a user's stats etc.). My concern is that, after a while, it will take this listener a long time to start up if it has to process a replay of every comment ever made.

I thought about possibly storing comments only in articles and storing references to each comment's siteId/articleId/commentId in the user table so we could still find all the comments for a given user, but this complicates the backend, as it would then probably need to have a separate listener for each site or even each article, which could make it difficult to manage so many listeners.

Imagine if one of these articles is on a very high-traffic site with tens of thousands of articles and thousands of comments per article. Is the scaling answer to somehow keep track of the traffic levels of every site and set up and partition them in a way that they are assigned to different worker processes? And what about the question of startup time and how long it takes to replay all data every time we load up our workers?

Adding on to Frank's answer, here are a couple other possibilities.

Use a queue strategy

Since the workers really expect to process one-time events, give them one-time events that they can pull from a queue and delete once they finish processing. This handles the multiple-worker scenario elegantly and ensures nothing is missed because a server was offline.
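A minimal sketch of the queue idea, under some assumptions: it targets the namespaced Firebase JS SDK (v3 or later, where snapshot.key and snapshot.ref are properties), and the names startQueueWorker, queueRef and processTask are hypothetical. It handles a single worker only; with several workers, each task would also need to be claimed first (for example with a transaction), which is left out here.

```javascript
// Sketch only: `queueRef` is assumed to be a Realtime Database reference to a
// queue node (the path is up to you), and `processTask` is your own handler,
// which may return a promise.
function startQueueWorker(queueRef, processTask) {
  // child_added fires for every task already waiting and for each new one.
  queueRef.on('child_added', function (snapshot) {
    new Promise(function (resolve) {
      resolve(processTask(snapshot.val(), snapshot.key));
    })
      .then(function () {
        // Delete the task only after processing succeeded, so a crash
        // mid-task leaves it in the queue for the next startup.
        return snapshot.ref.remove();
      })
      .catch(function (err) {
        console.error('Task ' + snapshot.key + ' failed; left in queue:', err);
      });
  });
}
```

Because finished tasks are removed, the replay on startup only contains work that is still pending, however large the processed history grows.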

Utilize a timestamp to reduce backlog

A simple strategy for avoiding backlog during reboot/startup of the workers is to add a timestamp to all of the events and then do something like the following:

var startTime = Date.now() - 3600 * 1000; // an hour ago (Date.now() is in milliseconds)
pathRef.orderByChild('timestamp').startAt(startTime);

Keep track of the last id processed

This only works well with push ids, since formats that do not sort naturally by key will likely become out of order at some point in the future.

When processing records, have your worker keep track of the last record it processed by writing that key into Firebase. Then one can use orderByKey().startAt( lastKeyProcessed ) to avoid the backlog. Annoyingly, startAt is inclusive, so we then have to discard the first key returned. However, this is an efficient query, does not cost data storage for an index, and is quick to implement.
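Roughly, the checkpoint approach could look like the sketch below. It assumes the namespaced Firebase JS SDK (v3 or later), push-id keys, and a synchronous processComment; resumeFromCheckpoint, commentsRef and checkpointRef are hypothetical names, with checkpointRef pointing at wherever you choose to store the last processed key.

```javascript
// Sketch only: both refs are assumed to be Realtime Database references.
function resumeFromCheckpoint(commentsRef, checkpointRef, processComment) {
  return checkpointRef.once('value').then(function (checkpointSnap) {
    var lastKey = checkpointSnap.val(); // null on the very first run
    var query = commentsRef.orderByKey();
    if (lastKey !== null) {
      query = query.startAt(lastKey); // inclusive, so lastKey itself is replayed
    }
    query.on('child_added', function (snapshot) {
      if (snapshot.key === lastKey) {
        return; // already handled before the restart
      }
      processComment(snapshot.val(), snapshot.key);
      checkpointRef.set(snapshot.key); // record progress for the next startup
    });
  });
}
```

If processing is asynchronous, the checkpoint write would need to wait until the work has actually completed, otherwise a crash could skip a record.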

If you only need to process new comments once, you can put them in a separate list, e.g. newComments vs. comments (the ones that have been processed). Then, when you're done processing, move them from newComments to comments.
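One way to do the move, sketched under the assumption that a multi-location update is available in your SDK version: write the comment to its new location and delete the old one in a single update() call on the root reference. The function name moveToProcessed and the parameter names are hypothetical.

```javascript
// Sketch only: `rootRef` is assumed to be a reference to the database root,
// so the keys below are paths relative to it.
function moveToProcessed(rootRef, commentId, comment) {
  var updates = {};
  updates['comments/' + commentId] = comment;  // write the processed copy
  updates['newComments/' + commentId] = null;  // null deletes the pending copy
  // Both writes are sent as one multi-location update.
  return rootRef.update(updates);
}
```

Doing both writes in one update avoids a window in which the comment exists in both lists or in neither.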

Alternatively you can keep all comments in a single list like you have today and add a field (e.g. "isNew") to it that you set to true initially. Then you can filter with orderByChild('isNew').equalTo(true) and update({ isNew: false }) once you're done with processing.
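Put together, the flag-based variant might look like the following sketch. It assumes the namespaced Firebase JS SDK (v3 or later) and a synchronous processComment; processFlaggedComments is a hypothetical name. Adding an index on isNew in the database rules is probably worthwhile so the filtering happens on the server.

```javascript
// Sketch only: `commentsRef` is assumed to be a reference to the comments list.
function processFlaggedComments(commentsRef, processComment) {
  commentsRef
    .orderByChild('isNew')
    .equalTo(true)
    .on('child_added', function (snapshot) {
      processComment(snapshot.val(), snapshot.key);
      // Clear the flag so this comment is not replayed on the next startup.
      snapshot.ref.update({ isNew: false });
    });
}
```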

Source: https://stackoverflow.com/questions/28617476/how-to-mitigate-against-long-startup-times-in-firebase-workers-when-dataset-gets

Tags

firebase

scalability