Question
I am extremely new to Spark and have a specific workflow question. It is not really a coding question but more a question about Spark functionality, so I thought it would be appropriate here. Please feel free to redirect me to the correct site if you think this question is inappropriate for SO.
So here goes:

1. I am planning to consume a stream of requests using Spark's sliding-window functionality and calculate a recommendation model. Once the model is calculated, would it be possible for a web service to query and consume this data directly from an RDD? If so, could anyone point me toward some example code of how this can be achieved?
2. If not, I would like to store the data in memcached. The data I am storing is currently not too large; I am using Spark mostly for its in-memory iterative calculation and streaming support. So is it possible to load RDD data into memcached? I ask because I could only find a Mongo connector for Spark and couldn't find a memcached connector.
Any help, and especially specific code examples/links, would be much appreciated.
Thanks in advance.
Answer 1:
You can't query an RDD directly in this way. Think of your Spark job as a stream processor. What you can do is push the updated model to some "store", such as a database (with a custom API or JDBC), a file system, or memcached. You could even make a web service call from within the Spark code.
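To illustrate the "push to a store" approach for the memcached case, here is a minimal sketch that writes each record of a model RDD into memcached from inside the Spark job. It assumes the spymemcached client library and a hypothetical `(itemId, score)` record shape; treat it as a pattern under those assumptions, not a drop-in implementation.

```scala
import java.net.InetSocketAddress
import net.spy.memcached.MemcachedClient

// modelRdd is assumed to hold (itemId, score) pairs produced by the
// recommendation calculation; the names are illustrative only.
modelRdd.foreachPartition { partition =>
  // Create one client per partition, on the executor, so the
  // (non-serializable) connection object never has to be shipped.
  val client = new MemcachedClient(new InetSocketAddress("memcached-host", 11211))
  partition.foreach { case (itemId, score) =>
    // set(key, expirationSeconds, value); 0 means "never expire".
    client.set(s"model:$itemId", 0, score.toString)
  }
  client.shutdown()
}
```

Opening the client inside `foreachPartition` (rather than in the driver) is the usual way to avoid serializing a connection, and it amortizes connection setup over all records in the partition.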
Whatever you do, be careful that the time to process each batch of data, including I/O, stays well under the batch interval you specify. Otherwise, batches queue up behind one another and the application may eventually crash.
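For context, the interval in question is the batch interval passed to the `StreamingContext`; the window and slide durations are set on the stream itself. A sketch with illustrative durations and a hypothetical socket source:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("RecommendationModel")
// Incoming requests are batched every 10 seconds; all work for a
// batch (model update plus writes) must finish within that time.
val ssc = new StreamingContext(conf, Seconds(10))

// Hypothetical request stream; recompute over the last 60 seconds
// of data, sliding forward every 20 seconds. Both durations must be
// multiples of the batch interval.
val requests = ssc.socketTextStream("stream-host", 9999)
val windowed = requests.window(Seconds(60), Seconds(20))
```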
One other thing to watch for is the case where your model data sits in more than one RDD partition spread over the cluster (which is the default, of course). If the order of your "records" doesn't matter, then writing them out in parallel is fine. If you need a specific total order written out sequentially (and the data really isn't large), call `collect` to bring them into one in-memory data structure inside your driver code (which will mean network traffic in a distributed job), then write from there.
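A sketch of that collect-then-write pattern, assuming the records are key/value pairs with a sortable key and that `writeToStore` is a placeholder for whatever store you choose:

```scala
// Sort across all partitions, then pull the (small) result back
// to the driver as a single in-memory array.
val ordered = modelRdd.sortByKey().collect()

// Write sequentially from the driver, preserving the total order.
ordered.foreach { case (key, value) => writeToStore(key, value) }
```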
Source: https://stackoverflow.com/questions/29459659/load-spark-data-into-mongo-memcached-for-use-by-a-webservice