I build a RDD from a list of urls, and then try to fetch datas with some async http call. I need all the results before doing other calculs. Ideally, I need to make the http
this use case looks pretty common
Not really, because it simply doesn't work as you (probably) expect. Since each task operates on standard Scala Iterators these operations will be squashed together. It means that all operations will be blocking in practice. Assuming you have three URLs ["x", "y", "z"] you code will be executed in a following order:
Await.result(httpCall("x", 10.seconds))
Await.result(httpCall("y", 10.seconds))
Await.result(httpCall("z", 10.seconds))
You can easily reproduce the same behavior locally. If you want to execute your code asynchronously you should handle this explicitly using mapPartitions:
rdd.mapPartitions(iter => {
??? // Submit requests
??? // Wait until all requests completed and return Iterator of results
})
but this is relatively tricky. There is no guarantee all data for a given partition fits into memory so you'll probably need some batching mechanism as well.
All of that being said I couldn't reproduce the problem you've described to is can be some configuration issue or a problem with httpCall itself.
On a side note allowing a single timeout to kill whole task doesn't look like a good idea.