Spark fastest way for creating RDD of numpy arrays


长情又很酷 2020-12-18 11:28

My Spark application uses RDDs of numpy arrays.
At the moment, I'm reading my data from AWS S3, and it's represented as a simple text file where each line is a ve

3 answers

鱼传尺愫 (original poster)

2020-12-18 11:40
It would be a little bit more idiomatic and slightly faster to simply map with numpy.fromstring as follows:
```
import numpy as np

path = ...
initial_num_of_partitions = ...

data = (sc.textFile(path, initial_num_of_partitions)
   .map(lambda s: np.fromstring(s, dtype=np.float64, sep=" ")))
```
But apart from that, there is nothing particularly wrong with your approach. As far as I can tell, with a basic configuration, it is roughly twice as slow as simply reading the data, and slightly slower than creating dummy numpy arrays.

So it looks like the problem is somewhere else. It could be cluster misconfiguration, cost of fetching data from S3 or even unrealistic expectations.
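
To check whether parsing itself is the bottleneck, it may help to time the per-line conversion locally, without Spark or S3 involved. The sketch below is only an illustration: the synthetic input and `parse_split` (a hypothetical split-based baseline, not necessarily what the question uses) are assumptions, and the timings will vary by machine.

```python
import timeit

import numpy as np


def parse_fromstring(line):
    # Same conversion as in the map() above: text mode, space-separated.
    return np.fromstring(line, dtype=np.float64, sep=" ")


def parse_split(line):
    # Hypothetical baseline for comparison: split the line and convert each token.
    return np.array([float(tok) for tok in line.split(" ")], dtype=np.float64)


# Synthetic stand-in for the text file: 1000 lines of 100 floats each (assumed shape).
rng = np.random.default_rng(0)
lines = [" ".join(repr(float(x)) for x in rng.random(100)) for _ in range(1000)]

# Time a few passes of each parser over all lines.
t_fromstring = timeit.timeit(lambda: [parse_fromstring(s) for s in lines], number=5)
t_split = timeit.timeit(lambda: [parse_split(s) for s in lines], number=5)

print(f"np.fromstring:  {t_fromstring:.3f} s")
print(f"split + float:  {t_split:.3f} s")
```

If both variants take a small fraction of the job's total runtime, the time is likely being spent elsewhere, for example in fetching the data or in scheduling.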