Spark: fastest way to create an RDD of numpy arrays

长情又很酷 · asked 2020-12-18 11:28

My Spark application uses RDDs of numpy arrays.
At the moment, I'm reading my data from AWS S3, where it's represented as a simple text file in which each line is a vector.
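
For illustration, a minimal sketch of this setup with a straightforward per-line parse; the bucket path and the space-delimited format are assumptions, not taken from the question:

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="numpy-rdd")

# Hypothetical S3 path; each line is assumed to be a space-delimited vector.
raw = sc.textFile("s3a://my-bucket/vectors.txt")

# Straightforward but slow: one Python-level parse call per line.
vectors = raw.map(lambda line: np.array(line.split(), dtype=np.float64))
```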

3 Answers
  •  别那么骄傲 · 2020-12-18 12:01

    The best thing to do in these circumstances is to use the pandas library for I/O.
    Please refer to this question: pandas read_csv() and python iterator as input.
    There you will see how to replace the np.loadtxt() function so that building an
    RDD of numpy arrays is much faster.
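
    A minimal sketch of that idea, assuming the same space-delimited format as above:
    hand each whole partition to pandas' C parser via mapPartitions() instead of
    parsing line by line. The path and the parse_partition helper name are illustrative:

```python
from io import StringIO

import numpy as np
import pandas as pd
from pyspark import SparkContext

sc = SparkContext(appName="numpy-rdd")

def parse_partition(lines):
    # Join the partition's text lines into one buffer and let pandas'
    # C parser handle them in a single call, instead of invoking
    # np.loadtxt() once per line.
    text = "\n".join(lines)
    if not text:  # skip empty partitions
        return
    df = pd.read_csv(StringIO(text), sep=" ", header=None, dtype=np.float64)
    # Yield one numpy array per original input line.
    for row in df.values:
        yield row

raw = sc.textFile("s3a://my-bucket/vectors.txt")  # hypothetical path
vectors = raw.mapPartitions(parse_partition)
print(vectors.first())
```

    Batching by partition amortizes the parser's start-up cost over many lines,
    which is where the speedup over per-line np.loadtxt() comes from.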
