Reading binary data into a (Py)Spark DataFrame

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-11 13:55:09

Question


I'm ingesting a binary file into Spark. The file structure is simple: it consists of a series of records, and each record holds a number of floats. At the moment, I read the data in chunks in Python and then iterate through the individual records to turn them into Row objects that Spark can use to construct a DataFrame. This is very inefficient because, instead of processing the data a chunk at a time, it forces me to loop over every element in Python.
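Roughly, the current code looks like this minimal sketch (simplified from the gist linked below; the file name "data.bin", the 4-float record layout, and the little-endian float32 encoding are placeholders, not details from my actual data):

```python
import struct

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

n_floats = 4                               # assumed floats per record
record_size = 4 * n_floats                 # float32 is 4 bytes
Record = Row(*["f%d" % i for i in range(n_floats)])

rows = []
with open("data.bin", "rb") as f:          # hypothetical input file
    while True:
        chunk = f.read(record_size * 10240)   # read a chunk of records
        if not chunk:
            break
        # this per-record loop is what makes the approach slow
        for off in range(0, len(chunk), record_size):
            vals = struct.unpack("<%df" % n_floats, chunk[off:off + record_size])
            rows.append(Record(*vals))

df = spark.createDataFrame(rows)
```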

Is there an obvious (preferred) way of ingesting data like this? Ideally, I would be able to read a chunk of the file (say 10240 records or so) into a buffer, specify the schema, and turn that buffer into a DataFrame directly. I don't see a way of doing this with the current API, but maybe I'm missing something?
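As a sketch of that buffer-oriented idea (under the same assumed record layout, and not an API I've found in Spark itself), each chunk could be decoded in one vectorized NumPy call and handed to createDataFrame with an explicit schema:

```python
import numpy as np

from pyspark.sql import SparkSession
from pyspark.sql.types import FloatType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

n_floats = 4                               # assumed floats per record
record_size = 4 * n_floats
schema = StructType(
    [StructField("f%d" % i, FloatType(), False) for i in range(n_floats)]
)

batches = []
with open("data.bin", "rb") as f:          # hypothetical input file
    while True:
        buf = f.read(record_size * 10240)  # ~10240 records per read
        if not buf:
            break
        # assumes the file contains only whole records
        batches.append(np.frombuffer(buf, dtype="<f4").reshape(-1, n_floats))

# one C-level conversion of the whole dataset, no per-element Python loop
df = spark.createDataFrame(np.vstack(batches).tolist(), schema=schema)
```

This removes the Python loop, but it still funnels all of the data through the driver.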

Here's an example notebook that demonstrates the current procedure: https://gist.github.com/rokroskar/bc0b4713214bb9b1e5ed

Ideally, I would be able to get rid of the for loop over buf in read_batches and just convert the entire batch into an array of Row objects directly.
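One direction that might achieve this, if the records are fixed-length, is SparkContext.binaryRecords (available since Spark 1.2), which splits a file into one byte string per record across the cluster; the decoding then happens in a distributed map rather than in a for loop on the driver. The path and column names here are illustrative assumptions:

```python
import numpy as np

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

n_floats = 4                               # assumed floats per record
record_size = 4 * n_floats                 # float32 records

# one byte string of length record_size per record, read in parallel
raw = spark.sparkContext.binaryRecords("hdfs:///data.bin", record_size)
df = (
    raw.map(lambda rec: tuple(np.frombuffer(rec, dtype="<f4").tolist()))
       .toDF(["f%d" % i for i in range(n_floats)])
)
```

The per-record decoding still runs in Python, but it executes in parallel on the executors rather than serially on the driver, and no intermediate list of Row objects is materialized on one machine.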

Source: https://stackoverflow.com/questions/33519796/reading-binary-data-into-py-spark-dataframe
