Azure Databricks: How to read part files and save them as one file to blob?


Question


I am using PySpark to write a DataFrame to a folder in blob storage, where it gets saved as part files:

df.write.format("json").save("/mnt/path/DataModel")

The files are saved as multiple part files (part-000*.json).

I am using the following code to merge them into one file:

import glob
import shutil

# Read the part files
path = glob.glob("/dbfs/mnt/path/DataModel/part-000*.json")

# Move the files to the FinalData folder in blob
for file in path:
    shutil.move(file, "/dbfs/mnt/path/FinalData/FinalData.json")

But FinalData.json only contains the data of the last part file, not the data of all the part files.


Answer 1:


I see you want to simply merge the content of these files into one file, but per the description of the shutil.move function in the Python documentation, it behaves like the Linux mv command: each move overwrites the destination, so the content of the last file covers the content of the previous files.
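If you still want to merge on the driver side, one option is to append each part file to the destination instead of moving it. Here is a minimal sketch (not from the original answer), reusing the same mount paths as above; it works because Spark writes JSON as newline-delimited records, so plain concatenation yields a valid JSON Lines file:

import glob
import shutil

# Collect the part files in a deterministic order
part_files = sorted(glob.glob("/dbfs/mnt/path/DataModel/part-000*.json"))

# Append each part file to one output file instead of overwriting it
with open("/dbfs/mnt/path/FinalData/FinalData.json", "wb") as merged:
    for part in part_files:
        with open(part, "rb") as src:
            shutil.copyfileobj(src, merged)  # stream-copy, low memory use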

And the reason the code writes multiple files is that Spark works on HDFS-style storage, so more than 128 MB (the default HDFS block size) of data written to HDFS will generate multiple files named with the part- prefix; please refer to What is Small file problem in HDFS?.
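As an aside (not something the original answer covered), if you want Spark itself to emit a single part file, a common option is to coalesce the DataFrame to one partition before writing. Spark still creates a directory containing that one part file plus metadata files; the output path below is hypothetical:

# coalesce(1) forces a single partition, so Spark writes exactly one
# part-0000... file (inside a directory, as Spark always does)
df.coalesce(1).write.format("json").save("/mnt/path/DataModelSingle")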

A workaround that satisfies your needs is to convert the PySpark DataFrame to a Pandas DataFrame and then use the Pandas to_json function to write a single JSON file.

Here is my sample code.

df.toPandas().to_json('/dbfs/mnt/path/FinalData/FinalData.json')
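Note that toPandas() collects the entire DataFrame onto the driver, so this approach is only suitable when the data fits in driver memory. Also, Pandas to_json defaults to a column-oriented layout; if you want output matching Spark's newline-delimited JSON, pass orient='records', lines=True.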

And then check if the file exists.

import os
os.path.isfile('/dbfs/mnt/path/FinalData/FinalData.json')

Or

dbutils.fs.ls('dbfs:/mnt/path/')


For your other question: to read the part files with PySpark, pass a wildcard path to the spark.read.json() function, as in the code below.

spark.read.json('dbfs:/mnt/path/DataModel/part-*.json')


Source: https://stackoverflow.com/questions/58964745/azure-data-bricks-how-to-read-part-files-and-save-it-as-one-file-to-blob
