Azure Databricks: How to read part files and save them as one file to blob?


Question


I am using PySpark to write a DataFrame to a folder in blob storage, where it gets saved as part files:

df.write.format("json").save("/mnt/path/DataModel")

The files are saved as multiple part files (part-000*.json).

I am using the following code to merge them into one file:

import glob
import shutil

# Read the part files
path = glob.glob("/dbfs/mnt/path/DataModel/part-000*.json")

# Move the files to the FinalData folder in blob
for file in path:
    shutil.move(file, "/dbfs/mnt/path/FinalData/FinalData.json")

But FinalData.json only contains the data of the last part file, not the data of all the part files.


Answer 1:


I see you want to simply merge the content of these files into one file, but per the description of the shutil.move function in the Python documentation, it behaves like the Linux mv command: each move overwrites the destination, so the content of the last file covers the content of the previous files.
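If you still want to merge on the driver side, one option is to append each part file to the destination instead of moving it. Here is a minimal sketch (not from the original answer), reusing the same mount paths as above; it works because Spark writes JSON as newline-delimited records, so plain concatenation yields a valid JSON Lines file:

import glob
import shutil

# Collect the part files in a deterministic order
part_files = sorted(glob.glob("/dbfs/mnt/path/DataModel/part-000*.json"))

# Append each part file to one output file instead of overwriting it
with open("/dbfs/mnt/path/FinalData/FinalData.json", "wb") as merged:
    for part in part_files:
        with open(part, "rb") as src:
            shutil.copyfileobj(src, merged)  # stream-copy, low memory use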

And the reason the code writes multiple files is that Spark works on HDFS-style storage, so more than 128 MB (the default HDFS block size) of data written to HDFS will generate multiple files named with the part- prefix; please refer to What is Small file problem in HDFS?.
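As an aside (not something the original answer covered), if you want Spark itself to emit a single part file, a common option is to coalesce the DataFrame to one partition before writing. Spark still creates a directory containing that one part file plus metadata files; the output path below is hypothetical:

# coalesce(1) forces a single partition, so Spark writes exactly one
# part-0000... file (inside a directory, as Spark always does)
df.coalesce(1).write.format("json").save("/mnt/path/DataModelSingle")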

A workaround that satisfies your needs is to convert the PySpark DataFrame to a Pandas DataFrame and then use the Pandas to_json function to write a single JSON file.

Here is my sample code.

df.toPandas().to_json('/dbfs/mnt/path/FinalData/FinalData.json')
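Note that toPandas() collects the entire DataFrame onto the driver, so this approach is only suitable when the data fits in driver memory. Also, Pandas to_json defaults to a column-oriented layout; if you want output matching Spark's newline-delimited JSON, pass orient='records', lines=True.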

And then check if the file exists.

import os
os.path.isfile('/dbfs/mnt/path/FinalData/FinalData.json')

Or

dbutils.fs.ls('dbfs:/mnt/path/')


For your other question: to read the part files with PySpark, pass a wildcard path to the spark.read.json() function, as in the code below.

spark.read.json('dbfs:/mnt/path/DataModel/part-*.json')


Source: https://stackoverflow.com/questions/58964745/azure-data-bricks-how-to-read-part-files-and-save-it-as-one-file-to-blob
