pyspark split dataframe by two columns without creating a folder structure for the 2nd

[亡魂溺海] Submitted on 2021-02-08 11:05:56

Question


Two part question.

I have a PySpark dataframe that I'm reading from a list of JSON files in my Azure Blob Storage.

After some simple ETL I need to move it from blob storage to a data lake as a Parquet file. Simple so far.

I'm unsuccessfully trying to efficiently write it into a folder structure defined by two columns, one a date column and the other an ID.

Using partitionBy gets me close:

id | date                | nested_json_data | path
1  | 2019-01-01 12:01:01 | {data : [data]}  | dbfs:\mnt\..
1  | 2019-01-01 12:01:02 | {data : [data]}  | dbfs:\mnt\..


df.write.partitionBy("id", "date").parquet('dbfs:/mnt/curated/...')

This gives me a folder structure as follows:

mnt --- |
         -- ...\1--|
                    --...\1\date=2019-01-01%2012:01:01\{file}.pq
                    --...\1\date=2019-01-01%2012:01:02\{file}.pq

What I'm after is a single folder for each unique id, with each file split out by date:

mnt --- |
         -- ...\1--|
                    --...\1 -- |
                               |filename_2019_01_01_12_0_01.pq
                               |filename_2019_01_01_12_0_02.pq

The second question is that my date folder name is coming out like date=2019-12-23 13%3A26%3A00.

Is there a way to avoid that without changing the schema of my Spark dataframe? If I have to create a temp column, that's fine.


Answer 1:


You can't do that using only partitionBy with Spark. As far as I know, Spark always writes partitions as column=value folders. See Spark partition discovery.

What you can do is write with partitionBy("id", "date") into a temporary folder, then recursively list the files and move/rename them to get the structure you want.

For the date format, you can transform it before writing:

from pyspark.sql.functions import col, date_format

df = df.withColumn("date", date_format(col("date"), "yyyy_MM_dd_HH_mm_ss"))
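
With the formatted column in place, the staged write could look like this (the /tmp/folder/ staging path is just an example; it matches the listing snippet below):

df.write.mode("overwrite").partitionBy("id", "date").parquet("/tmp/folder/")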

Here is some code that recursively lists all part files in the temporary folder so you can copy them to the destination folder with a rename. It uses the Hadoop FileSystem API through the JVM gateway; you can adapt it for your case:

# sc is the active SparkContext; get the Hadoop config and FS classes via the JVM gateway
conf = sc._jsc.hadoopConfiguration()
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileUtil = sc._gateway.jvm.org.apache.hadoop.fs.FileUtil


# recursively list all files under the staging folder
staging_folder_path = Path("/tmp/folder/")
fs_staging_f = staging_folder_path.getFileSystem(conf)
staging_files = fs_staging_f.listFiles(staging_folder_path, True)  # True = recursive

# keep only the data files, whose names start with 'part'
part_files = []
while staging_files.hasNext():
    part_file_path = staging_files.next().getPath()
    if part_file_path.getName().startswith("part"):
        part_files.append(part_file_path)

# add some logic here to flatten the date folders and add the date pattern to filenames

# move files
FileUtil.copy(....)
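
Here is a minimal sketch of that missing flatten/rename step, assuming the staged layout is id=&lt;id&gt;/date=&lt;date&gt;/part-*.parquet and a hypothetical destination root /mnt/curated/ (it also assumes one part file per partition, otherwise the target names would collide):

# hypothetical destination root; adjust to your mount point
dest_root = "/mnt/curated/"

for part_file_path in part_files:
    # staged layout: .../id=<id>/date=<date>/part-....parquet
    date_value = part_file_path.getParent().getName().split("=", 1)[1]
    id_value = part_file_path.getParent().getParent().getName().split("=", 1)[1]

    # flatten to <id>/filename_<date>.pq
    dest_path = Path(dest_root + id_value + "/filename_" + date_value + ".pq")
    dest_fs = dest_path.getFileSystem(conf)

    # FileUtil.copy(srcFS, src, dstFS, dst, deleteSource, conf)
    FileUtil.copy(fs_staging_f, part_file_path, dest_fs, dest_path, False, conf)

Pass True for deleteSource if you want a move instead of a copy, and remember to clean up the staging folder afterwards.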


Source: https://stackoverflow.com/questions/59534674/pyspark-split-dataframe-by-two-columns-without-creating-a-folder-structure-for-t
