Question
Two-part question.
I have a PySpark dataframe that I'm reading from a list of JSON files in Azure Blob Storage. After some simple ETL I need to move this from blob storage to a data lake as a Parquet file; simple so far.
I'm unsuccessfully trying to efficiently write this into a folder structure defined by two columns, one a date column and the other an ID. Using partitionBy gets me close:
id | date | nested_json_data | path
1 | 2019-01-01 12:01:01 | {data : [data]} | dbfs:\mnt\..
1 | 2019-01-01 12:01:02 | {data : [data]} | dbfs:\mnt\..
df.write.partitionBy("id", "date").parquet('dbfs:\mnt\curated\...')
this gives me a folder structure as follows:
mnt ---|
   -- ...\1 ---|
        -- ...\1\date=2019-01-01%2012:01:01\{file}.pq
        -- ...\1\date=2019-01-01%2012:01:02\{file}.pq
What I'm after is a single folder for each unique id, with each file inside it split out by date:
mnt ---|
   -- ...\1 ---|
        |filename_2019_01_01_12_01_01.pq
        |filename_2019_01_01_12_01_02.pq
My second question is that my date folder name comes out like Date=2019-12-23 13%3A26%3A00.
Is there a method to avoid that without changing the schema of my Spark dataframe? If I have to create a temp column, that's fine.
Answer 1:
You can't do that using only partitionBy with Spark. As far as I know, Spark always writes partitions as partition=value folders. See Spark partition discovery.
What you can do instead is write with partitionBy("id", "date") to a temporary folder, then recursively list the files and move/rename them to get the structure you want.
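A minimal sketch of that staging write, assuming the same /tmp/folder/ staging location used in the listing code further down:
# Write the partitioned output to a temporary staging location first;
# "/tmp/folder/" here is just a placeholder staging path.
(df.write
    .mode("overwrite")
    .partitionBy("id", "date")
    .parquet("/tmp/folder/"))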
For the date format, you can transform it before writing:
from pyspark.sql.functions import col, date_format

df = df.withColumn("date", date_format(col("date"), "yyyy_MM_dd_HH_mm_ss"))
Here is some code to recursively list all part files in the temporary folder and copy them to the destination folder with a rename. It uses the Hadoop FS API; you can adapt it to your case:
conf = sc._jsc.hadoopConfiguration()
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileUtil = sc._gateway.jvm.org.apache.hadoop.fs.FileUtil
# list all files from staging folder
staging_folder_path = Path("/tmp/folder/")
fs_staging_f = staging_folder_path.getFileSystem(conf)
staging_files = fs_staging_f.listFiles(staging_folder_path, True)
# filter files whose names start with "part"
part_files = []
while staging_files.hasNext():
    part_file_path = staging_files.next().getPath()
    if part_file_path.getName().startswith("part"):
        part_files.append(part_file_path)
# add some logic here to flatten the date folders and add the date pattern to filenames (one possible sketch follows below)
# move files
FileUtil.copy(....)
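As one possible way to fill in that placeholder, here is a hedged sketch of the flatten/rename step. The destination root, the filename_ prefix, and the string handling below are illustrative assumptions, not part of the original answer:
from urllib.parse import unquote

# Hypothetical destination root in the data lake; adjust to your own mount.
dest_root = "/mnt/curated/output"

for part_file_path in part_files:
    date_dir = part_file_path.getParent()   # e.g. .../id=1/date=2019-01-01 12%3A01%3A01
    id_dir = date_dir.getParent()           # e.g. .../id=1

    # strip the "key=" prefixes and URL-decode the encoded characters in the date
    date_value = unquote(date_dir.getName().split("=", 1)[1])
    id_value = id_dir.getName().split("=", 1)[1]

    # build a flat filename such as filename_2019_01_01_12_01_01.pq
    safe_date = date_value.replace("-", "_").replace(" ", "_").replace(":", "_")
    dest_dir = Path("{}/{}".format(dest_root, id_value))
    dest_file = Path(dest_dir, "filename_{}.pq".format(safe_date))

    fs_dest = dest_dir.getFileSystem(conf)
    fs_dest.mkdirs(dest_dir)
    # pass True as the fifth argument to delete the source (move instead of copy)
    FileUtil.copy(fs_staging_f, part_file_path, fs_dest, dest_file, False, conf)
Note that if you apply the date_format transformation shown above before writing, the partition folder names already use underscores and the replace chain becomes a no-op.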
Source: https://stackoverflow.com/questions/59534674/pyspark-split-dataframe-by-two-columns-without-creating-a-folder-structure-for-t