Convert folder structure to partitions on S3 using Spark


Question


I have a lot of data on S3 that sits in plain folders instead of partitions. The structure looks like this:

# s3://bucket/countryname/year/weeknumber/a.csv

s3://Countries/Canada/2019/20/part-1.csv
s3://Countries/Canada/2019/20/part-2.csv
s3://Countries/Canada/2019/20/part-3.csv

s3://Countries/Canada/2019/21/part-1.csv
s3://Countries/Canada/2019/21/part-2.csv

Is there any way to convert that data into partitions? Something like this:

s3://Countries/Country=Canada/Year=2019/Week=20/part-1.csv
s3://Countries/Country=Canada/Year=2019/Week=20/part-2.csv
s3://Countries/Country=Canada/Year=2019/Week=20/part-3.csv

s3://Countries/Country=Canada/Year=2019/Week=21/part-1.csv
s3://Countries/Country=Canada/Year=2019/Week=21/part-2.csv

I have no clue how to do this, other than a for loop that iterates over all the folders and loads the data, which is complex.

Any help will be appreciated.


Answer 1:


Hive-style paths aren't always necessary for partitioning. I got to this question from another question you posted in the context of Athena, so I'm going to guess that the underlying metastore is in fact Glue, and that you're really targeting Athena (I added the amazon-athena tag to your question).

In Presto, or Athena/Glue, you can add partitions for any kind of path, as long as the prefixes don't overlap. For example, to add the partitions from your first example you would do this:

ALTER TABLE table_name ADD IF NOT EXISTS
  PARTITION (country = 'Canada', year_week = '2019-20') LOCATION 's3://Countries/Canada/2019/20/'
  PARTITION (country = 'Canada', year_week = '2019-21') LOCATION 's3://Countries/Canada/2019/21/'

This assumes the table has a year_week partition column, but you could have year and week as separate columns if you want (and write (country = 'Canada', year = '2019', week = '20') instead); either works.
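Spelled out, the separate-columns variant is the same statement with a different partition spec (a sketch; it assumes the table was declared with country, year, and week as its partition columns):

ALTER TABLE table_name ADD IF NOT EXISTS
  PARTITION (country = 'Canada', year = '2019', week = '20') LOCATION 's3://Countries/Canada/2019/20/'
  PARTITION (country = 'Canada', year = '2019', week = '21') LOCATION 's3://Countries/Canada/2019/21/'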


Why do almost all Athena examples use Hive-style paths (e.g. country=Canada/year=2019/week=20/part-1.csv)? Part of it is historical: IIRC Hive doesn't support any other scheme; partitioning and paths are tightly coupled there. Another reason is that the Athena/Presto command MSCK REPAIR TABLE works only with that style of partitioning (but you want to avoid relying on that command anyway). There are also other tools that assume that style, or work with it and no other. If you aren't using those, then it doesn't matter.
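For reference, that command takes nothing but the table name (table_name is the same placeholder as above):

MSCK REPAIR TABLE table_name

It scans the table's location on S3 and registers any partitions it finds, but only when the paths follow the key=value convention, which is one reason not to build a workflow around it.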


If you absolutely must use Hive-style partitioning, there is a feature that lets you create "symlinks" to files in a separate path structure. You can find instructions on how to do it here: https://stackoverflow.com/a/55069330/1109 – but keep in mind that this means you'll have to keep that other path structure up to date. If you don't have to use Hive-style paths for your partitions, I would advise that you don't bother with the added complexity.
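As an unverified sketch of the approach in that linked answer: it relies on Hive's SymlinkTextInputFormat, where the table points at manifest files rather than at the data itself (the column names and manifest layout below are made up for illustration):

CREATE EXTERNAL TABLE table_name (col1 string, col2 string)  -- hypothetical schema
PARTITIONED BY (country string, year string, week string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://Countries/symlinks/'

-- Each Hive-style partition directory, e.g.
-- s3://Countries/symlinks/country=Canada/year=2019/week=20/,
-- then holds a plain-text manifest listing the real data files:
--   s3://Countries/Canada/2019/20/part-1.csv
--   s3://Countries/Canada/2019/20/part-2.csv

Every time files land under the original layout, the matching manifest has to be updated, which is the maintenance burden mentioned above.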



Source: https://stackoverflow.com/questions/57287621/convert-folders-structure-to-partitions-on-s3-using-spark
