Can I read multiple files into a Spark Dataframe from S3, passing over nonexistent ones?

忘掉有多难 2020-12-08 12:46

I would like to read multiple parquet files into a dataframe from S3. Currently, I'm using the following method to do this:

    files = ['s3a://dev/2017/01/03/
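The snippet above is cut off in the original post. As a hedged reconstruction only, reading an explicit list of paths in one call presumably looks like the sketch below; the data.parquet file names are assumptions borrowed from the answer, and spark is assumed to be an active SparkSession:

    files = ['s3a://dev/2017/01/03/data.parquet',
             's3a://dev/2017/01/02/data.parquet']

    # Passing several paths in one call reads them all into a single
    # DataFrame, but the read fails as soon as any one path does not exist.
    df = spark.read.parquet(*files)

That hard failure on a missing file is exactly the behaviour the question wants to avoid.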
3 Answers
  •  隐瞒了意图╮
    2020-12-08 13:10

    A solution using union:

    files = ['s3a://dev/2017/01/03/data.parquet',
             's3a://dev/2017/01/02/data.parquet']

    for i, file in enumerate(files):
        # Read each file on its own, then append it to the running result.
        act_df = spark.read.parquet(file)
        if i == 0:
            df = act_df
        else:
            df = df.union(act_df)


    An advantage is that this works regardless of any file-naming pattern, since every path is listed explicitly instead of being matched by a glob.
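    As written, though, the loop still fails on a nonexistent path, which is the case the question asks about. A minimal sketch of how to pass over missing files, assuming spark is an active SparkSession and that reading a missing S3 path raises PySpark's AnalysisException ("Path does not exist"):

    from functools import reduce
    from pyspark.sql.utils import AnalysisException

    files = ['s3a://dev/2017/01/03/data.parquet',
             's3a://dev/2017/01/02/data.parquet']

    dfs = []
    for path in files:
        try:
            dfs.append(spark.read.parquet(path))
        except AnalysisException:
            # The path does not exist (or could not be read): skip it.
            pass

    # Concatenate whatever was read; this assumes all files share a schema.
    df = reduce(lambda a, b: a.union(b), dfs)

    Using reduce avoids the i == 0 special case. Note that if every file is missing, dfs is empty and reduce raises a TypeError, so a guard may be worth adding; inspecting the exception message before skipping would also keep genuine read errors from being silently swallowed.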
