Question
I am using AWS Glue jobs to back up DynamoDB tables to S3 in Parquet format so that they can be queried in Athena.
If I want to use these Parquet files in S3 to restore the table in DynamoDB, this is what I am thinking: read each Parquet file, convert it to JSON, and then insert the JSON-formatted data into DynamoDB (using PySpark along the lines below).
# sqlContext is the SQLContext already available in the Glue/PySpark job
parquetFile = sqlContext.read.parquet(input_file)
parquetFile.write.json(output_path)
Then convert the plain JSON into the DynamoDB-typed JSON using https://github.com/Alonreznik/dynamodb-json.
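For reference, a minimal sketch of that conversion plus the insert, assuming the Spark output is newline-delimited JSON, a placeholder file name part-00000.json, and a placeholder table name restored-table (the json_util.dumps call is what the dynamodb-json README shows):

import json

import boto3
from dynamodb_json import json_util

client = boto3.client("dynamodb")

with open("part-00000.json") as f:                     # one of the files written by parquetFile.write.json
    for line in f:
        plain_item = json.loads(line)                  # plain JSON document
        # convert to DynamoDB's typed attribute-value format
        typed_item = json.loads(json_util.dumps(plain_item))
        client.put_item(TableName="restored-table", Item=typed_item)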
Does this approach sound right? Are there any other alternatives to this approach?
Answer 1:
Your approach will work, but you can also write directly to DynamoDB from Spark. You just need to import a few JARs when you run pyspark. Have a look at this:
https://github.com/audienceproject/spark-dynamodb
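As a rough sketch of what that could look like (the "dynamodb" format name and the tableName option follow the spark-dynamodb README; the package coordinates/version and the table name restored-table are placeholders you would need to verify):

# launch with the connector on the classpath, e.g.
#   pyspark --packages com.audienceproject:spark-dynamodb_2.12:1.1.2   (version is a placeholder)

parquetFile = spark.read.parquet(input_file)

# write the DataFrame straight into the target DynamoDB table
parquetFile.write.option("tableName", "restored-table") \
           .format("dynamodb") \
           .save()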
Hope this helps.
Answer 2:
You can use AWS Glue to convert the Parquet files directly into JSON, then create a Lambda function that triggers on the S3 put and loads the records into DynamoDB.
https://medium.com/searce/convert-csv-json-files-to-apache-parquet-using-aws-glue-a760d177b45f
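A rough sketch of the Lambda side, assuming the Glue job writes newline-delimited JSON and the target table is named restored-table (both placeholders); note that DynamoDB rejects Python floats, so numbers are parsed as Decimal:

import json
from decimal import Decimal

import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("restored-table")  # placeholder table name

def lambda_handler(event, context):
    # the S3 put event lists the JSON object(s) that were just written
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        # Spark writes one JSON document per line; batch_writer handles batching and retries
        with table.batch_writer() as batch:
            for line in body.splitlines():
                if line.strip():
                    item = json.loads(line, parse_float=Decimal)  # DynamoDB needs Decimal, not float
                    batch.put_item(Item=item)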
Source: https://stackoverflow.com/questions/59518748/convert-parquet-to-json-for-dynamodb-import