How do I write a PySpark DataFrame to a DynamoDB table? I did not find much info on this. As per my requirement, I have to write a PySpark DataFrame to a DynamoDB table.
Ram, there's no way to do that directly from PySpark. But if you have pipeline software running, it can be done in a series of steps. Here is how:
Create a temporary Hive table, for example:
CREATE TABLE TEMP(
column1 type,
column2 type...)
STORED AS ORC;
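As a more concrete sketch (the table name, column names, and types here are just placeholders for illustration), the staging table might look like:

CREATE TABLE temp_users (
  user_id STRING,
  score BIGINT,
  updated_at STRING)
STORED AS ORC;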
Run your PySpark job and write your data to it:
dataframe.createOrReplaceTempView("df")
spark.sql("INSERT OVERWRITE TABLE temp SELECT * FROM df")
Create the DynamoDB connector table:
CREATE TABLE TEMPTODYNAMO(
column1 type,
column2 type...)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "temp-to-dynamo",
"dynamodb.column.mapping" = "column1:column1,column2:column2...";
Finally, overwrite that table with your temp table:
INSERT OVERWRITE TABLE TEMPTODYNAMO SELECT * FROM TEMP;
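If the write is large, the EMR connector's write throughput percentage can be tuned in the same Hive session before running that insert; the value below is only an example:

-- Use up to 100% of the DynamoDB table's provisioned write capacity for this job
SET dynamodb.throughput.write.percent = 1.0;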
More info here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/EMR_Hive_Commands.html