How to write PySpark dataframe to DynamoDB table?


Question


How to write a PySpark dataframe to a DynamoDB table? I did not find much information on this. As per my requirement, I have to write a PySpark dataframe to a DynamoDB table. Overall, I need to read from and write to DynamoDB from my PySpark code.

Thanks in advance.


Answer 1:


Ram, there's no way to do that directly from PySpark. If you have pipeline software running, it can be done in a series of steps. Here is how it can be done (a PySpark sketch of the Spark-side part follows the list):

  1. Create a temporary Hive table, e.g.:

    CREATE TABLE TEMP( column1 type, column2 type...) STORED AS ORC;

  2. Run your PySpark job and write your data to it:

    dataframe.createOrReplaceTempView("df")
    spark.sql("INSERT OVERWRITE TABLE temp SELECT * FROM df")

  3. Create the DynamoDB connector table:

    CREATE TABLE TEMPTODYNAMO( column1 type, column2 type...)
    STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
    TBLPROPERTIES (
        "dynamodb.table.name" = "temp-to-dynamo",
        "dynamodb.column.mapping" = "column1:column1,column2:column2..."
    );

  4. Overwrite that table with your temp table

    INSERT OVERWRITE TABLE TEMPTODYNAMO SELECT * FROM TEMP;
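
For completeness, here is a minimal PySpark sketch of the Spark-side part (steps 1 and 2). The table and column names are illustrative placeholders, and it assumes an EMR cluster with Hive support enabled on the SparkSession. Steps 3 and 4 have to run in Hive itself (for example via the hive CLI or an EMR step), because Spark SQL does not accept Hive's STORED BY clause:

    from pyspark.sql import SparkSession

    # Hive support is required so spark.sql() can see the Hive metastore.
    spark = (
        SparkSession.builder
        .appName("stage-df-for-dynamodb")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Step 1: the plain ORC staging table (column names/types are placeholders).
    spark.sql(
        "CREATE TABLE IF NOT EXISTS temp (column1 STRING, column2 BIGINT) "
        "STORED AS ORC"
    )

    # An illustrative dataframe; in practice this is the dataframe
    # you want to land in DynamoDB.
    dataframe = spark.createDataFrame([("a", 1), ("b", 2)], ["column1", "column2"])

    # Step 2: stage the rows into the Hive table.
    dataframe.createOrReplaceTempView("df")
    spark.sql("INSERT OVERWRITE TABLE temp SELECT * FROM df")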

More info here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/EMR_Hive_Commands.html
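
The read path the question asks about works the same way in reverse: a Hive step copies the DynamoDB-backed connector table into the plain staging table, which PySpark can then read as an ordinary Hive table. A minimal sketch, reusing the tables and SparkSession from above:

    # Hive step, run outside Spark (e.g. hive CLI or an EMR step):
    #   INSERT OVERWRITE TABLE TEMP SELECT * FROM TEMPTODYNAMO;

    # Then, from PySpark, read the staged copy like any Hive table.
    df_from_dynamo = spark.sql("SELECT * FROM temp")
    df_from_dynamo.show()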



Source: https://stackoverflow.com/questions/53044026/how-to-write-pyspark-dataframe-to-dynamodb-table
