How to write streaming dataframe to PostgreSQL?

Posted by 回眸只為那壹抹淺笑 on 2019-12-20 04:36:30

Question


I have a streaming DataFrame that I am trying to write into a database. There is documentation for writing an RDD or a DataFrame into Postgres, but I am unable to find examples or documentation on how it is done in Structured Streaming.

I have read the documentation https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#foreachbatch , but I couldn't understand where I would create a JDBC connection and how I would write each batch to the database.

def foreach_batch_function(df, epoch_id):
    # what goes in here?
    pass

view_counts_query = windowed_view_counts.writeStream \
    .outputMode("append") \
    .foreachBatch(foreach_batch_function) \
    .option("truncate", "false") \
    .trigger(processingTime="5 seconds") \
    .start() \
    .awaitTermination()

This function takes a regular DataFrame and writes it into a Postgres table:

def postgres_sink(config, data_frame):
    config.read('/src/config/config.ini')
    dbname = config.get('dbauth', 'dbname')
    dbuser = config.get('dbauth', 'user')
    dbpass = config.get('dbauth', 'password')
    dbhost = config.get('dbauth', 'host')
    dbport = config.get('dbauth', 'port')

    url = "jdbc:postgresql://"+dbhost+":"+dbport+"/"+dbname
    properties = {
        "driver": "org.postgresql.Driver",
        "user": dbuser,
        "password": dbpass
    }

    data_frame.write.jdbc(url=url, table="metrics", mode="append",
                          properties=properties)

Answer 1:


There is really little to be done here beyond what you already have. foreachBatch takes a function (DataFrame, Int) => None, so all you need is a small adapter, and everything else should work just fine:

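def foreach_batch_for_config(config):
    # Returns a two-argument function that foreachBatch can call;
    # epoch_id is accepted but not used by postgres_sink.
    def _(df, epoch_id):
        postgres_sink(config, df)
    return _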

view_counts_query = (windowed_view_counts
    .writeStream
    .outputMode("append")
    .foreachBatch(foreach_batch_for_config(some_config))
    ...
    .start()
    .awaitTermination())

though to be honest, passing a ConfigParser around is a strange idea to begin with. You could adjust the signature and initialize it in place:

def postgres_sink(data_frame, batch_id):
    config = configparser.ConfigParser()
    ...
    data_frame.write.jdbc(...)

and keep the rest as-is. This way you could use your function directly:

...
.foreachBatch(postgres_sink)
...
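
For illustration, here is a minimal sketch of that second variant with the elided parts filled in from the code in the question. The helper name load_jdbc_target and the CONFIG_PATH constant are hypothetical names of my own; the config path, the [dbauth] keys, and the metrics table are copied from the question. It assumes the PostgreSQL JDBC driver is available on the Spark classpath.

```python
import configparser

# Path taken from the question; adjust for your deployment.
CONFIG_PATH = "/src/config/config.ini"


def load_jdbc_target(path=CONFIG_PATH):
    """Read the [dbauth] section and build the JDBC URL and properties.

    Hypothetical helper, not part of Spark or of the original answer.
    """
    config = configparser.ConfigParser()
    # ConfigParser.read() silently skips missing files and returns the
    # list of files it managed to parse, so check the result explicitly.
    if not config.read(path):
        raise FileNotFoundError(path)
    auth = config["dbauth"]
    url = "jdbc:postgresql://{host}:{port}/{dbname}".format(
        host=auth["host"], port=auth["port"], dbname=auth["dbname"])
    properties = {
        "driver": "org.postgresql.Driver",
        "user": auth["user"],
        "password": auth["password"],
    }
    return url, properties


def postgres_sink(data_frame, batch_id):
    # foreachBatch calls this once per micro-batch with a regular
    # (non-streaming) DataFrame, so the batch JDBC writer can be used.
    # batch_id is accepted to match the expected signature but is unused.
    url, properties = load_jdbc_target()
    data_frame.write.jdbc(url=url, table="metrics", mode="append",
                          properties=properties)
```

With this in place, .foreachBatch(postgres_sink) works as shown above. Note that the config file is re-read on every batch, which is cheap at a 5 second trigger, and that the Structured Streaming guide describes foreachBatch as at-least-once by default, so a plain append may write a batch twice after a failure unless you deduplicate using batch_id.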


Source: https://stackoverflow.com/questions/54756840/how-to-write-streaming-dataframe-to-postgresql
