How to refresh a table and do it concurrently?

Posted by 拜拜、爱过 on 2019-12-09 17:33:00

Question


I'm using Spark Streaming 2.1. I'd like to periodically refresh some cached tables (loaded through a Spark-provided DataSource such as Parquet or MySQL, or through a user-defined data source).

  1. How do I refresh the table?

    Suppose I have a table loaded by

    spark.read.format("").load().createTempView("my_table")

    and it is also cached by

    spark.sql("cache table my_table")

    Is the following code enough to refresh the table, so that the next time the table is loaded it is automatically cached again?

    spark.sql("refresh table my_table")

    Or do I have to do that manually with

    spark.table("my_table").unpersist()
    spark.read.format("").load().createOrReplaceTempView("my_table")
    spark.sql("cache table my_table")

  2. Is it safe to refresh the table concurrently?

    By concurrently I mean using a ScheduledThreadPoolExecutor to do the refresh work on a thread other than the main thread.

    What will happen if Spark is using the cached table when I call refresh on it?


Answer 1:


In Spark 2.2.0 you can refresh the metadata of a table that was updated by Hive or some other external tool.

You can do this with the following API:

spark.catalog.refreshTable("my_table")

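This call invalidates and refreshes the cached metadata for that table so that it stays consistent with the underlying data. According to the Spark API documentation for Catalog.refreshTable, if the table is cached, the old cached version is dropped and the new version is cached lazily.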
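For the periodic, off-main-thread part of the question, a minimal sketch of the scheduling side might look like the following. It is an illustration under stated assumptions, not an established Spark pattern: the object name PeriodicRefresh and its schedule method are hypothetical, and the refresh action is passed in as a plain function so the sketch runs without Spark on the classpath. It does not settle whether a refresh is safe while a query is reading the cached table.

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}
import scala.util.control.NonFatal

// Hypothetical helper (not part of Spark): runs a refresh action periodically
// on one background thread.
object PeriodicRefresh {

  // Schedules `refresh` with a fixed delay between the end of one run and the
  // start of the next. A single-threaded scheduler is used so that two
  // refreshes never overlap each other.
  def schedule(refresh: () => Unit, period: Long, unit: TimeUnit): ScheduledExecutorService = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val task = new Runnable {
      override def run(): Unit =
        try refresh()
        catch {
          // An exception escaping run() would cancel all later executions,
          // so non-fatal failures are logged and swallowed here.
          case NonFatal(e) => System.err.println(s"refresh failed: ${e.getMessage}")
        }
    }
    scheduler.scheduleWithFixedDelay(task, period, period, unit)
    scheduler
  }
}
```

With a SparkSession named spark in scope, this could be called as PeriodicRefresh.schedule(() => spark.catalog.refreshTable("my_table"), 60, TimeUnit.SECONDS), where the 60-second period is an arbitrary choice. Call shutdown() on the returned scheduler when the application stops.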




Answer 2:


I had a problem reading a table from Hive through a SparkSession, specifically with the table method, i.e. spark.table(table_name). Every time I wrote the table and then tried to read it, I got this error:

java.io.FileNotFoundException ... The underlying files may have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.

I tried refreshing the table with spark.catalog.refreshTable(table_name) and also through sqlContext; neither worked.

My solution was to write the table and then read it back using:

val usersDF = spark.read.load(s"/path/table_name")

It works fine.

Is this a problem? Maybe the data on HDFS has not been updated yet?



Source: https://stackoverflow.com/questions/45809152/how-to-refresh-a-table-and-do-it-concurrently
