External Table not getting updated from parquet files written by spark streaming

Submitted by 空扰寡人 on 2019-12-11 01:26:33

Question


I am using Spark Streaming to write aggregated output as parquet files to HDFS using SaveMode.Append. I have an external table created like this:

CREATE TABLE if not exists rolluptable
USING org.apache.spark.sql.parquet
OPTIONS (
  path "hdfs:////"
);
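
For context, the write side of such a job might look roughly like the sketch below. This is a minimal illustration only, assuming the Spark 1.x DStream API; the Rollup schema, stream name, and HDFS path are hypothetical placeholders rather than the actual job.

// A minimal sketch of a streaming parquet append (Spark 1.x APIs).
// The Rollup schema, stream name, and path are hypothetical.
import org.apache.spark.sql.{SQLContext, SaveMode}
import org.apache.spark.streaming.dstream.DStream

case class Rollup(key: String, count: Long)

def writeRollups(aggregated: DStream[Rollup], sqlContext: SQLContext): Unit = {
  import sqlContext.implicits._
  aggregated.foreachRDD { rdd =>
    rdd.toDF()
      .write
      .mode(SaveMode.Append)                  // each micro-batch appends new parquet files
      .parquet("hdfs:///path/to/rolluptable") // the directory the external table points to
  }
}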

I was under the impression that, for an external table, queries would also pick up data from newly added parquet files. However, the newly written files do not seem to be picked up.

Dropping and recreating the table every time works, but that is not a solution.

Please suggest how my table can also include the data from newer files.


Answer 1:


Are you reading those tables with Spark? If so, Spark caches Parquet table metadata (since schema discovery can be expensive).

To overcome this, you have two options (see the sketch after this list):

  1. Set the config spark.sql.parquet.cacheMetadata to false.
  2. Refresh the table before the query: sqlContext.refreshTable("my_table")
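
A minimal sketch of both options, assuming a Spark 1.x SQLContext and reusing the rolluptable name from the question; the query itself is only an example:

// Option 1: disable parquet metadata caching globally.
sqlContext.setConf("spark.sql.parquet.cacheMetadata", "false")

// Option 2: invalidate the cached metadata for this table before querying.
sqlContext.refreshTable("rolluptable")
sqlContext.sql("SELECT COUNT(*) FROM rolluptable").show()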

See here for more details: http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-metastore-parquet-table-conversion



Source: https://stackoverflow.com/questions/33812993/external-table-not-getting-updated-from-parquet-files-written-by-spark-streaming
