Question
Consider the following code:
import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._
val path = ...
val dataFrame: DataFrame = ...
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)
dataFrame.createOrReplaceTempView("my_table")
val results = hiveContext.sql(s"select * from my_table")
results.write.mode(SaveMode.Append).partitionBy("my_column").format("orc").save(path)
hiveContext.sql("REFRESH TABLE my_table")
This code is executed twice with the same path but different DataFrames. The first run succeeds, but subsequent runs raise an error:
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://somepath/somefile.snappy.orc
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
I have tried cleaning up the cache and invoking hiveContext.dropTempTable("tableName"), but nothing has any effect. When should REFRESH TABLE tableName be called (before, after, or some other variant) to fix this error?
Answer 1:
For the Googlers:
You can run spark.catalog.refreshTable(tableName) or spark.sql(s"REFRESH TABLE $tableName") just before the write operation. I had the same problem and it fixed it.
spark.catalog.refreshTable(tableName)
df.write.mode(SaveMode.Overwrite).insertInto(tableName)
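For context, here is a slightly fuller, self-contained sketch of that pattern, assuming Spark 2.x with a SparkSession (rather than the older HiveContext) and an already existing, partitioned Hive table named my_table; the sample data and names are illustrative, not from the original code.
import org.apache.spark.sql.{SparkSession, SaveMode}

val spark = SparkSession.builder()
  .appName("refresh-before-write")
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._

// Illustrative data standing in for the real DataFrame.
val df = Seq((1, "a"), (2, "b")).toDF("id", "my_column")

// Refresh Spark's cached metadata and file listing for the table, so files
// rewritten by a previous run no longer trigger a FileNotFoundException.
spark.catalog.refreshTable("my_table")

// Append the new data into the existing partitioned Hive table.
df.write.mode(SaveMode.Append).insertInto("my_table")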
Source: https://stackoverflow.com/questions/49234471/when-to-execute-refresh-table-my-table-in-spark