Question
Spark 2.0 with Hive
Let's say I am trying to write a Spark DataFrame, irisDf, to ORC and save it to the Hive metastore.
In Spark I would do that like this:
irisDf.write.format("orc")
  .mode("overwrite")
  .option("path", "s3://my_bucket/iris/")
  .saveAsTable("my_database.iris")
In sparklyr I can use the spark_write_table function:
data("iris")
iris_spark <- copy_to(sc, iris, name = "iris")
output <- spark_write_table(
  iris_spark,
  name = 'my_database.iris',
  mode = 'overwrite'
)
But this doesn't let me set a path or format.
I can also use spark_write_orc:
spark_write_orc(
  iris_spark,
  path = "s3://my_bucket/iris/",
  mode = "overwrite"
)
but it doesn't have a saveAsTable option.
Now, I CAN use invoke statements to replicate the Spark code:
sdf <- spark_dataframe(iris_spark)
# Drop down to the underlying Java DataFrameWriter and call it directly
writer <- invoke(sdf, "write")
writer %>%
  invoke('format', 'orc') %>%
  invoke('mode', 'overwrite') %>%
  invoke('option', 'path', "s3://my_bucket/iris/") %>%
  invoke('saveAsTable', "my_database.iris")
But I am wondering if there is any way to instead pass the format and path options into spark_write_table, or the saveAsTable option into spark_write_orc?
Answer 1:
path can be set using the options argument, which is equivalent to the options call on the native DataFrameWriter:
spark_write_table(
  iris_spark, name = 'my_database.iris', mode = 'overwrite',
  options = list(path = "s3a://my_bucket/iris/")
)
By default in Spark, this will create a table stored as Parquet at the given path (partition subdirectories can be specified with the partition_by argument).
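For example, a quick sketch that partitions the output by Species, reusing the table name and bucket path assumed in the question:

spark_write_table(
  iris_spark,
  name = 'my_database.iris',
  mode = 'overwrite',
  partition_by = 'Species',  # one subdirectory per species value
  options = list(path = "s3a://my_bucket/iris/")
)

Since a sparklyr connection implements the DBI interface, DBI::dbGetQuery(sc, "DESCRIBE FORMATTED my_database.iris") is a quick way to confirm the location and format the table was registered with.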
As of today there is no such option for format, but an easy workaround is to set the spark.sql.sources.default property (the configuration key behind Spark's default data source), either at runtime
spark_session_config(
  sc, "spark.sql.sources.default", "orc"
)
or when you create a session.
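For the latter, a minimal sketch, assuming a local master (substitute your own connection settings):

library(sparklyr)

# Set the default data source before connecting, so writes that don't
# specify a format explicitly (including saveAsTable) fall back to ORC.
conf <- spark_config()
conf$spark.sql.sources.default <- "orc"
sc <- spark_connect(master = "local", config = conf)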
Source: https://stackoverflow.com/questions/51886236/sparklyr-can-i-pass-format-and-path-options-into-spark-write-table-or-use-savea