Question
Spark 2.0 with Hive
Let's say I am trying to write a Spark DataFrame, irisDf, to ORC and save it to the Hive metastore. In Spark I would do that like this:

irisDf.write.format("orc")
  .mode("overwrite")
  .option("path", "s3://my_bucket/iris/")
  .saveAsTable("my_database.iris")
In sparklyr I can use the spark_write_table function:

data("iris")
iris_spark <- copy_to(sc, iris, name = "iris")
output <- spark_write_table(
  iris_spark,
  name = 'my_database.iris',
  mode = 'overwrite'
)
But this doesn't allow me to set path or format. I can also use spark_write_orc:

spark_write_orc(
  iris_spark,
  path = "s3://my_bucket/iris/",
  mode = "overwrite"
)

but it doesn't have a saveAsTable option.
Now, I CAN use invoke statements to replicate the Spark code:

sdf <- spark_dataframe(iris_spark)
writer <- invoke(sdf, "write")
writer %>%
  invoke('format', 'orc') %>%
  invoke('mode', 'overwrite') %>%
  invoke('option', 'path', "s3://my_bucket/iris/") %>%
  invoke('saveAsTable', "my_database.iris")

But I am wondering if there is any way to instead pass the format and path options into spark_write_table, or the saveAsTable option into spark_write_orc?
Answer 1:

path can be set using the options argument, which is equivalent to the options call on the native DataFrameWriter:
spark_write_table(
  iris_spark,
  name = 'my_database.iris',
  mode = 'overwrite',
  options = list(path = "s3a://my_bucket/iris/")
)
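To confirm the table's location and storage format after the write, you can query the metastore through sparklyr's DBI interface; a minimal sketch (assuming sc is your spark_connection and the write above succeeded):

# Show the table's metadata, including its Location and storage format
DBI::dbGetQuery(sc, "DESCRIBE FORMATTED my_database.iris")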
By default in Spark, this will create a table stored as Parquet at path (partition subdirectories can be specified with the partition_by argument, as sketched below).
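For example, a minimal sketch of a partitioned write (partitioning the iris data by its Species column is an arbitrary choice for illustration):

spark_write_table(
  iris_spark,
  name = 'my_database.iris',
  mode = 'overwrite',
  options = list(path = "s3a://my_bucket/iris/"),
  partition_by = "Species"  # writes Species=<value>/ subdirectories under path
)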
As of today there is no such option for format, but an easy workaround is to set the default data source property, spark.sql.sources.default, either at runtime:

spark_session_config(
  sc, "spark.sql.sources.default", "orc"
)

or when you create the session, as sketched below.
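A minimal sketch of the session-creation variant (the master value is a placeholder for your own cluster):

library(sparklyr)

config <- spark_config()
# Make ORC the default data source for tables written without an explicit format
config$spark.sql.sources.default <- "orc"

sc <- spark_connect(master = "yarn", config = config)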
Source: https://stackoverflow.com/questions/51886236/sparklyr-can-i-pass-format-and-path-options-into-spark-write-table-or-use-savea