Unable to configure ORC properties in Spark

Posted by 江枫思渺然 on 2019-12-30 03:34:06

Question


I am using Spark 1.6 (Cloudera 5.8.2) and tried the methods below to configure ORC properties, but they do not affect the output.

Below is the code snippet I tried.

 DataFrame dataframe = hiveContext.createDataFrame(rowData, schema);

 dataframe.write().format("orc").options(new HashMap<String, String>() {
     {
         put("orc.compress", "SNAPPY");
         put("hive.exec.orc.default.compress", "SNAPPY");

         put("orc.compress.size", "524288");
         put("hive.exec.orc.default.buffer.size", "524288");

         put("hive.exec.orc.compression.strategy", "COMPRESSION");
     }
 }).save("spark_orc_output");

Apart from this, I also tried setting these properties in hive-site.xml and on the hiveContext object.

Running hive --orcfiledump on the output confirms that the configuration was not applied. The orcfiledump snippet is below.

Compression: ZLIB
Compression size: 262144

Answer 1:


You are making two different errors here. I don't blame you; I've been there...

Issue #1
orc.compress and the rest are not Spark DataFrameWriter options. They are Hive configuration properties that must be defined before creating the hiveContext object...

  • either in the hive-site.xml available to Spark at launch time (a sketch of such an entry follows the code below)
  • or in your code, by re-creating the SparkContext...

 sc.getConf.get("orc.compress","<undefined>") // depends on Hadoop conf
 sc.stop
 val scAlt = new org.apache.spark.SparkContext((new org.apache.spark.SparkConf).set("orc.compress","snappy"))
 scAlt.getConf.get("orc.compress","<undefined>") // will now be Snappy
 val hiveContextAlt = new org.apache.spark.sql.SQLContext(scAlt)

[Edit] with Spark 2.x the script would become...
 spark.sparkContext.getConf.get("orc.compress","<undefined>") // depends on Hadoop conf
 spark.close
 val sparkAlt = org.apache.spark.sql.SparkSession.builder().config("orc.compress","snappy").getOrCreate()
 sparkAlt.sparkContext.getConf.get("orc.compress","<undefined>") // will now be Snappy
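
As for the first route, a minimal hive-site.xml entry for one of the Hive properties from the question would look roughly like this (a sketch; the file has to be on Spark's configuration path at launch time):

 <configuration>
   <property>
     <name>hive.exec.orc.default.compress</name>
     <value>SNAPPY</value>
   </property>
 </configuration>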

Issue #2
Spark uses its own SerDe libraries for ORC (and Parquet, JSON, CSV, etc.), so it does not have to honor the standard Hadoop/Hive properties.

There are some Spark-specific properties for Parquet, and they are well documented. But again, these properties must be set before creating (or re-creating) the hiveContext.
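
For example, spark.sql.parquet.compression.codec is one of those documented Parquet properties; following the same re-creation pattern as above, it would be set on the SparkConf before the context is built (a sketch; the scParquet / sqlContextParquet names are just for illustration)...

 // same re-creation pattern as above, with the documented Parquet codec property
 sc.stop
 val scParquet = new org.apache.spark.SparkContext((new org.apache.spark.SparkConf).set("spark.sql.parquet.compression.codec","snappy"))
 val sqlContextParquet = new org.apache.spark.sql.SQLContext(scParquet)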

For ORC and the other formats, you have to resort to format-specific DataFrameWriter options; quoting the latest JavaDoc...

You can set the following ORC-specific option(s) for writing ORC files:
compression (default snappy): compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, snappy, zlib, and lzo). This will override orc.compress

Note that the default compression codec has changed with Spark 2; before that, it was zlib.

So the only thing you can set is the compression codec, using

dataframe.write().format("orc").option("compression","snappy").save("wtf")
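
If the option is picked up, re-running hive --orcfiledump on the new output should report Compression: SNAPPY rather than ZLIB.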


Source: https://stackoverflow.com/questions/41756775/unable-to-configure-orc-properties-in-spark
