How to suppress parquet log messages in Spark?

Submitted by 流过昼夜 on 2019-11-30 17:16:31

The solution from a SPARK-8118 issue comment seems to work:

You can disable the chatty output by creating a properties file with these contents:

org.apache.parquet.handlers=java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level=SEVERE

And then passing the path of the file to Spark when the application is submitted. Assuming the file lives in /tmp/parquet.logging.properties (of course, that needs to be available on all worker nodes):

spark-submit \
    --conf spark.driver.extraJavaOptions="-Djava.util.logging.config.file=/tmp/parquet.logging.properties" \
    --conf spark.executor.extraJavaOptions="-Djava.util.logging.config.file=/tmp/parquet.logging.properties" \
    ...

Credits go to Justin Bailey.
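The two steps above (writing the properties file, then passing its path on both the driver and executor JVM options) can be scripted. A minimal sketch in Python, assuming the path /tmp/parquet.logging.properties from the answer and that the file is also distributed to all worker nodes:

```python
from pathlib import Path

# Path where the logging config will live; it must exist on every worker node too.
PROPS_PATH = "/tmp/parquet.logging.properties"

# Route Parquet's java.util.logging output through a console handler
# that only emits SEVERE-level messages.
Path(PROPS_PATH).write_text(
    "org.apache.parquet.handlers=java.util.logging.ConsoleHandler\n"
    "java.util.logging.ConsoleHandler.level=SEVERE\n"
)

jvm_opt = f"-Djava.util.logging.config.file={PROPS_PATH}"
submit_args = [
    "spark-submit",
    "--conf", f"spark.driver.extraJavaOptions={jvm_opt}",
    "--conf", f"spark.executor.extraJavaOptions={jvm_opt}",
    # ... your application jar / .py file and its arguments go here
]
print(" ".join(submit_args))
```

The printed command is what you would run on the submitting host; only the two --conf flags are essential.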

I believe this has regressed: there are some large merges/changes being made to the parquet integration. See https://issues.apache.org/jira/browse/SPARK-4412

This will work for Spark 2.0. Edit file spark/log4j.properties and add:

log4j.logger.org.apache.spark.sql.execution.datasources.parquet=ERROR
log4j.logger.org.apache.spark.sql.execution.datasources.FileScanRDD=ERROR
log4j.logger.org.apache.hadoop.io.compress.CodecPool=ERROR

The lines for FileScanRDD and CodecPool will help with a couple of logs that are very verbose as well.
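If you prefer to script this edit rather than hand-editing log4j.properties, a minimal sketch is below. The file path here is a stand-in for demonstration; point it at your real conf/log4j.properties. The append is idempotent, so re-running it will not duplicate lines:

```python
from pathlib import Path

# Stand-in for your installation's conf/log4j.properties (hypothetical path).
LOG4J = Path("/tmp/demo-log4j.properties")
LOG4J.write_text("log4j.rootCategory=INFO, console\n")  # simulate the existing file

# The three logger overrides from the answer above.
quiet_loggers = [
    "log4j.logger.org.apache.spark.sql.execution.datasources.parquet=ERROR",
    "log4j.logger.org.apache.spark.sql.execution.datasources.FileScanRDD=ERROR",
    "log4j.logger.org.apache.hadoop.io.compress.CodecPool=ERROR",
]

existing = LOG4J.read_text()
# Append only the lines that are not already present (safe to run repeatedly).
missing = [line for line in quiet_loggers if line not in existing]
if missing:
    LOG4J.write_text(existing.rstrip("\n") + "\n" + "\n".join(missing) + "\n")
```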

I know this question was asked with respect to Spark, but I recently had this issue when using Parquet with Hive in CDH 5.x and found a work-around. Details are here: https://issues.apache.org/jira/browse/SPARK-4412?focusedCommentId=16118403

Contents of my comment from that JIRA ticket below:

This is also an issue in the version of parquet distributed in CDH 5.x. In this case, I am using parquet-1.5.0-cdh5.8.4 (sources available here: http://archive.cloudera.com/cdh5/cdh/5)

However, I've found a work-around for mapreduce jobs submitted via Hive. I'm sure this can be adapted for use with Spark as well.

  • Add the following properties to your job's configuration (in my case, I added them to hive-site.xml, since adding them to mapred-site.xml didn't work):

<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Djava.util.logging.config.file=parquet-logging.properties</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Djava.util.logging.config.file=parquet-logging.properties</value>
</property>
<property>
  <name>mapreduce.child.java.opts</name>
  <value>-Djava.util.logging.config.file=parquet-logging.properties</value>
</property>
  • Create a file named parquet-logging.properties with the following contents:

# Note: I'm certain not every line here is necessary. I just added them to cover all possible
# class/facility names. You will want to tailor this to your needs.
.level=WARNING
java.util.logging.ConsoleHandler.level=WARNING

parquet.handlers=java.util.logging.ConsoleHandler
parquet.hadoop.handlers=java.util.logging.ConsoleHandler
org.apache.parquet.handlers=java.util.logging.ConsoleHandler
org.apache.parquet.hadoop.handlers=java.util.logging.ConsoleHandler

parquet.level=WARNING
parquet.hadoop.level=WARNING
org.apache.parquet.level=WARNING
org.apache.parquet.hadoop.level=WARNING
  • Add the file to the job. In Hive, this is most easily done like so:
    ADD FILE /path/to/parquet-logging.properties;

With this done, when you run your Hive queries, parquet should only log WARNING (and higher) level messages to the stdout container logs.
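The parquet-logging.properties file above can also be generated rather than written by hand. A minimal sketch, built around the fact that Parquet's classes moved from the parquet.* to the org.apache.parquet.* namespace (so both prefixes are covered, as in the listing above):

```python
from pathlib import Path

# Parquet classes live under both the old `parquet.*` and the new
# `org.apache.parquet.*` namespaces, depending on version; cover both.
PREFIXES = [
    "parquet",
    "parquet.hadoop",
    "org.apache.parquet",
    "org.apache.parquet.hadoop",
]

lines = [".level=WARNING", "java.util.logging.ConsoleHandler.level=WARNING"]
lines += [f"{p}.handlers=java.util.logging.ConsoleHandler" for p in PREFIXES]
lines += [f"{p}.level=WARNING" for p in PREFIXES]

# Written to the current directory so it can be shipped with ADD FILE.
Path("parquet-logging.properties").write_text("\n".join(lines) + "\n")
```

The resulting file is what you would then register in Hive with ADD FILE /path/to/parquet-logging.properties;.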

To turn off all messages except ERROR, you should edit your conf/log4j.properties file, changing the following line:

log4j.rootCategory=INFO, console

into

log4j.rootCategory=ERROR, console
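This one-line substitution can also be made programmatically. A minimal sketch, using a stand-in file path (point it at your real conf/log4j.properties):

```python
from pathlib import Path

# Hypothetical stand-in for conf/log4j.properties.
conf = Path("/tmp/demo-rootcategory.properties")
conf.write_text("log4j.rootCategory=INFO, console\n")  # simulate the default file

# Swap the root logging level from INFO to ERROR, keeping the console appender.
text = conf.read_text().replace(
    "log4j.rootCategory=INFO, console",
    "log4j.rootCategory=ERROR, console",
)
conf.write_text(text)
```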

Hope it helps!

Not a solution, but if you build your own Spark, then this file has most of the log-message generation, which you can comment out for now: https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileReader.java
