How to get rid of derby.log, metastore_db from Spark Shell

Submitted by 孤街浪徒 on 2019-11-26 19:37:54

Question


When running spark-shell, it creates a derby.log file and a metastore_db folder. How do I configure Spark to put these somewhere else?

For the derby log, I've tried Getting rid of derby.log by running spark-shell --driver-memory 10g --conf "spark.driver.extraJavaOptions=-Dderby.stream.info.file=/dev/null" with a couple of different properties, but Spark ignores them.

Does anyone know how to get rid of these or specify a default directory for them?


Answer 1:


The use of hive.metastore.warehouse.dir has been deprecated since Spark 2.0.0; see the docs.

As hinted by this answer, the real culprit for both the metastore_db directory and the derby.log file being created in every working subdirectory is the derby.system.home property, which defaults to "." (the current directory).

Thus, a default location for both can be specified by adding the following line to spark-defaults.conf:

spark.driver.extraJavaOptions -Dderby.system.home=/tmp/derby

where /tmp/derby can be replaced by the directory of your choice.
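
The same setting can also be applied to a single session from the command line instead of persisting it in spark-defaults.conf, using the same property:

$ spark-shell --conf "spark.driver.extraJavaOptions=-Dderby.system.home=/tmp/derby"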




Answer 2:


For spark-shell, to avoid having the metastore_db directory appear, and to avoid doing it in code (since the context/session is already created, and you won't stop it and recreate it with the new configuration each time), you have to set its location in a hive-site.xml file and copy that file into Spark's conf directory.
A sample hive-site.xml that places metastore_db under /tmp (refer to my answer here):

<configuration>
   <property>
     <name>javax.jdo.option.ConnectionURL</name>
     <value>jdbc:derby:;databaseName=/tmp/metastore_db;create=true</value>
     <description>JDBC connect string for a JDBC metastore</description>
   </property>
   <property>
     <name>javax.jdo.option.ConnectionDriverName</name>
     <value>org.apache.derby.jdbc.EmbeddedDriver</value>
     <description>Driver class name for a JDBC metastore</description>
   </property>
   <property>
      <name>hive.metastore.warehouse.dir</name>
      <value>/tmp/</value>
      <description>location of default database for the warehouse</description>
   </property>
</configuration>
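
Assuming a standard installation with SPARK_HOME set (an assumption; adjust the paths to your layout), copying the file into place looks like:

$ cp hive-site.xml $SPARK_HOME/conf/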

After that, you can start spark-shell as follows to get rid of derby.log as well:

$ spark-shell --conf "spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp"

Note that derby.stream.error.file names a file, so a value such as /tmp/derby.log may be closer to what you want.



Answer 3:


Try setting derby.system.home to some other directory as a system property before firing up the Spark shell; Derby will create new databases there. The default value for this property is "." (the current directory).
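
For example, one way to pass the system property from the command line (a sketch; it relies on the launch scripts honoring SPARK_SUBMIT_OPTS, which the stock bin/spark-shell script does):

$ SPARK_SUBMIT_OPTS="-Dderby.system.home=/tmp/derby" spark-shell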

Reference: https://db.apache.org/derby/integrate/plugin_help/properties.html




Answer 4:


Use the spark.sql.warehouse.dir property, which replaces the deprecated hive.metastore.warehouse.dir. From the docs:

import org.apache.spark.sql.SparkSession

// Any writable path will do; the docs use the absolute path of "spark-warehouse"
val warehouseLocation = "/tmp/spark-warehouse"

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()

For the derby log: Getting rid of derby.log could be the answer. In general, create a derby.properties file in your working directory with the following content:

derby.stream.error.file=/path/to/desired/log/file
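
If you simply want to discard the log on a Unix-like system, /dev/null also works as the target:

derby.stream.error.file=/dev/null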



Answer 5:


If you are using Jupyter/JupyterHub/JupyterLab, or are just setting this configuration parameter from Python, the following will work:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
    .setMaster("local[*]")
    # Point Derby's system home (and hence derby.log and metastore_db) at /tmp/derby
    .set('spark.driver.extraJavaOptions', '-Dderby.system.home=/tmp/derby')
)

sc = SparkContext(conf=conf)
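
For newer PySpark (2.0+), where SparkSession is the usual entry point, a minimal sketch of the same idea (the /tmp/derby path is just an example):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .master("local[*]")
    # Same Derby system property, passed as a driver JVM option
    .config("spark.driver.extraJavaOptions", "-Dderby.system.home=/tmp/derby")
    .getOrCreate())

sc = spark.sparkContext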


Source: https://stackoverflow.com/questions/38377188/how-to-get-rid-of-derby-log-metastore-db-from-spark-shell
