Question
When running spark-shell, it creates a file derby.log and a folder metastore_db. How do I configure Spark to put these somewhere else?
For the derby log, I've tried the approach from Getting rid of derby.log, e.g. spark-shell --driver-memory 10g --conf "spark.driver.extraJavaOptions=-Dderby.stream.info.file=/dev/null", with a couple of different properties, but Spark ignores them.
Does anyone know how to get rid of these or specify a default directory for them?
Answer 1:
The use of hive.metastore.warehouse.dir is deprecated since Spark 2.0.0; see the docs.
As hinted by this answer, the real culprit for both the metastore_db directory and the derby.log file being created in every working subdirectory is the derby.system.home property defaulting to "." (the current directory).
Thus, a default location for both can be specified by adding the following line to spark-defaults.conf:
spark.driver.extraJavaOptions -Dderby.system.home=/tmp/derby
where /tmp/derby can be replaced by a directory of your choice.
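If you'd rather not edit spark-defaults.conf, the same JVM option can also be passed per session on the command line (a minimal sketch using the same example path):

$ spark-shell --conf "spark.driver.extraJavaOptions=-Dderby.system.home=/tmp/derby"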
Answer 2:
For spark-shell, to avoid having the metastore_db directory appear, and to avoid setting it in code (since the context/session is already created and you won't stop it and recreate it with the new configuration each time), you have to set its location in a hive-site.xml file and copy that file into Spark's conf directory (see the copy step after the sample below).
A sample hive-site.xml file that places metastore_db in /tmp (refer to my answer here):
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/tmp/metastore_db;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/tmp/</value>
    <description>location of default database for the warehouse</description>
  </property>
</configuration>
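Then copy the file into Spark's conf directory (a sketch assuming $SPARK_HOME points at your Spark installation):

$ cp hive-site.xml $SPARK_HOME/conf/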
After that, you can start spark-shell as follows to get rid of derby.log as well:
$ spark-shell --conf "spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp/derby.log"
Answer 3:
Try setting derby.system.home to some other directory as a system property before firing up the Spark shell. Derby will create new databases there. The default value for this property is "." (the current directory).
Reference: https://db.apache.org/derby/integrate/plugin_help/properties.html
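For spark-shell, one way to pass that system property is via the driver's JVM options; a minimal sketch (the directory is just an example):

$ spark-shell --driver-java-options "-Dderby.system.home=/tmp/derby"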
Answer 4:
Use the spark.sql.warehouse.dir property (hive.metastore.warehouse.dir is deprecated, as noted above). From the docs:
import java.io.File
import org.apache.spark.sql.SparkSession

// example warehouse location; any writable path works
val warehouseLocation = new File("spark-warehouse").getAbsolutePath

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()
For the derby log: Getting rid of derby.log could be the answer. In general, create a derby.properties file in your working directory with the following content:
derby.stream.error.file=/path/to/desired/log/file
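As a quick sketch (the log path here is just an example), the file can be created from the shell before launching Spark:

$ echo "derby.stream.error.file=/tmp/derby.log" > derby.properties
$ spark-shell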
Answer 5:
If you are using Jupyter/JupyterHub/JupyterLab, or just setting this conf parameter inside Python, the following will work:
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[*]")
        .set('spark.driver.extraJavaOptions', '-Dderby.system.home=/tmp/derby'))

sc = SparkContext(conf=conf)
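If you prefer the SparkSession API over a raw SparkContext, the same option can be set when building the session; a minimal sketch (it only takes effect because the driver JVM is launched when the first session is created, so it must run before any Spark context exists):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .config("spark.driver.extraJavaOptions", "-Dderby.system.home=/tmp/derby")
         .getOrCreate())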
Source: https://stackoverflow.com/questions/38377188/how-to-get-rid-of-derby-log-metastore-db-from-spark-shell