How to properly access dbutils in Scala when using Databricks Connect

Submitted by 只愿长相守 on 2021-02-07 02:46:44

Question


I'm using Databricks Connect to run code in my Azure Databricks cluster locally from IntelliJ IDEA (Scala).

Everything works fine. I can connect, debug, inspect locally in the IDE.

I created a Databricks Job to run my custom app JAR, but it fails with the following exception:

19/08/17 19:20:26 ERROR Uncaught throwable from user code: java.lang.NoClassDefFoundError: com/databricks/service/DBUtils$
at Main$.<init>(Main.scala:30)
at Main$.<clinit>(Main.scala)

Line 30 of my Main.scala class is

val dbutils: DBUtils.type = com.databricks.service.DBUtils

This is exactly how it's described on this documentation page.

That page shows a way to access DBUtils that works both locally and in the cluster, but the example only covers Python, and I'm using Scala.

What's the proper way to access it in a way that works both locally using databricks-connect and in a Databricks Job running a JAR?

UPDATE

It seems there are two ways of using DBUtils.

1) The DbUtils class described here. Quoting the docs, this library allows you to build and compile the project, but not run it. This doesn't let you run your local code on the cluster.

2) The Databricks Connect described here. This one allows you to run your local Spark code in a Databricks cluster.

The problem is that these two methods have different setups and package names. There doesn't seem to be a way to use Databricks Connect locally (it isn't available in the cluster) while also adding the DbUtils class to the application jar via sbt/maven so that the cluster has access to it.
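To illustrate the mismatch, these are roughly the two access paths (package names as documented; only one of them is normally on the classpath at a time):

// the dbutils-api library: compiles anywhere, but the implementation only exists on a cluster
import com.databricks.dbutils_v1.DBUtilsHolder.dbutils

// the Databricks Connect client: runs locally and forwards calls to the cluster
// import com.databricks.service.DBUtils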


Answer 1:


I don't know why the docs you mentioned don't work. Maybe you're using a different dependency?

These docs have an example application you can download. It's a project with a very minimal test, so it doesn't create jobs or try to run them on the cluster -- but it's a start. Also, please note that it uses the older 0.0.1 version of dbutils-api.

So to fix your current issue, instead of using com.databricks.service.DBUtils, try importing the dbutils from a different place:

import com.databricks.dbutils_v1.DBUtilsHolder.dbutils

Or, if you prefer:

import com.databricks.dbutils_v1.{DBUtilsV1, DBUtilsHolder}

type DBUtils = DBUtilsV1
val dbutils: DBUtils = DBUtilsHolder.dbutils
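Once that import resolves, usage is the same as in a notebook. A quick sanity check could look like this (the path is just an example; both calls only return real results when the code runs on a Databricks cluster):

import com.databricks.dbutils_v1.DBUtilsHolder.dbutils

// list the DBFS root and the secret scopes visible to the cluster
dbutils.fs.ls("dbfs:/").foreach(fi => println(fi.path))
dbutils.secrets.listScopes().foreach(println)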

Also, make sure that you have the following dependency in SBT (maybe try to play with versions if 0.0.3 doesn't work -- the latest one is 0.0.4):

libraryDependencies += "com.databricks" % "dbutils-api_2.11" % "0.0.3"

This question and answer pointed me in the right direction. The answer contains a link to a working GitHub repo that uses dbutils: waimak. I hope this repo helps with further questions about Databricks configuration and dependencies.

Good luck!


UPDATE

I see, so we have two similar but not identical APIs, and no good way to switch between the local and the backend version (though Databricks Connect promises that it should work anyhow). Please let me propose a workaround.

Fortunately, Scala is convenient for writing adapters. Here's a code snippet that should work as a bridge: it defines a DBUtils object that provides a sufficient abstraction over the two versions of the API -- the Databricks Connect one on com.databricks.service.DBUtils and the backend com.databricks.dbutils_v1.DBUtilsHolder.dbutils API. We achieve that by loading and using com.databricks.service.DBUtils through reflection, so there is no hard-coded import of it.

package com.example.my.proxy.adapter

import org.apache.hadoop.fs.FileSystem
import org.apache.spark.sql.catalyst.DefinedByConstructorParams

import scala.util.Try

import scala.language.implicitConversions
import scala.language.reflectiveCalls


trait DBUtilsApi {
    type FSUtils
    type FileInfo

    type SecretUtils
    type SecretMetadata
    type SecretScope

    val fs: FSUtils
    val secrets: SecretUtils
}

trait DBUtils extends DBUtilsApi {
    trait FSUtils {
        def dbfs: org.apache.hadoop.fs.FileSystem
        def ls(dir: String): Seq[FileInfo]
        def rm(dir: String, recurse: Boolean = false): Boolean
        def mkdirs(dir: String): Boolean
        def cp(from: String, to: String, recurse: Boolean = false): Boolean
        def mv(from: String, to: String, recurse: Boolean = false): Boolean
        def head(file: String, maxBytes: Int = 65536): String
        def put(file: String, contents: String, overwrite: Boolean = false): Boolean
    }

    case class FileInfo(path: String, name: String, size: Long)

    trait SecretUtils {
        def get(scope: String, key: String): String
        def getBytes(scope: String, key: String): Array[Byte]
        def list(scope: String): Seq[SecretMetadata]
        def listScopes(): Seq[SecretScope]
    }

    case class SecretMetadata(key: String) extends DefinedByConstructorParams
    case class SecretScope(name: String) extends DefinedByConstructorParams
}

object DBUtils extends DBUtils {

    import Adapters._

    override lazy val (fs, secrets): (FSUtils, SecretUtils) = Try[(FSUtils, SecretUtils)](
        (ReflectiveDBUtils.fs, ReflectiveDBUtils.secrets)    // try to use the Databricks Connect API
    ).getOrElse(
        (BackendDBUtils.fs, BackendDBUtils.secrets)    // if it's not available, use com.databricks.dbutils_v1.DBUtilsHolder
    )

    private object Adapters {
        // The apparent code copying here is for performance -- the ones for `ReflectiveDBUtils` use reflection, while
        // the `BackendDBUtils` call the functions directly.

        implicit class FSUtilsFromBackend(underlying: BackendDBUtils.FSUtils) extends FSUtils {
            override def dbfs: FileSystem = underlying.dbfs
            override def ls(dir: String): Seq[FileInfo] = underlying.ls(dir).map(fi => FileInfo(fi.path, fi.name, fi.size))
            override def rm(dir: String, recurse: Boolean = false): Boolean = underlying.rm(dir, recurse)
            override def mkdirs(dir: String): Boolean = underlying.mkdirs(dir)
            override def cp(from: String, to: String, recurse: Boolean = false): Boolean = underlying.cp(from, to, recurse)
            override def mv(from: String, to: String, recurse: Boolean = false): Boolean = underlying.mv(from, to, recurse)
            override def head(file: String, maxBytes: Int = 65536): String = underlying.head(file, maxBytes)
            override def put(file: String, contents: String, overwrite: Boolean = false): Boolean = underlying.put(file, contents, overwrite)
        }

        implicit class FSUtilsFromReflective(underlying: ReflectiveDBUtils.FSUtils) extends FSUtils {
            override def dbfs: FileSystem = underlying.dbfs
            override def ls(dir: String): Seq[FileInfo] = underlying.ls(dir).map(fi => FileInfo(fi.path, fi.name, fi.size))
            override def rm(dir: String, recurse: Boolean = false): Boolean = underlying.rm(dir, recurse)
            override def mkdirs(dir: String): Boolean = underlying.mkdirs(dir)
            override def cp(from: String, to: String, recurse: Boolean = false): Boolean = underlying.cp(from, to, recurse)
            override def mv(from: String, to: String, recurse: Boolean = false): Boolean = underlying.mv(from, to, recurse)
            override def head(file: String, maxBytes: Int = 65536): String = underlying.head(file, maxBytes)
            override def put(file: String, contents: String, overwrite: Boolean = false): Boolean = underlying.put(file, contents, overwrite)
        }

        implicit class SecretUtilsFromBackend(underlying: BackendDBUtils.SecretUtils) extends SecretUtils {
            override def get(scope: String, key: String): String = underlying.get(scope, key)
            override def getBytes(scope: String, key: String): Array[Byte] = underlying.getBytes(scope, key)
            override def list(scope: String): Seq[SecretMetadata] = underlying.list(scope).map(sm => SecretMetadata(sm.key))
            override def listScopes(): Seq[SecretScope] = underlying.listScopes().map(ss => SecretScope(ss.name))
        }

        implicit class SecretUtilsFromReflective(underlying: ReflectiveDBUtils.SecretUtils) extends SecretUtils {
            override def get(scope: String, key: String): String = underlying.get(scope, key)
            override def getBytes(scope: String, key: String): Array[Byte] = underlying.getBytes(scope, key)
            override def list(scope: String): Seq[SecretMetadata] = underlying.list(scope).map(sm => SecretMetadata(sm.key))
            override def listScopes(): Seq[SecretScope] = underlying.listScopes().map(ss => SecretScope(ss.name))
        }
    }
}

object BackendDBUtils extends DBUtilsApi {
    import com.databricks.dbutils_v1

    private lazy val dbutils: DBUtils = dbutils_v1.DBUtilsHolder.dbutils
    override lazy val fs: FSUtils = dbutils.fs
    override lazy val secrets: SecretUtils = dbutils.secrets

    type DBUtils = dbutils_v1.DBUtilsV1
    type FSUtils = dbutils_v1.DbfsUtils
    type FileInfo = com.databricks.backend.daemon.dbutils.FileInfo

    type SecretUtils = dbutils_v1.SecretUtils
    type SecretMetadata = dbutils_v1.SecretMetadata
    type SecretScope = dbutils_v1.SecretScope
}

object ReflectiveDBUtils extends DBUtilsApi {
    // This throws a ClassNotFoundException when the Databricks Connect API isn't available -- much better than
    // the NoClassDefFoundError we would get from a hard-coded import of com.databricks.service.DBUtils.
    // Because we only use reflection, we can recover when the class isn't found.
    private lazy val dbutils: DBUtils =
        Class.forName("com.databricks.service.DBUtils$").getField("MODULE$").get(null).asInstanceOf[DBUtils]

    override lazy val fs: FSUtils = dbutils.fs
    override lazy val secrets: SecretUtils = dbutils.secrets

    type DBUtils = AnyRef {
        val fs: FSUtils
        val secrets: SecretUtils
    }

    type FSUtils = AnyRef {
        def dbfs: org.apache.hadoop.fs.FileSystem
        def ls(dir: String): Seq[FileInfo]
        def rm(dir: String, recurse: Boolean): Boolean
        def mkdirs(dir: String): Boolean
        def cp(from: String, to: String, recurse: Boolean): Boolean
        def mv(from: String, to: String, recurse: Boolean): Boolean
        def head(file: String, maxBytes: Int): String
        def put(file: String, contents: String, overwrite: Boolean): Boolean
    }

    type FileInfo = AnyRef {
        val path: String
        val name: String
        val size: Long
    }

    type SecretUtils = AnyRef {
        def get(scope: String, key: String): String
        def getBytes(scope: String, key: String): Array[Byte]
        def list(scope: String): Seq[SecretMetadata]
        def listScopes(): Seq[SecretScope]
    }

    type SecretMetadata = DefinedByConstructorParams { val key: String }

    type SecretScope = DefinedByConstructorParams { val name: String }
}

If you replace the val dbutils: DBUtils.type = com.databricks.service.DBUtils from your Main with val dbutils: DBUtils.type = com.example.my.proxy.adapter.DBUtils, everything should work as a drop-in replacement, both locally and remotely.
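In other words, the only change in the application is the import; roughly (names taken from the adapter above):

import com.example.my.proxy.adapter.DBUtils

object Main {
    // The adapter picks the Databricks Connect API when it's on the classpath
    // and falls back to com.databricks.dbutils_v1.DBUtilsHolder otherwise.
    val dbutils: DBUtils.type = DBUtils

    def main(args: Array[String]): Unit = {
        dbutils.fs.ls("dbfs:/").foreach(fi => println(fi.path))
    }
}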

If you run into new NoClassDefFoundErrors, try adding specific dependencies to the JAR job, or try rearranging them, changing their versions, or marking the dependencies as provided.
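For example, marking the libraries that already exist on the cluster as provided keeps them out of the assembled JAR (a sketch only; artifact names and versions are examples):

// keep cluster-supplied libraries out of the job JAR
libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-sql"        % "2.4.3" % "provided",
    "com.databricks"   %  "dbutils-api_2.11" % "0.0.3" % "provided"
)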

This adapter isn't pretty, and it uses reflection, but it should be good enough as a workaround, I hope. Good luck :)




Answer 2:


To access the dbutils.fs and dbutils.secrets Databricks Utilities, you use the DBUtils module.

Example: accessing DBUtils in Scala looks like this:

val dbutils = com.databricks.service.DBUtils
println(dbutils.fs.ls("dbfs:/"))
println(dbutils.secrets.listScopes())

Reference: Databricks - Accessing DBUtils.

Hope this helps.



Source: https://stackoverflow.com/questions/58941808/how-to-properly-access-dbutils-in-scala-when-using-databricks-connect
