Can't connect to Azure Data Lake Gen2 using PySpark and Databricks Connect

Submitted by 折月煮酒 on 2021-02-08 03:59:29

Question


Recently, Databricks launched Databricks Connect, which

allows you to write jobs using Spark native APIs and have them execute remotely on an Azure Databricks cluster instead of in the local Spark session.

It works fine except when I try to access files in Azure Data Lake Storage Gen2. When I execute this:

spark.read.json("abfss://...").count()

I get this error:

java.lang.RuntimeException: java.lang.ClassNotFoundException: Class shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem not found   at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)

Does anybody know how to fix this?

Further information:

  • databricks-connect version: 5.3.1

Answer 1:


If you mount the storage using a service principal, rather than accessing it directly, you should find this works: https://docs.databricks.com/spark/latest/data-sources/azure/azure-datalake-gen2.html

I posted some notes on the limitations of Databricks Connect here: https://datathirst.net/blog/2019/3/7/databricks-connect-limitations
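For reference, a minimal sketch of the mount approach described in the linked docs. The helper below just builds the OAuth (service principal) Spark configs; the account, container, tenant, and secret-scope names in the commented usage are placeholders, and `dbutils` is only available inside a Databricks environment:

```python
def adls_oauth_configs(client_id, client_secret, tenant_id):
    """Build the extra_configs dict for mounting ADLS Gen2 with a service principal."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }

# On a Databricks cluster (placeholder names, not runnable locally):
# configs = adls_oauth_configs(
#     "<application-id>",
#     dbutils.secrets.get(scope="<scope>", key="<service-credential-key>"),
#     "<tenant-id>")
# dbutils.fs.mount(
#     source="abfss://<container>@<account>.dfs.core.windows.net/",
#     mount_point="/mnt/mydata",
#     extra_configs=configs)
# spark.read.json("/mnt/mydata/some/path").count()
```

Reading via the mount point (`/mnt/...`) sidesteps the shaded-filesystem class lookup that fails under Databricks Connect with direct `abfss://` paths.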




Answer 2:


Likely too late, but for completeness' sake: there is one issue to look out for here. If you have this Spark conf set, you will see that exact error (which is pretty hard to unpack):

fs.abfss.impl org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem

So double-check your Spark configs, remove that setting if present, and make sure you have permission to access ADLS Gen2 directly using the storage account access key.
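A quick way to scan for the problematic setting, sketched here with a hypothetical helper. The idea is that Databricks Connect expects the shaded class name (`shaded.databricks...`), so an explicit `fs.abfss.impl` pointing at the plain Hadoop class triggers the `ClassNotFoundException` above:

```python
def find_offending_confs(spark_confs):
    """Return conf keys whose values point at the unshaded Hadoop ABFS classes,
    which conflict with the shaded class names Databricks Connect expects."""
    return [key for key, value in spark_confs.items()
            if value.startswith("org.apache.hadoop.fs.azurebfs.")]

# On a real cluster (sketch, placeholder names):
# confs = dict(spark.sparkContext.getConf().getAll())
# for key in find_offending_confs(confs):
#     print("Remove this from the cluster Spark config:", key)
#
# Direct access with the storage account key (set in cluster config or session):
# spark.conf.set(
#     "fs.azure.account.key.<account>.dfs.core.windows.net",
#     dbutils.secrets.get(scope="<scope>", key="<account-key>"))
```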



Source: https://stackoverflow.com/questions/56702280/cant-connect-to-azure-data-lake-gen2-using-pyspark-and-databricks-connect
