databricks

How to properly access dbutils in Scala when using Databricks Connect

只愿长相守 submitted on 2021-02-07 02:46:44
Question: I'm using Databricks Connect to run code in my Azure Databricks cluster locally from IntelliJ IDEA (Scala). Everything works fine: I can connect, debug, and inspect locally in the IDE. I created a Databricks Job to run my custom app JAR, but it fails with the following exception: 19/08/17 19:20:26 ERROR Uncaught throwable from user code: java.lang.NoClassDefFoundError: com/databricks/service/DBUtils$ at Main$.<init>(Main.scala:30) at Main$.<clinit>(Main.scala) Line 30 of my Main.scala class is
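
As a point of reference (not the Scala fix the question asks for), the Python flavor of Databricks Connect exposes the same utilities through the pyspark.dbutils module that ships with Databricks Runtime and Databricks Connect. A minimal sketch, assuming a Databricks Connect session is already configured:

from pyspark.sql import SparkSession
from pyspark.dbutils import DBUtils  # available with Databricks Runtime / Databricks Connect

# Reuse (or create) the Spark session that Databricks Connect points at the remote cluster.
spark = SparkSession.builder.getOrCreate()

# DBUtils is built from the session, so the same code runs locally and on the cluster.
dbutils = DBUtils(spark)

# Illustrative call: list the DBFS root.
print(dbutils.fs.ls("dbfs:/"))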

Perform a user defined function on a column of a large pyspark dataframe based on some columns of another pyspark dataframe on databricks

三世轮回 submitted on 2021-01-29 18:10:15
Question: My question is related to my previous one at How to efficiently join large pyspark dataframes and small python list for some NLP results on databricks. I have worked out part of it and am now stuck on another problem. I have a small pyspark dataframe like: df1: +-----+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+ |topic| termIndices| termWeights| terms| +-----+---------------------------
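
One common pattern for this kind of problem is to collect the small dataframe into a plain Python structure, broadcast it, and look it up inside a UDF applied to the large dataframe. A minimal sketch under assumptions: df1 is the small dataframe shown above, and the large dataframe (here df_large) is assumed to have a matching topic column:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# Collect the small dataframe (a handful of topics) into a dict and broadcast it to the executors.
topic_terms = {row["topic"]: row["terms"] for row in df1.collect()}
bc_topic_terms = spark.sparkContext.broadcast(topic_terms)

@F.udf(returnType=ArrayType(StringType()))
def terms_for_topic(topic):
    # Look up the broadcast dict; unknown topics get an empty list.
    return bc_topic_terms.value.get(topic, [])

# df_large and its "topic" column are assumptions about the large dataframe's schema.
df_result = df_large.withColumn("topic_terms", terms_for_topic(F.col("topic")))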

Getting an error when connecting to a local SQL Server database from Databricks via a JDBC connection

房东的猫 submitted on 2021-01-29 17:40:34
Question: Basically I'm trying to connect to a SQL Server database on my local machine from Databricks using a JDBC connection. I'm following the procedure mentioned in the documentation, as shown here on the Databricks website. I used the following code as mentioned on the website: jdbcHostname = "localhost" jdbcDatabase = "TestDB" jdbcPort = "3306" jdbcUrl = "jdbc:mysql://{0}:{1}/{2}".format(jdbcHostname, jdbcPort, jdbcDatabase) connectionProperties = { "jdbcUsername" : "user1", "jdbcPassword" :
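
Two things stand out in the snippet above: the URL and port are MySQL's (jdbc:mysql, 3306) even though the target is SQL Server, and the property keys jdbcUsername/jdbcPassword are not what the JDBC driver expects (it wants user/password). A corrected sketch for SQL Server; the hostname, table name, and credentials are placeholders, and note that "localhost" on a Databricks cluster refers to the driver node, not your workstation, so the database must be reachable from the cluster's network:

jdbcHostname = "<host reachable from the Databricks cluster>"  # not "localhost"
jdbcDatabase = "TestDB"
jdbcPort = 1433  # default SQL Server port

jdbcUrl = "jdbc:sqlserver://{0}:{1};databaseName={2}".format(jdbcHostname, jdbcPort, jdbcDatabase)

connectionProperties = {
    "user": "user1",           # key must be "user", not "jdbcUsername"
    "password": "<password>",  # key must be "password", not "jdbcPassword"
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# "dbo.SomeTable" is a placeholder table name.
df = spark.read.jdbc(url=jdbcUrl, table="dbo.SomeTable", properties=connectionProperties)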

Read CSV file in pyspark with ANSI encoding

荒凉一梦 submitted on 2021-01-29 13:25:54
Question: I am trying to read in a CSV/text file that needs to be read using ANSI encoding. However this is not working. Any ideas? mainDF= spark.read.format("csv")\ .option("encoding","ANSI")\ .option("header","true")\ .option("maxRowsInMemory",1000)\ .option("inferSchema","false")\ .option("delimiter", "¬")\ .load(path) java.nio.charset.UnsupportedCharsetException: ANSI The file is over 5GB, hence the Spark requirement. I have also tried ANSI in lower case. Answer 1: ISO-8859-1 is the same as ANSI
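
The JVM has no charset literally named "ANSI", which is why the UnsupportedCharsetException is thrown; the usual fix is to pass the concrete code page that "ANSI" means on the machine that produced the file, typically windows-1252 (or the closely related ISO-8859-1 the answer mentions). A minimal sketch, keeping the original options:

mainDF = (spark.read.format("csv")
          .option("encoding", "windows-1252")  # or "ISO-8859-1", as suggested in the answer
          .option("header", "true")
          .option("inferSchema", "false")
          .option("delimiter", "¬")
          .load(path))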

How to make the GPU visible to the ML runtime environment on Databricks?

隐身守侯 submitted on 2021-01-29 08:32:10
Question: I am trying to run some TensorFlow (2.2) example code on a Databricks GPU cluster (p2.xlarge) with the following environment: 6.6 ML, Spark 2.4.5, GPU, Scala 2.11; Keras version: 2.2.5; nvidia-smi: NVIDIA-SMI 440.64.00, Driver Version: 440.64.00, CUDA Version: 10.2. I have checked https://docs.databricks.com/applications/deep-learning/single-node-training/tensorflow.html#install-tensorflow-22-on-databricks-runtime-66-ml&language-GPU but I do not want to run the shell commands every time the Databricks GPU cluster is
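
A quick way to confirm whether the runtime actually sees the GPU, independent of how TensorFlow was installed, is to ask TensorFlow for its physical devices. A minimal check that can run in a notebook cell (TensorFlow 2.x API):

import tensorflow as tf

# Lists the GPUs TensorFlow can see; an empty list means the driver/CUDA/TF versions do not line up.
gpus = tf.config.list_physical_devices("GPU")
print("TensorFlow version:", tf.__version__)
print("Visible GPUs:", gpus)

To avoid re-running shell setup commands on every cluster start, cluster-scoped init scripts are the usual Databricks mechanism (not shown here).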

Installing Maven library on Databricks via Python commands and dbutils

只谈情不闲聊 submitted on 2021-01-29 08:07:07
Question: On Databricks I would like to install a Maven library through commands in a Python notebook if it's not already installed. If it were a Python PyPI library I would do something like the following: # Get a list of all available libraries library_name_list = dbutils.library.list() # Suppose the library of interest was "scikit-learn" if "scikit-learn" not in library_name_list: # Install the library dbutils.library.installPyPI("scikit-learn") How can I do the same for a Maven library "com.microsoft
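
dbutils.library does not appear to expose a Maven installer, so one workaround is the cluster Libraries REST API, which does accept Maven coordinates. A hedged sketch using the documented /api/2.0/libraries/install endpoint; the workspace URL, token, cluster id, and the groupId:artifactId:version coordinate below are placeholders, not the asker's actual library:

import requests

DATABRICKS_HOST = "https://<workspace>.azuredatabricks.net"   # placeholder workspace URL
TOKEN = "<personal-access-token>"                             # placeholder token
CLUSTER_ID = "<cluster-id>"                                   # placeholder cluster id
MAVEN_COORDINATE = "com.example:some-library_2.11:1.0.0"      # hypothetical coordinate

# Ask the cluster to install the Maven library.
resp = requests.post(
    DATABRICKS_HOST + "/api/2.0/libraries/install",
    headers={"Authorization": "Bearer " + TOKEN},
    json={
        "cluster_id": CLUSTER_ID,
        "libraries": [{"maven": {"coordinates": MAVEN_COORDINATE}}],
    },
)
resp.raise_for_status()

The companion /api/2.0/libraries/cluster-status endpoint can be queried first to skip the call when the coordinate is already installed on the cluster.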

Databricks dbutils throwing NullPointerException

余生长醉 submitted on 2021-01-29 07:22:02
Question: Trying to read a secret from Azure Key Vault using Databricks dbutils, but facing the following exception: OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0 Warning: Ignoring non-Spark config property: eventLog.rolloverIntervalSeconds Exception in thread "main" java.lang.NullPointerException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect
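
For contrast, the call itself is straightforward when dbutils is actually available (notebooks and jobs launched through Databricks; the NullPointerException above is commonly reported when dbutils is used in a context where Databricks does not inject it, such as a plain spark-submit job). A minimal Python sketch with hypothetical scope and key names:

# Hypothetical scope/key names; the scope must be backed by the Azure Key Vault in question.
secret_value = dbutils.secrets.get(scope="my-keyvault-scope", key="my-secret-name")

# The value is redacted when printed in a notebook, but can be passed on to e.g. a JDBC config.
print(len(secret_value))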

Factor Analysis using sparklyr in Databricks

丶灬走出姿态 submitted on 2021-01-29 06:13:50
Question: I would like to perform a factor analysis by using dplyr::collect() in Databricks, but because of the data's size I am getting this error: Error : org.apache.spark.sql.execution.OutOfMemorySparkException: Total memory usage during row decode exceeds spark.driver.maxResultSize (4.0 GB). The average row size was 82.0 B. Is there a function in sparklyr with which I can do this analysis without collecting the data? Source: https://stackoverflow.com/questions/64113459/factor-analysis-using-sparklyr-in

Efficient way of reading parquet files between a date range in Azure Databricks

柔情痞子 submitted on 2021-01-29 04:31:55
Question: I would like to know if the pseudocode below is an efficient method for reading multiple parquet files between a date range stored in Azure Data Lake from PySpark (Azure Databricks). Note: the parquet files are not partitioned by date. I'm using the uat/EntityName/2019/01/01/EntityName_2019_01_01_HHMMSS.parquet convention for storing data in ADL, as suggested in the book Big Data by Nathan Marz, with a slight modification (using 2019 instead of year=2019). Read all data using the * wildcard: df = spark.read.parquet
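
Since the files are not partitioned by date but the directory layout encodes the date, one common alternative to a bare * wildcard is to build the explicit list of day-level paths for the requested range and hand that list to the reader, so Spark never lists directories outside the range. A minimal sketch, assuming the uat/EntityName/yyyy/MM/dd layout described above and a placeholder ADLS base path:

from datetime import date, timedelta

base_path = "adl://<account>.azuredatalakestore.net/uat/EntityName"  # placeholder ADLS path

def day_paths(start, end):
    # Yield one directory path per day in the inclusive range, matching the yyyy/MM/dd layout.
    d = start
    while d <= end:
        yield "{0}/{1:%Y/%m/%d}".format(base_path, d)
        d += timedelta(days=1)

paths = list(day_paths(date(2019, 1, 1), date(2019, 1, 31)))

# spark.read.parquet accepts multiple paths, so only these directories are listed and scanned.
df = spark.read.parquet(*paths)

If some days in the range may have no directory at all, the list can be filtered (for example with dbutils.fs.ls) before reading, since missing paths would otherwise fail the read.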

Databricks: Equivalent code for SQL query

China☆狼群 submitted on 2021-01-28 18:22:16
Question: I'm looking for the equivalent Databricks code for the query below. I added some sample code and the expected output as well; for the moment I'm stuck on the CROSS APPLY STRING_SPLIT part. Sample SQL data: CREATE TABLE FactTurnover ( ID INT, SalesPriceExcl NUMERIC (9,4), Discount VARCHAR(100) ) INSERT INTO FactTurnover VALUES (1, 100, '10'), (2, 39.5877, '58, 12'), (3, 100, '50, 10, 15'), (4, 100, 'B') Query: ;WITH CTE AS
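
In PySpark the usual counterpart of CROSS APPLY STRING_SPLIT is split plus explode: split the Discount string into an array and explode it so each element becomes its own row. A minimal sketch against a dataframe with the columns from the sample table (factTurnover is an assumed dataframe name; handling of non-numeric values like 'B' is left as in the source data):

from pyspark.sql import functions as F

# factTurnover is assumed to be a DataFrame with the ID, SalesPriceExcl and Discount columns.
exploded = factTurnover.withColumn(
    "Discount", F.explode(F.split(F.col("Discount"), ",\\s*"))
)

# Each source row now appears once per comma-separated discount value,
# which mirrors CROSS APPLY STRING_SPLIT(Discount, ',') in T-SQL.
exploded.show()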