databricks

How to properly access dbutils in Scala when using Databricks Connect

只愿长相守 submitted on 2021-02-07 02:46:44
Question: I'm using Databricks Connect to run code in my Azure Databricks cluster locally from IntelliJ IDEA (Scala). Everything works fine: I can connect, debug, and inspect locally in the IDE. I created a Databricks Job to run my custom app JAR, but it fails with the following exception: 19/08/17 19:20:26 ERROR Uncaught throwable from user code: java.lang.NoClassDefFoundError: com/databricks/service/DBUtils$ at Main$.<init>(Main.scala:30) at Main$.<clinit>(Main.scala) Line 30 of my Main.scala class is
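
As a point of reference (not the Scala fix the question asks for), the Python flavor of Databricks Connect exposes the same utilities through the pyspark.dbutils module that ships with Databricks Runtime and Databricks Connect. A minimal sketch, assuming a Databricks Connect session is already configured:

from pyspark.sql import SparkSession
from pyspark.dbutils import DBUtils  # available with Databricks Runtime / Databricks Connect

# Reuse (or create) the Spark session that Databricks Connect points at the remote cluster.
spark = SparkSession.builder.getOrCreate()

# DBUtils is built from the session, so the same code runs locally and on the cluster.
dbutils = DBUtils(spark)

# Illustrative call: list the DBFS root.
print(dbutils.fs.ls("dbfs:/"))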

Perform a user defined function on a column of a large pyspark dataframe based on some columns of another pyspark dataframe on databricks

三世轮回 submitted on 2021-01-29 18:10:15
Question: My question is related to my previous one at How to efficiently join large pyspark dataframes and small python list for some NLP results on databricks. I have worked out part of it and am now stuck on another problem. I have a small pyspark dataframe like: df1: +-----+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+ |topic| termIndices| termWeights| terms| +-----+---------------------------
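
One common pattern for this kind of problem is to collect the small dataframe into a plain Python structure, broadcast it, and look it up inside a UDF applied to the large dataframe. A minimal sketch under assumptions: df1 is the small dataframe shown above, and the large dataframe (here df_large) is assumed to have a matching topic column:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# Collect the small dataframe (a handful of topics) into a dict and broadcast it to the executors.
topic_terms = {row["topic"]: row["terms"] for row in df1.collect()}
bc_topic_terms = spark.sparkContext.broadcast(topic_terms)

@F.udf(returnType=ArrayType(StringType()))
def terms_for_topic(topic):
    # Look up the broadcast dict; unknown topics get an empty list.
    return bc_topic_terms.value.get(topic, [])

# df_large and its "topic" column are assumptions about the large dataframe's schema.
df_result = df_large.withColumn("topic_terms", terms_for_topic(F.col("topic")))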

Getting an error when connecting to a local SQL Server database from Databricks via a JDBC connection

房东的猫 submitted on 2021-01-29 17:40:34
Question: Basically I'm trying to connect to a SQL Server database on my local machine from Databricks using a JDBC connection. I'm following the procedure mentioned in the documentation, as shown here on the Databricks website. I used the following code as mentioned on the website: jdbcHostname = "localhost" jdbcDatabase = "TestDB" jdbcPort = "3306" jdbcUrl = "jdbc:mysql://{0}:{1}/{2}".format(jdbcHostname, jdbcPort, jdbcDatabase) connectionProperties = { "jdbcUsername" : "user1", "jdbcPassword" :
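
Two things stand out in the snippet above: the URL and port are MySQL's (jdbc:mysql, 3306) even though the target is SQL Server, and the property keys jdbcUsername/jdbcPassword are not what the JDBC driver expects (it wants user/password). A corrected sketch for SQL Server; the hostname, table name, and credentials are placeholders, and note that "localhost" on a Databricks cluster refers to the driver node, not your workstation, so the database must be reachable from the cluster's network:

jdbcHostname = "<host reachable from the Databricks cluster>"  # not "localhost"
jdbcDatabase = "TestDB"
jdbcPort = 1433  # default SQL Server port

jdbcUrl = "jdbc:sqlserver://{0}:{1};databaseName={2}".format(jdbcHostname, jdbcPort, jdbcDatabase)

connectionProperties = {
    "user": "user1",           # key must be "user", not "jdbcUsername"
    "password": "<password>",  # key must be "password", not "jdbcPassword"
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# "dbo.SomeTable" is a placeholder table name.
df = spark.read.jdbc(url=jdbcUrl, table="dbo.SomeTable", properties=connectionProperties)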

Read CSV file in pyspark with ANSI encoding

荒凉一梦 submitted on 2021-01-29 13:25:54
Question: I am trying to read in a CSV/text file that needs to be read using ANSI encoding. However this is not working. Any ideas? mainDF= spark.read.format("csv")\ .option("encoding","ANSI")\ .option("header","true")\ .option("maxRowsInMemory",1000)\ .option("inferSchema","false")\ .option("delimiter", "¬")\ .load(path) java.nio.charset.UnsupportedCharsetException: ANSI The file is over 5GB, hence the Spark requirement. I have also tried ANSI in lower case. Answer 1: ISO-8859-1 is the same as ANSI
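
The JVM has no charset literally named "ANSI", which is why the UnsupportedCharsetException is thrown; the usual fix is to pass the concrete code page that "ANSI" means on the machine that produced the file, typically windows-1252 (or the closely related ISO-8859-1 the answer mentions). A minimal sketch, keeping the original options:

mainDF = (spark.read.format("csv")
          .option("encoding", "windows-1252")  # or "ISO-8859-1", as suggested in the answer
          .option("header", "true")
          .option("inferSchema", "false")
          .option("delimiter", "¬")
          .load(path))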

How to make the GPU visible to the ML runtime environment on Databricks?

隐身守侯 submitted on 2021-01-29 08:32:10
Question: I am trying to run some TensorFlow (2.2) example code on a Databricks GPU cluster (p2.xlarge) with the following environment: 6.6 ML, Spark 2.4.5, GPU, Scala 2.11; Keras version: 2.2.5; nvidia-smi: NVIDIA-SMI 440.64.00, Driver Version: 440.64.00, CUDA Version: 10.2. I have checked https://docs.databricks.com/applications/deep-learning/single-node-training/tensorflow.html#install-tensorflow-22-on-databricks-runtime-66-ml&language-GPU but I do not want to run the shell commands every time the Databricks GPU cluster is
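
A quick way to confirm whether the runtime actually sees the GPU, independent of how TensorFlow was installed, is to ask TensorFlow for its physical devices. A minimal check that can run in a notebook cell (TensorFlow 2.x API):

import tensorflow as tf

# Lists the GPUs TensorFlow can see; an empty list means the driver/CUDA/TF versions do not line up.
gpus = tf.config.list_physical_devices("GPU")
print("TensorFlow version:", tf.__version__)
print("Visible GPUs:", gpus)

To avoid re-running shell setup commands on every cluster start, cluster-scoped init scripts are the usual Databricks mechanism (not shown here).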

Installing Maven library on Databricks via Python commands and dbutils

只谈情不闲聊 submitted on 2021-01-29 08:07:07
Question: On Databricks I would like to install a Maven library through commands in a Python notebook if it's not already installed. If it were a Python PyPI library I would do something like the following: # Get a list of all available libraries library_name_list = dbutils.library.list() # Suppose the library of interest was "scikit-learn" if "scikit-learn" not in library_name_list: # Install the library dbutils.library.installPyPI("scikit-learn") How can I do the same for a Maven library "com.microsoft
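
dbutils.library does not appear to expose a Maven installer, so one workaround is the cluster Libraries REST API, which does accept Maven coordinates. A hedged sketch using the documented /api/2.0/libraries/install endpoint; the workspace URL, token, cluster id, and the groupId:artifactId:version coordinate below are placeholders, not the asker's actual library:

import requests

DATABRICKS_HOST = "https://<workspace>.azuredatabricks.net"   # placeholder workspace URL
TOKEN = "<personal-access-token>"                             # placeholder token
CLUSTER_ID = "<cluster-id>"                                   # placeholder cluster id
MAVEN_COORDINATE = "com.example:some-library_2.11:1.0.0"      # hypothetical coordinate

# Ask the cluster to install the Maven library.
resp = requests.post(
    DATABRICKS_HOST + "/api/2.0/libraries/install",
    headers={"Authorization": "Bearer " + TOKEN},
    json={
        "cluster_id": CLUSTER_ID,
        "libraries": [{"maven": {"coordinates": MAVEN_COORDINATE}}],
    },
)
resp.raise_for_status()

The companion /api/2.0/libraries/cluster-status endpoint can be queried first to skip the call when the coordinate is already installed on the cluster.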

Databricks dbutils throwing NullPointerException

余生长醉 submitted on 2021-01-29 07:22:02
Question: Trying to read a secret from Azure Key Vault using Databricks dbutils, but facing the following exception: OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0 Warning: Ignoring non-Spark config property: eventLog.rolloverIntervalSeconds Exception in thread "main" java.lang.NullPointerException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect
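
For contrast, the call itself is straightforward when dbutils is actually available (notebooks and jobs launched through Databricks; the NullPointerException above is commonly reported when dbutils is used in a context where Databricks does not inject it, such as a plain spark-submit job). A minimal Python sketch with hypothetical scope and key names:

# Hypothetical scope/key names; the scope must be backed by the Azure Key Vault in question.
secret_value = dbutils.secrets.get(scope="my-keyvault-scope", key="my-secret-name")

# The value is redacted when printed in a notebook, but can be passed on to e.g. a JDBC config.
print(len(secret_value))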

Factor Analysis using sparklyr in Databricks

丶灬走出姿态 submitted on 2021-01-29 06:13:50
Question: I would like to perform a factor analysis by using dplyr::collect() in Databricks, but because of the data's size I am getting this error: Error : org.apache.spark.sql.execution.OutOfMemorySparkException: Total memory usage during row decode exceeds spark.driver.maxResultSize (4.0 GB). The average row size was 82.0 B. Is there a function in sparklyr with which I can do this analysis without collecting the data? Source: https://stackoverflow.com/questions/64113459/factor-analysis-using-sparklyr-in

Efficient way of reading parquet files between a date range in Azure Databricks

柔情痞子 submitted on 2021-01-29 04:31:55
Question: I would like to know if the pseudocode below is an efficient method for reading multiple parquet files between a date range stored in Azure Data Lake from PySpark (Azure Databricks). Note: the parquet files are not partitioned by date. I'm using the uat/EntityName/2019/01/01/EntityName_2019_01_01_HHMMSS.parquet convention for storing data in ADL, as suggested in the book Big Data by Nathan Marz, with a slight modification (using 2019 instead of year=2019). Read all data using the * wildcard: df = spark.read.parquet
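
Since the files are not partitioned by date but the directory layout encodes the date, one common alternative to a bare * wildcard is to build the explicit list of day-level paths for the requested range and hand that list to the reader, so Spark never lists directories outside the range. A minimal sketch, assuming the uat/EntityName/yyyy/MM/dd layout described above and a placeholder ADLS base path:

from datetime import date, timedelta

base_path = "adl://<account>.azuredatalakestore.net/uat/EntityName"  # placeholder ADLS path

def day_paths(start, end):
    # Yield one directory path per day in the inclusive range, matching the yyyy/MM/dd layout.
    d = start
    while d <= end:
        yield "{0}/{1:%Y/%m/%d}".format(base_path, d)
        d += timedelta(days=1)

paths = list(day_paths(date(2019, 1, 1), date(2019, 1, 31)))

# spark.read.parquet accepts multiple paths, so only these directories are listed and scanned.
df = spark.read.parquet(*paths)

If some days in the range may have no directory at all, the list can be filtered (for example with dbutils.fs.ls) before reading, since missing paths would otherwise fail the read.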

Databricks: Equivalent code for SQL query

China☆狼群 submitted on 2021-01-28 18:22:16
Question: I'm looking for the equivalent Databricks code for the query below. I added some sample code and the expected output as well; for the moment I'm stuck on the CROSS APPLY STRING_SPLIT part. Sample SQL data: CREATE TABLE FactTurnover ( ID INT, SalesPriceExcl NUMERIC (9,4), Discount VARCHAR(100) ) INSERT INTO FactTurnover VALUES (1, 100, '10'), (2, 39.5877, '58, 12'), (3, 100, '50, 10, 15'), (4, 100, 'B') Query: ;WITH CTE AS
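
In PySpark the usual counterpart of CROSS APPLY STRING_SPLIT is split plus explode: split the Discount string into an array and explode it so each element becomes its own row. A minimal sketch against a dataframe with the columns from the sample table (factTurnover is an assumed dataframe name; handling of non-numeric values like 'B' is left as in the source data):

from pyspark.sql import functions as F

# factTurnover is assumed to be a DataFrame with the ID, SalesPriceExcl and Discount columns.
exploded = factTurnover.withColumn(
    "Discount", F.explode(F.split(F.col("Discount"), ",\\s*"))
)

# Each source row now appears once per comma-separated discount value,
# which mirrors CROSS APPLY STRING_SPLIT(Discount, ',') in T-SQL.
exploded.show()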