azure-databricks

How can I resolve the “SparkException: Exception thrown in Future.get” issue?

时光怂恿深爱的人放手 submitted on 2021-02-07 09:00:20
Question: I'm working on two PySpark dataframes, doing a left-anti join on them to track everyday changes and then send an email. The first time I tried: diff = Table_a.join(Table_b, [Table_a.col1 == Table_b.col1, Table_a.col2 == Table_b.col2], how='left_anti'). The expected output is a PySpark dataframe with some or no data. This diff dataframe gets its schema from Table_a. The first time I ran it, it showed no data, as expected, with the schema representation. From the next run onwards it just throws …
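For reference, a left-anti join of this shape would normally look like the following minimal PySpark sketch. The table and column names (Table_a, Table_b, col1, col2) are taken from the excerpt above; the toy data is purely illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative data only; the real tables come from the poster's pipeline.
Table_a = spark.createDataFrame([(1, "a"), (2, "b")], ["col1", "col2"])
Table_b = spark.createDataFrame([(1, "a")], ["col1", "col2"])

# Rows of Table_a that have no match in Table_b on (col1, col2).
diff = Table_a.join(
    Table_b,
    [Table_a.col1 == Table_b.col1, Table_a.col2 == Table_b.col2],
    how="left_anti",
)

# diff keeps Table_a's schema and may be empty when nothing has changed.
diff.show()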

How to properly access dbutils in Scala when using Databricks Connect

只愿长相守 submitted on 2021-02-07 02:46:44
Question: I'm using Databricks Connect to run code in my Azure Databricks cluster locally from IntelliJ IDEA (Scala). Everything works fine: I can connect, debug, and inspect locally in the IDE. I created a Databricks Job to run my custom app JAR, but it fails with the following exception: 19/08/17 19:20:26 ERROR Uncaught throwable from user code: java.lang.NoClassDefFoundError: com/databricks/service/DBUtils$ at Main$.<init>(Main.scala:30) at Main$.<clinit>(Main.scala). Line 30 of my Main.scala class is …

Creating a dataframe with a specific schema: StructField starting with a capital letter

拥有回忆 submitted on 2021-01-29 09:58:53
Question: Apologies for the lengthy post for a seemingly simple curiosity, but I wanted to give full context... In Databricks, I am creating a "row" of data based on a specific schema definition, and then inserting that row into an empty dataframe (also based on the same specific schema). The schema definition looks like this: myschema_xb = StructType([ StructField("_xmlns", StringType(), True), StructField("_Version", DoubleType(), True), StructField("MyIds", ArrayType(StructType([ StructField("_ID" …
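A minimal sketch of the pattern described: a schema whose field names start with underscores and capital letters, an empty dataframe built from it, and one row appended via union. The original schema is truncated at _ID, so everything from that field onwards (its type, any later fields) is an assumption here, as are the sample values.

from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, ArrayType,
)

spark = SparkSession.builder.getOrCreate()

# Reconstructed from the excerpt; _ID is assumed to be a string, and any
# fields after it in the real schema are omitted.
myschema_xb = StructType([
    StructField("_xmlns", StringType(), True),
    StructField("_Version", DoubleType(), True),
    StructField("MyIds", ArrayType(StructType([
        StructField("_ID", StringType(), True),
    ])), True),
])

empty_df = spark.createDataFrame([], myschema_xb)

# One "row" with illustrative values, built against the same schema.
row_df = spark.createDataFrame(
    [("http://example.com/ns", 1.0, [("abc",)])],
    myschema_xb,
)

result = empty_df.union(row_df)
result.printSchema()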

Installing Maven library on Databricks via Python commands and dbutils

只谈情不闲聊 submitted on 2021-01-29 08:07:07
Question: On Databricks I would like to install a Maven library through commands in a Python notebook if it's not already installed. If it were a Python PyPI library I would do something like the following: # Get a list of all available libraries library_name_list = dbutils.library.list() # Suppose the library of interest was "scikit-learn" if "scikit-learn" not in library_name_list: # Install the library dbutils.library.installPyPI("scikit-learn") How can I do the same for a Maven library "com.microsoft …
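dbutils.library has no Maven installer, so one workaround people use is the cluster Libraries REST API. The sketch below is an illustration only: the endpoints follow the Databricks Libraries API 2.0, while the host, token, cluster id and Maven coordinates are placeholders you would substitute yourself.

import requests

host = "https://<your-workspace>.azuredatabricks.net"   # placeholder workspace URL
token = "<personal-access-token>"                        # placeholder token
cluster_id = "<cluster-id>"
coordinates = "<groupId>:<artifactId>:<version>"         # placeholder Maven coordinates

headers = {"Authorization": f"Bearer {token}"}

# Ask which libraries are already installed on the cluster.
status = requests.get(
    f"{host}/api/2.0/libraries/cluster-status",
    headers=headers,
    params={"cluster_id": cluster_id},
).json()
installed = {
    s["library"].get("maven", {}).get("coordinates")
    for s in status.get("library_statuses", [])
}

# Install the Maven coordinates only if they are missing.
if coordinates not in installed:
    requests.post(
        f"{host}/api/2.0/libraries/install",
        headers=headers,
        json={
            "cluster_id": cluster_id,
            "libraries": [{"maven": {"coordinates": coordinates}}],
        },
    )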

Databricks dbutils throwing NullPointerException

余生长醉 submitted on 2021-01-29 07:22:02
Question: Trying to read a secret from Azure Key Vault using Databricks dbutils, but facing the following exception: OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0 Warning: Ignoring non-Spark config property: eventLog.rolloverIntervalSeconds Exception in thread "main" java.lang.NullPointerException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect …
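For context, reading a Key Vault-backed secret from inside a notebook normally looks like the sketch below; the scope and key names are placeholders. A NullPointerException like the one above is often a sign that dbutils is being called in a context where it is not initialised (for example a plain JAR job rather than a notebook), but that is only a hypothesis about this particular trace.

# dbutils is injected automatically in a Databricks notebook; "kv-scope" and
# "my-secret" are placeholder names for the secret scope and the secret.
secret_value = dbutils.secrets.get(scope="kv-scope", key="my-secret")

# The value is redacted if printed directly in a notebook; use it where
# needed, e.g. as a JDBC password.
jdbc_password = secret_value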

Efficient way of reading parquet files between a date range in Azure Databricks

柔情痞子 submitted on 2021-01-29 04:31:55
Question: I would like to know if the pseudo code below is an efficient method to read multiple parquet files between a date range stored in Azure Data Lake from PySpark (Azure Databricks). Note: the parquet files are not partitioned by date. I'm using the uat/EntityName/2019/01/01/EntityName_2019_01_01_HHMMSS.parquet convention for storing data in ADL, as suggested in the book Big Data by Nathan Marz, with a slight modification (using 2019 instead of year=2019). Read all data using a * wildcard: df = spark.read.parquet …
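One common alternative under the stated layout is to enumerate the day folders explicitly instead of reading the whole tree with a wildcard and filtering afterwards. The sketch below assumes the uat/EntityName/YYYY/MM/DD/ convention from the excerpt; the storage container and account names are placeholders, and whether this beats a wildcard read depends on how many folders fall inside the range.

from datetime import date, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

base = "abfss://<container>@<account>.dfs.core.windows.net/uat/EntityName"  # placeholder root
start, end = date(2019, 1, 1), date(2019, 1, 7)

# One explicit path per day in the range, matching the YYYY/MM/DD folder layout.
paths = [
    f"{base}/{d:%Y/%m/%d}/"
    for d in (start + timedelta(days=i) for i in range((end - start).days + 1))
]

# Reading explicit paths avoids listing every folder under the root; note that a
# path with no files makes the read fail, so drop missing days first if needed.
df = spark.read.parquet(*paths)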

Appending column name to column value using Spark

柔情痞子 submitted on 2021-01-28 20:05:44
Question: I have data in a comma-separated file, and I have loaded it into a Spark data frame. The data looks like:

A B C
1 2 3
4 5 6
7 8 9

I want to transform the above data frame in Spark using PySpark into:

A B C
A_1 B_2 C_3
A_4 B_5 C_6
--------------

Then convert it to a list of lists using PySpark: [[A_1, B_2, C_3], [A_4, B_5, C_6]], and then run the FP-Growth algorithm using PySpark on the above data set. The code that I have tried is below: from pyspark.sql.functions import col, size from pyspark.sql …
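A sketch of the requested transformation, using a small illustrative dataframe with the same A/B/C columns: each value is prefixed with its column name, the rows are packed into an array column, and that column feeds pyspark.ml's FPGrowth (the support/confidence thresholds are arbitrary).

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 3), (4, 5, 6), (7, 8, 9)], ["A", "B", "C"])

# Prefix every value with its column name, e.g. 1 -> "A_1".
prefixed = df.select(
    [F.concat_ws("_", F.lit(c), F.col(c).cast("string")).alias(c) for c in df.columns]
)

# Pack the prefixed columns into one array column for FP-Growth.
items_df = prefixed.select(F.array(*[F.col(c) for c in prefixed.columns]).alias("items"))

# The list-of-lists form, if it is needed outside Spark:
list_of_lists = [row["items"] for row in items_df.collect()]
# e.g. [['A_1', 'B_2', 'C_3'], ['A_4', 'B_5', 'C_6'], ['A_7', 'B_8', 'C_9']]

# FP-Growth runs directly on the array column.
model = FPGrowth(itemsCol="items", minSupport=0.1, minConfidence=0.1).fit(items_df)
model.freqItemsets.show()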

Databricks: Equivalent code for SQL query

China☆狼群 submitted on 2021-01-28 18:22:16
Question: I'm looking for the equivalent Databricks code for the query. I added some sample code and the expected output as well, but in particular I'm looking for the equivalent code in Databricks for the query. For the moment I'm stuck on the CROSS APPLY STRING_SPLIT part. Sample SQL data: CREATE TABLE FactTurnover ( ID INT, SalesPriceExcl NUMERIC(9,4), Discount VARCHAR(100) ); INSERT INTO FactTurnover VALUES (1, 100, '10'), (2, 39.5877, '58, 12'), (3, 100, '50, 10, 15'), (4, 100, 'B') Query: ;WITH CTE AS …
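For the part the poster is stuck on: the usual PySpark counterpart of CROSS APPLY STRING_SPLIT is split plus explode. The sketch below only recreates the sample FactTurnover rows from the excerpt, since the full CTE query is truncated; the column name DiscountValue is made up for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

fact_turnover = spark.createDataFrame(
    [(1, 100.0, "10"), (2, 39.5877, "58, 12"), (3, 100.0, "50, 10, 15"), (4, 100.0, "B")],
    ["ID", "SalesPriceExcl", "Discount"],
)

# CROSS APPLY STRING_SPLIT(Discount, ',') becomes split + explode:
# one output row per comma-separated token, keeping the other columns.
exploded = fact_turnover.withColumn(
    "DiscountValue",
    F.explode(F.split(F.col("Discount"), ",\\s*")),
)
exploded.show()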

What is the cluster manager used in Databricks? How do I change the number of executors in Databricks clusters?

孤街醉人 submitted on 2021-01-28 04:09:19
Question: What is the cluster manager used in Databricks? How do I change the number of executors in Databricks clusters? Answer 1: What is the cluster manager used in Databricks? Azure Databricks builds on the capabilities of Spark by providing a zero-management cloud platform that includes fully managed Spark clusters, an interactive workspace for exploration and visualization, and a platform for powering your favorite Spark-based applications. The Databricks Runtime is built on top of Apache Spark and is …

Azure Databricks to Azure SQL DW: Long text columns

﹥>﹥吖頭↗ submitted on 2021-01-27 08:21:53
Question: I would like to populate an Azure SQL DW from an Azure Databricks notebook environment. I am using the built-in connector with PySpark: sdf.write.format("com.databricks.spark.sqldw").option("forwardSparkAzureStorageCredentials", "true").option("dbTable", "test_table").option("url", url).option("tempDir", temp_dir).save() This works fine, but I get an error when I include a string column with sufficiently long content. I get the following error: Py4JJavaError: An error …
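One remedy commonly suggested for this symptom is the connector's maxStrLength option, which controls how long the string (NVARCHAR) columns created in the warehouse are; by default they are fairly short, so long text gets rejected. The sketch below is the write call from the question with that option added; the value 4000 is an arbitrary illustration, and sdf, url and temp_dir are assumed to be defined as in the original post.

# Same write as above, with maxStrLength raised so long string columns are not
# created with the default (short) NVARCHAR length. 4000 is only an example value.
sdf.write \
    .format("com.databricks.spark.sqldw") \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", "test_table") \
    .option("url", url) \
    .option("tempDir", temp_dir) \
    .option("maxStrLength", "4000") \
    .save()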