azure-databricks

Is there a way to load multiple text files into a single dataframe using Databricks?

匆匆过客 submitted on 2019-12-11 14:32:09
Question: I am trying to test a few ideas for recursively looping through all files in a folder and its sub-folders and loading everything into a single DataFrame. I have 12 different kinds of files, distinguished by their file naming conventions: file names that start with 'ABC', file names that start with 'CN', file names that start with 'CZ', and so on. I tried the following three ideas.

import pyspark
import os.path
from pyspark.sql import SQLContext
from pyspark.sql.functions import
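The question above is cut off, but since Spark's file readers accept lists of paths and glob patterns, the usual approach is to pass all the naming conventions in one read call rather than looping. Below is a minimal sketch, not the asker's actual code; the mount point, sub-folder depth and prefixes are placeholder assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.getOrCreate()

# Hypothetical layout: /mnt/raw/<subfolder>/<PREFIX>*.txt
base = "/mnt/raw"
prefixes = ["ABC", "CN", "CZ"]  # extend to all 12 naming conventions

# One glob per prefix; '*' matches a single path level, so add more '/*'
# segments (or use recursiveFileLookup on newer runtimes) for deeper nesting.
paths = [f"{base}/*/{prefix}*.txt" for prefix in prefixes]

df = (spark.read
      .text(paths)                                    # a list of paths/globs is accepted
      .withColumn("source_file", input_file_name()))  # keep track of where each row came from

df.count()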

Azure Databricks Stream foreach fails with NotSerializableException

非 Y 不嫁゛ submitted on 2019-12-11 14:17:09
Question: I want to continuously process rows of a streaming Dataset (originally fed by Kafka): based on a condition I want to update a Redis hash. This is my code snippet (lastContacts is the result of a previous command, which is a stream of this type: org.apache.spark.sql.DataFrame = [serialNumber: string, lastModified: long]. This expands to org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]):

class MyStreamProcessor extends ForeachWriter[Row] {
  override def open(partitionId: Long,
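The snippet above is Scala and is truncated, but the usual cause of a NotSerializableException here is that the ForeachWriter (or its enclosing class) captures a non-serializable object such as a Redis client in its closure. A hedged sketch of the same pattern in PySpark, creating the connection inside open() on the executor instead of on the driver; the redis-py package and host name are assumptions, not part of the original question.

class RedisForeachWriter:
    """Creates the Redis client per partition inside open(), so nothing
    non-serializable has to be shipped from the driver to the executors."""

    def open(self, partition_id, epoch_id):
        import redis  # imported lazily on the executor; assumes redis-py is installed
        self.client = redis.Redis(host="my-redis-host", port=6379)  # hypothetical host
        return True

    def process(self, row):
        # Example condition: only record devices that have a modification timestamp.
        if row["lastModified"] > 0:
            self.client.hset("lastContacts", row["serialNumber"], row["lastModified"])

    def close(self, error):
        pass  # redis-py manages its own connection pool

# lastContacts is the streaming DataFrame from the question:
# query = lastContacts.writeStream.foreach(RedisForeachWriter()).start()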

How to set up a starting point for the batchId of foreachBatch?

我们两清 submitted on 2019-12-11 14:16:59
Question: The problem I am facing is that my process relies on the batchId of foreachBatch as a kind of control over what is ready for the second stage of the pipeline, so data only moves to the second stage once the first stage (batch) has completed. I want to guarantee that if something goes wrong, the stream can continue from where it stopped. We tried to keep track by adding all completed batches to a Delta table; however, I couldn't find a way to set the initial batchId. Answer 1:
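The answer itself is cut off above. For context, the batchId that foreachBatch receives is derived from the query's checkpoint and cannot be seeded to an arbitrary starting value, so a common workaround is to maintain your own sequence in the control Delta table from inside foreachBatch. The sketch below illustrates that idea; it assumes a Databricks notebook where spark is predefined, and the paths and column names are placeholder assumptions rather than the answer's actual code.

from pyspark.sql import functions as F

CONTROL_PATH = "/mnt/sandbox/control/batches"   # hypothetical control table location

def process_batch(batch_df, batch_id):
    # Derive our own run id: last recorded id + 1, independent of Spark's batch_id.
    try:
        last = (spark.read.format("delta").load(CONTROL_PATH)
                     .agg(F.max("run_id")).first()[0]) or 0
    except Exception:   # first run: the control table does not exist yet
        last = 0
    run_id = last + 1

    # First stage: persist the batch data.
    batch_df.write.format("delta").mode("append").save("/mnt/sandbox/operations")

    # Only after the write succeeds, mark this run as completed.
    (spark.createDataFrame([(run_id, batch_id)], ["run_id", "spark_batch_id"])
          .write.format("delta").mode("append").save(CONTROL_PATH))

# stream.writeStream.foreachBatch(process_batch) \
#       .option("checkpointLocation", "/mnt/sandbox/control/_chk").start()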

Databricks read Azure blob last modified date

假装没事ソ submitted on 2019-12-11 09:55:14
Question: I have an Azure Blob storage container mounted to my Databricks file system. Is there a way to get the last modified date of the blob in Databricks? This is how I'm reading the blob content:

val df = spark.read
  .option("header", "false")
  .option("inferSchema", "false")
  .option("delimiter", ",")
  .csv("/mnt/test/*")

Answer 1: Generally, there are two ways to read an Azure Blob's last modified date, as below. Directly read it via the Azure Storage REST API or the Azure Storage SDK for Java. After I researched Azure Blob
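The answer above mentions the REST API and the Java SDK; an equivalent hedged sketch using the Python azure-storage-blob package (v12), which is assumed to be installed on the cluster, with placeholder account, container and blob names:

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://<storage-account>.blob.core.windows.net",
    credential="<account-key>",          # or a SAS token / AAD credential
)

blob = service.get_blob_client(container="test", blob="somefile.csv")
props = blob.get_blob_properties()       # one HEAD request against the blob
print(props.last_modified)               # timezone-aware datetime of the last write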

How to run a SQL statement from a Databricks cluster

你离开我真会死。 submitted on 2019-12-11 07:05:27
Question: I have an Azure Databricks cluster that processes various tables, and as a final step I push these tables into an Azure SQL Server to be used by some other processes. I have a cell in Databricks that looks something like this:

def generate_connection():
    jdbcUsername = dbutils.secrets.get(scope = "Azure-Key-Vault-Scope", key = "AzureSqlUserName")
    jdbcPassword = dbutils.secrets.get(scope = "Azure-Key-Vault-Scope", key = "AzureSqlPassword")
    connectionProperties = {
        "user" : jdbcUsername,
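The question is about running plain SQL statements (not just bulk reads and writes) against Azure SQL from the notebook. Spark's JDBC reader/writer only moves data, so one common approach is to go through the JVM's java.sql.DriverManager via py4j. The sketch below illustrates that idea inside a Databricks notebook where spark and dbutils are predefined; the server name, database and statement are placeholders, not taken from the question.

# Build the JDBC URL from placeholder values.
jdbcHostname = "myserver.database.windows.net"
jdbcDatabase = "mydb"
jdbcUrl = f"jdbc:sqlserver://{jdbcHostname}:1433;database={jdbcDatabase}"

jdbcUsername = dbutils.secrets.get(scope="Azure-Key-Vault-Scope", key="AzureSqlUserName")
jdbcPassword = dbutils.secrets.get(scope="Azure-Key-Vault-Scope", key="AzureSqlPassword")

# Borrow the JVM's DriverManager, which already has the SQL Server JDBC driver loaded.
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
connection = driver_manager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
try:
    stmt = connection.createStatement()
    stmt.execute("EXEC dbo.refresh_reporting_tables")  # hypothetical statement to run
    stmt.close()
finally:
    connection.close()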

Why does “databricks-connect test” not work after configuring Databricks Connect?

て烟熏妆下的殇ゞ submitted on 2019-12-11 03:11:47
Question: I want to run my Spark processes directly on my cluster using IntelliJ IDEA, so I'm following this documentation: https://docs.azuredatabricks.net/user-guide/dev-tools/db-connect.html After configuring everything, I run databricks-connect test but I don't get the Scala REPL that the documentation says should appear. This is my cluster configuration. Answer 1: Your problem looks like it is one of the following: a) You specified the wrong port (it has to be 8787 on Azure) b) You didn't open up the port in you
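For context, databricks-connect configure stores its settings in a ~/.databricks-connect JSON file, and the port the answer refers to goes into that file's port field. A hedged sketch of what the file looks like; every value below is a placeholder:

{
  "host": "https://<region>.azuredatabricks.net",
  "token": "<personal-access-token>",
  "cluster_id": "<cluster-id>",
  "org_id": "<org-id from the workspace URL>",
  "port": "8787"
}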

streaming aggregate not writing into sink

情到浓时终转凉″ submitted on 2019-12-08 06:05:12
Question: I have to process some files which arrive daily. The information has a primary key of (date, client_id, operation_id), so I created a stream which appends only new data into a Delta table:

operations\
    .repartition('date')\
    .writeStream\
    .outputMode('append')\
    .trigger(once=True)\
    .option("checkpointLocation", "/mnt/sandbox/operations/_chk")\
    .format('delta')\
    .partitionBy('date')\
    .start('/mnt/sandbox/operations')

This is working fine, but I need to summarize this information grouped by (date
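The question is truncated at the grouping key, but the usual reason a streaming aggregate "writes nothing" in append mode is that, without a watermark, aggregated rows are never finalized and therefore never emitted. A minimal sketch of one way around that, writing the summary with complete output mode to a separate Delta path; the grouping columns and paths are assumptions based on the question's primary key, not the asker's code.

summary = (operations
           .groupBy('date', 'client_id', 'operation_id')
           .count())

(summary.writeStream
        .outputMode('complete')   # aggregations without a watermark need complete mode
        .trigger(once=True)
        .option('checkpointLocation', '/mnt/sandbox/operations_summary/_chk')
        .format('delta')
        .start('/mnt/sandbox/operations_summary'))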

Is a Managed Resource Group mandatory for creating Azure Databricks?

好久不见. submitted on 2019-12-07 23:54:44
Question: While creating an Azure Databricks workspace, the managed resource group is created automatically with resources (VNet, NSG and storage account). My question is: is it possible to create Azure Databricks without a managed resource group? If not, can we use our existing resources (like a VNet, NSG and storage account)? I have tried creating Azure Databricks with the REST API with an empty managed resource group, but I am not able to sign in when launching the workspace. Answer 1: The managed resource group must exist as

Create External table in Azure Databricks

痞子三分冷 submitted on 2019-12-07 11:55:33
Question: I am new to Azure Databricks and am trying to create an external table pointing to an Azure Data Lake Storage (ADLS) Gen2 location. From a Databricks notebook I have tried to set the Spark configuration for ADLS access, but I am still unable to execute the DDL I created. Note: One solution that works for me is mounting the ADLS account to the cluster and then using the mount location in the external table's DDL. But I needed to check whether it is possible to create an external table DDL with an ADLS path without mount
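The question is truncated, but for context, the typical no-mount approach is to authenticate the ABFS driver with a service principal and then point the DDL at an abfss:// path. The sketch below shows the session-level settings and a CREATE TABLE over such a path, assuming a Databricks notebook where spark and dbutils are predefined; the storage account, service principal values and table definition are all placeholders. If the DDL still fails with session-level settings, the same keys are usually moved into the cluster's Spark config (prefixed with spark.hadoop.) rather than being set per notebook.

# Session-level OAuth configuration for ADLS Gen2 (placeholder values throughout).
account = "<storage-account>"
spark.conf.set(f"fs.azure.account.auth.type.{account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}.dfs.core.windows.net", "<app-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}.dfs.core.windows.net",
               dbutils.secrets.get(scope="<scope>", key="<secret-key>"))
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

# External table DDL that references the ADLS path directly, no mount involved.
spark.sql(f"""
  CREATE TABLE IF NOT EXISTS sales_ext
  USING PARQUET
  LOCATION 'abfss://<container>@{account}.dfs.core.windows.net/path/to/data'
""")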

Generate Azure Databricks Token using PowerShell script

旧街凉风 submitted on 2019-12-06 05:41:20
Question: I need to generate an Azure Databricks token using a PowerShell script. I am done with the creation of Azure Databricks using an ARM template; now I am looking to generate a Databricks token using a PowerShell script. Kindly let me know how to create a Databricks token using a PowerShell script. Answer 1: The only way to generate a new token is via the API, which requires you to have a token in the first place, or to use the web UI manually. There are no official PowerShell commands for Databricks; there are some
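For context, the API the answer refers to is the Databricks Token API (POST /api/2.0/token/create), which must be authenticated with an existing token. A hedged sketch of that call is shown below in Python with placeholder values; the same request can be issued from PowerShell with Invoke-RestMethod.

import requests

workspace_url = "https://<region>.azuredatabricks.net"   # placeholder workspace URL
existing_token = "<existing-personal-access-token>"      # bootstrap credential

resp = requests.post(
    f"{workspace_url}/api/2.0/token/create",
    headers={"Authorization": f"Bearer {existing_token}"},
    json={"lifetime_seconds": 3600, "comment": "generated from a script"},
)
resp.raise_for_status()
new_token = resp.json()["token_value"]   # the newly minted personal access token
print(new_token)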