azure-databricks

PySpark saving is not working when called from inside a foreach

Submitted by 泄露秘密 on 2019-12-24 06:38:28
Question: I am building a pipeline that receives messages from Azure EventHub and saves them into Databricks Delta tables. All my tests with static data went well; see the code below:

    body = 'A|B|C|D\n"False"|"253435564"|"14"|"2019-06-25 04:56:21.713"\n"True"|"253435564"|"13"|"2019-06-25 04:56:21.713"\n"
    tableLocation = "/delta/tables/myTableName"
    spark = SparkSession.builder.appName("CSV converter").getOrCreate()
    csvData = spark.sparkContext.parallelize(body.split('\n'))
    df = spark.read \
        .option("header",
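
A common reason a write appears to do nothing inside foreach is that SparkSession and DataFrame operations are not available on the executors where foreach runs. With Structured Streaming the usual workaround is foreachBatch, which hands each micro-batch to the driver as an ordinary DataFrame. Below is a minimal sketch along those lines, assuming `events` is the streaming DataFrame read from Event Hubs and that each message body holds a small pipe-delimited CSV payload (names and paths are illustrative, not from the original post):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("CSV converter").getOrCreate()
    tableLocation = "/delta/tables/myTableName"

    def write_batch(batch_df, batch_id):
        # foreachBatch runs on the driver, so the full DataFrame API is usable here.
        for row in batch_df.selectExpr("CAST(body AS STRING) AS body").collect():
            # Each message carries a small CSV payload, so collecting the bodies is cheap.
            csvData = spark.sparkContext.parallelize(row["body"].split("\n"))
            df = spark.read.option("header", True).option("sep", "|").csv(csvData)
            df.write.format("delta").mode("append").save(tableLocation)

    # events.writeStream.foreachBatch(write_batch).start()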

How to redirect logs from Azure Databricks to another destination?

Submitted by 本小妞迷上赌 on 2019-12-24 03:23:49
Question: We could use some help on how to send Spark driver and worker logs to a destination outside Azure Databricks, e.g. Azure Blob storage or Elasticsearch using Elastic Beats. When configuring a new cluster, the only option we get for the log delivery destination is DBFS, see https://docs.azuredatabricks.net/user-guide/clusters/log-delivery.html. Any input much appreciated, thanks!

Answer 1: Maybe the following could be helpful: First you specify a dbfs location for your Spark driver and worker
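
The approach the answer goes on to describe is to let the cluster deliver its logs to a DBFS folder first and then ship them onward to the real destination yourself. A hedged sketch of that second step, copying delivered driver logs from the DBFS log path to an Azure Blob container mounted in the workspace (the cluster id and mount point are placeholders, not from the original answer):

    # dbutils is provided by the Databricks notebook runtime.
    # Assumes cluster log delivery is configured to dbfs:/cluster-logs and a Blob
    # container is already mounted at /mnt/exported-logs.
    log_root = "dbfs:/cluster-logs/<cluster-id>/driver"      # placeholder cluster id
    target_root = "dbfs:/mnt/exported-logs/driver"

    for f in dbutils.fs.ls(log_root):
        dbutils.fs.cp(f.path, target_root + "/" + f.name)    # copy each delivered log file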

How do you process many files from a blob storage with long paths in databricks?

Submitted by 房东的猫 on 2019-12-23 12:47:44
Question: I've enabled logging for an API Management service and the logs are being stored in a storage account. Now I'm trying to process them in an Azure Databricks workspace, but I'm struggling with accessing the files. The issue seems to be that the automatically generated virtual folder structure looks like this: /insights-logs-gatewaylogs/resourceId=/SUBSCRIPTIONS/<subscription>/RESOURCEGROUPS/<resource group>/PROVIDERS/MICROSOFT.APIMANAGEMENT/SERVICE/<api service>/y=*/m=*/d=*/h=*/m=00/PT1H.json I
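
One workable angle, assuming the generated layout above stays stable, is to point Spark at the container with glob wildcards for every variable path segment; Spark's file readers expand these patterns, so the long virtual folder structure does not have to be walked by hand. A minimal sketch (the mount point is a placeholder):

    # Assumes the insights-logs-gatewaylogs container is mounted at /mnt/apim-logs.
    base = ("/mnt/apim-logs/resourceId=/SUBSCRIPTIONS/*/RESOURCEGROUPS/*/PROVIDERS/"
            "MICROSOFT.APIMANAGEMENT/SERVICE/*/y=*/m=*/d=*/h=*/m=00/PT1H.json")

    logs = spark.read.json(base)   # the glob pattern is expanded by Spark's file listing
    logs.printSchema()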

Delete Azure SQL database rows from Azure Databricks

Submitted by 梦想与她 on 2019-12-23 04:54:09
Question: I have a table in an Azure SQL database from which I want to either delete selected rows based on some criteria, or the entire table, from Azure Databricks. Currently I am using the truncate property of JDBC to truncate the entire table without dropping it and then re-write it with a new dataframe.

    df.write \
        .option('user', jdbcUsername) \
        .option('password', jdbcPassword) \
        .jdbc('<connection_string>', '<table_name>', mode = 'overwrite', properties = {'truncate' : 'true'})

But going forward I don't
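
Spark's JDBC writer can only append or overwrite; it has no row-level DELETE. A workaround that is often suggested is to open a plain JDBC connection through the driver's JVM and issue the DELETE statement directly, reusing the SQL Server driver already on the cluster and the jdbcUsername/jdbcPassword from the snippet above. A hedged sketch (the connection string, table and predicate are placeholders):

    jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"   # placeholder

    driver_manager = spark.sparkContext._gateway.jvm.java.sql.DriverManager
    conn = driver_manager.getConnection(jdbc_url, jdbcUsername, jdbcPassword)
    try:
        stmt = conn.createStatement()
        stmt.executeUpdate("DELETE FROM <table_name> WHERE <your_criteria>")   # placeholder criteria
        stmt.close()
    finally:
        conn.close()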

Is it possible to read an Azure Databricks table from Azure Data Factory?

Submitted by 三世轮回 on 2019-12-23 04:24:18
Question: I have a table in an Azure Databricks cluster, and I would like to replicate this data into an Azure SQL Database so that other users can analyze it from Metabase. Is it possible to access Databricks tables through Azure Data Factory?

Answer 1: No, unfortunately not. Databricks tables are typically temporary and last only as long as your job/session is running. See here. You would need to persist your Databricks table to some storage in order to access it. Change your Databricks job to dump the table
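
A minimal sketch of the persist-to-storage step the answer describes, dumping the Databricks table to a Parquet folder that Azure Data Factory can then pick up as a dataset (the table name and mount point are placeholders):

    # Write the table to a location outside the cluster, e.g. a Blob container
    # mounted at /mnt/export, so Data Factory can copy it onward.
    df = spark.table("my_databricks_table")                    # placeholder table name
    df.write.mode("overwrite").parquet("/mnt/export/my_databricks_table")

From there, a Data Factory copy activity can load the Parquet files into the Azure SQL Database that Metabase reads.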

How to TRUNCATE and/or use wildcards with Databricks

Submitted by 空扰寡人 on 2019-12-17 21:14:54
Question: I'm trying to write a script in Databricks that will select a file based on certain characters in the file name, or just on the datestamp in it. For example, one of the files looks as follows:

    LCMS_MRD_Delta_LoyaltyAccount_1992_2018-12-22 06-07-31

I have created the following code in Databricks:

    import datetime
    now1 = datetime.datetime.now()
    now = now1.strftime("%Y-%m-%d")

Using the above code I tried to select the file using the following: LCMS_MRD_Delta_LoyaltyAccount_1992_%s.csv'
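
Continuing the idea in the excerpt, the formatted date can be interpolated into a glob pattern so the time portion of the file name does not have to be known in advance. A minimal sketch (the directory is a placeholder, and the trailing * absorbs the "06-07-31"-style time suffix):

    import datetime

    now = datetime.datetime.now().strftime("%Y-%m-%d")

    path = "/mnt/data/LCMS_MRD_Delta_LoyaltyAccount_1992_%s*" % now   # placeholder directory
    df = spark.read.option("header", True).csv(path)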

How to write / writeStream each row of a dataframe into a different delta table

Submitted by 南笙酒味 on 2019-12-13 18:38:14
Question: Each row of my dataframe contains CSV content. I am struggling to save each row into a different, specific table. I believe I need to use a foreach or a UDF to accomplish this, but it is simply not working. All the content I managed to find amounted to simple prints inside foreach calls, or code using .collect() (which I really don't want to use). I also found the repartition approach, but that doesn't let me choose where each row will go.

    rows = df.count()
    df.repartition(rows).write.csv(
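
Writes cannot be issued from the executors inside foreach; one pattern that avoids both foreach and collecting the data is to derive the destination table name as a column, then loop over the distinct names on the driver and write each filtered slice. Only the short list of names is collected; the rows themselves stay distributed. A hedged sketch, assuming a `table_name` column can be computed from the CSV content (names and paths are illustrative):

    target_tables = [r["table_name"] for r in df.select("table_name").distinct().collect()]

    for name in target_tables:
        # Each slice is written to its own Delta location.
        (df.filter(df["table_name"] == name)
           .write.format("delta")
           .mode("append")
           .save("/delta/tables/%s" % name))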

How to get an SQL database connection compatible with the DBI::dbGetQuery function when converting between an R script and a Databricks R notebook?

Submitted by 爷,独闯天下 on 2019-12-13 03:48:38
Question: I have an R script that uses odbc::dbConnect to connect to an SQL database (some databases are Azure, some are on-premises but connected to the Azure VPNs via the company's network, though I don't have any understanding of the network infrastructure itself) and then uses DBI::dbGetQuery to run a series of fairly complicated SQL queries and store the results as R dataframes, which can be manipulated and fed into my models. Because of insufficient memory on my local PC to run the script, I am
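
For reference, the Spark-native route to the same end result (push a heavy query down to the database and get a small local dataframe back for modelling) is a JDBC read with the query wrapped as a subquery, followed by a conversion to a local dataframe. A hedged PySpark sketch with placeholder connection details; sparklyr and DBI differ in syntax, but the underlying JDBC options are the same:

    jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"    # placeholder

    pushdown = "(SELECT <columns> FROM <table_name> WHERE <criteria>) AS q"           # placeholder query

    result = (spark.read
        .format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", pushdown)        # the subquery runs on the database side
        .option("user", "<username>")
        .option("password", "<password>")
        .load())

    local_df = result.toPandas()            # small results only; this pulls everything to the driver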

Append only new aggregates based on groupby keys

Submitted by Deadly on 2019-12-11 19:47:21
Question: I have to process some files which arrive daily. The information has a primary key of (date, client_id, operation_id). So I created a stream which appends only new data into a delta table:

    operations \
        .repartition('date') \
        .writeStream \
        .outputMode('append') \
        .trigger(once=True) \
        .option("checkpointLocation", "/mnt/sandbox/operations/_chk") \
        .format('delta') \
        .partitionBy('date') \
        .start('/mnt/sandbox/operations')

This is working fine, but I need to summarize this information grouped by (date
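
A second streaming query can maintain the summary. Plain append mode cannot revise aggregate rows that already exist, so the usual pattern is foreachBatch combined with a Delta MERGE keyed on the group-by columns. A hedged sketch, assuming the delta.tables Python API is available on the cluster and the summary table already exists (the summary path and aggregate columns are illustrative):

    from delta.tables import DeltaTable
    from pyspark.sql import functions as F

    def upsert_summary(batch_df, batch_id):
        # Aggregate the micro-batch, then merge it into the summary table on its key.
        agg = batch_df.groupBy("date", "client_id") \
                      .agg(F.count("operation_id").alias("operations"))
        summary = DeltaTable.forPath(spark, "/mnt/sandbox/operations_summary")
        (summary.alias("s")
            .merge(agg.alias("a"), "s.date = a.date AND s.client_id = a.client_id")
            .whenMatchedUpdate(set={"operations": "s.operations + a.operations"})
            .whenNotMatchedInsertAll()
            .execute())

    (spark.readStream.format("delta").load("/mnt/sandbox/operations")
        .writeStream
        .trigger(once=True)
        .option("checkpointLocation", "/mnt/sandbox/operations_summary/_chk")
        .foreachBatch(upsert_summary)
        .start())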

“No suitable driver” error when using sparklyr::spark_read_jdbc to query Azure SQL database from Azure Databricks

Submitted by 馋奶兔 on 2019-12-11 15:54:00
Question: I am trying to connect to an Azure SQL DB from a Databricks notebook using the sparklyr::spark_read_jdbc function. I am an analyst with no computer science background (beyond R and SQL) and no previous experience using Spark or JDBC (I have previously used local instances of R to connect to the same SQL database via ODBC), so I apologise if I've misunderstood something vital. My code is:

    sc <- spark_connect(method = "databricks")
    library(sparklyr)
    library(dplyr)
    config <- spark_config()
    db_tbl <-
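
The "No suitable driver" message usually means no JDBC driver class was matched to the URL; naming the SQL Server driver class explicitly in the connection options is the fix most answers converge on. The PySpark equivalent of that fix is sketched below (sparklyr takes the same driver entry in the options list passed to spark_read_jdbc; URL, table and credentials are placeholders):

    connection_properties = {
        "user": "<username>",
        "password": "<password>",
        # Naming the driver class explicitly is what clears "No suitable driver".
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }

    df = spark.read.jdbc(
        url="jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>",
        table="<table_name>",
        properties=connection_properties,
    )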