azure-databricks

PySpark saving is not working when called from inside a foreach

Submitted by 泄露秘密 on 2019-12-24 06:38:28
Question: I am building a pipeline that receives messages from Azure EventHub and saves them into Databricks Delta tables. All my tests with static data went well; see the code below:

    body = 'A|B|C|D\n"False"|"253435564"|"14"|"2019-06-25 04:56:21.713"\n"True"|"253435564"|"13"|"2019-06-25 04:56:21.713"\n"
    tableLocation = "/delta/tables/myTableName"
    spark = SparkSession.builder.appName("CSV converter").getOrCreate()
    csvData = spark.sparkContext.parallelize(body.split('\n'))
    df = spark.read \
        .option("header",
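
A common reason a write appears to do nothing inside foreach is that SparkSession and DataFrame operations are not available on the executors where foreach runs. With Structured Streaming the usual workaround is foreachBatch, which hands each micro-batch to the driver as an ordinary DataFrame. Below is a minimal sketch along those lines, assuming `events` is the streaming DataFrame read from Event Hubs and that each message body holds a small pipe-delimited CSV payload (names and paths are illustrative, not from the original post):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("CSV converter").getOrCreate()
    tableLocation = "/delta/tables/myTableName"

    def write_batch(batch_df, batch_id):
        # foreachBatch runs on the driver, so the full DataFrame API is usable here.
        for row in batch_df.selectExpr("CAST(body AS STRING) AS body").collect():
            # Each message carries a small CSV payload, so collecting the bodies is cheap.
            csvData = spark.sparkContext.parallelize(row["body"].split("\n"))
            df = spark.read.option("header", True).option("sep", "|").csv(csvData)
            df.write.format("delta").mode("append").save(tableLocation)

    # events.writeStream.foreachBatch(write_batch).start()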

How to redirect logs from Azure Databricks to another destination?

Submitted by 本小妞迷上赌 on 2019-12-24 03:23:49
Question: We could use some help on how to send Spark driver and worker logs to a destination outside Azure Databricks, e.g. Azure Blob storage or Elasticsearch using Elastic Beats. When configuring a new cluster, the only option we get for the log delivery destination is DBFS, see https://docs.azuredatabricks.net/user-guide/clusters/log-delivery.html. Any input much appreciated, thanks!

Answer 1: Maybe the following could be helpful: First you specify a dbfs location for your Spark driver and worker
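
The approach the answer goes on to describe is to let the cluster deliver its logs to a DBFS folder first and then ship them onward to the real destination yourself. A hedged sketch of that second step, copying delivered driver logs from the DBFS log path to an Azure Blob container mounted in the workspace (the cluster id and mount point are placeholders, not from the original answer):

    # dbutils is provided by the Databricks notebook runtime.
    # Assumes cluster log delivery is configured to dbfs:/cluster-logs and a Blob
    # container is already mounted at /mnt/exported-logs.
    log_root = "dbfs:/cluster-logs/<cluster-id>/driver"      # placeholder cluster id
    target_root = "dbfs:/mnt/exported-logs/driver"

    for f in dbutils.fs.ls(log_root):
        dbutils.fs.cp(f.path, target_root + "/" + f.name)    # copy each delivered log file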

How do you process many files from a blob storage with long paths in databricks?

Submitted by 房东的猫 on 2019-12-23 12:47:44
Question: I've enabled logging for an API Management service and the logs are being stored in a storage account. Now I'm trying to process them in an Azure Databricks workspace, but I'm struggling with accessing the files. The issue seems to be that the automatically generated virtual folder structure looks like this: /insights-logs-gatewaylogs/resourceId=/SUBSCRIPTIONS/<subscription>/RESOURCEGROUPS/<resource group>/PROVIDERS/MICROSOFT.APIMANAGEMENT/SERVICE/<api service>/y=*/m=*/d=*/h=*/m=00/PT1H.json I
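
One workable angle, assuming the generated layout above stays stable, is to point Spark at the container with glob wildcards for every variable path segment; Spark's file readers expand these patterns, so the long virtual folder structure does not have to be walked by hand. A minimal sketch (the mount point is a placeholder):

    # Assumes the insights-logs-gatewaylogs container is mounted at /mnt/apim-logs.
    base = ("/mnt/apim-logs/resourceId=/SUBSCRIPTIONS/*/RESOURCEGROUPS/*/PROVIDERS/"
            "MICROSOFT.APIMANAGEMENT/SERVICE/*/y=*/m=*/d=*/h=*/m=00/PT1H.json")

    logs = spark.read.json(base)   # the glob pattern is expanded by Spark's file listing
    logs.printSchema()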

Delete Azure SQL database rows from Azure Databricks

Submitted by 梦想与她 on 2019-12-23 04:54:09
Question: I have a table in an Azure SQL database from which I want to either delete selected rows based on some criteria, or the entire table, from Azure Databricks. Currently I am using the truncate property of JDBC to truncate the entire table without dropping it and then re-write it with a new dataframe.

    df.write \
        .option('user', jdbcUsername) \
        .option('password', jdbcPassword) \
        .jdbc('<connection_string>', '<table_name>', mode = 'overwrite', properties = {'truncate' : 'true'})

But going forward I don't
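
Spark's JDBC writer can only append or overwrite; it has no row-level DELETE. A workaround that is often suggested is to open a plain JDBC connection through the driver's JVM and issue the DELETE statement directly, reusing the SQL Server driver already on the cluster and the jdbcUsername/jdbcPassword from the snippet above. A hedged sketch (the connection string, table and predicate are placeholders):

    jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"   # placeholder

    driver_manager = spark.sparkContext._gateway.jvm.java.sql.DriverManager
    conn = driver_manager.getConnection(jdbc_url, jdbcUsername, jdbcPassword)
    try:
        stmt = conn.createStatement()
        stmt.executeUpdate("DELETE FROM <table_name> WHERE <your_criteria>")   # placeholder criteria
        stmt.close()
    finally:
        conn.close()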

Is it possible to read an Azure Databricks table from Azure Data Factory?

Submitted by 三世轮回 on 2019-12-23 04:24:18
Question: I have a table in an Azure Databricks cluster, and I would like to replicate this data into an Azure SQL Database so that other users can analyze it from Metabase. Is it possible to access Databricks tables through Azure Data Factory?

Answer 1: No, unfortunately not. Databricks tables are typically temporary and last only as long as your job/session is running. See here. You would need to persist your Databricks table to some storage in order to access it. Change your Databricks job to dump the table
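
A minimal sketch of the persist-to-storage step the answer describes, dumping the Databricks table to a Parquet folder that Azure Data Factory can then pick up as a dataset (the table name and mount point are placeholders):

    # Write the table to a location outside the cluster, e.g. a Blob container
    # mounted at /mnt/export, so Data Factory can copy it onward.
    df = spark.table("my_databricks_table")                    # placeholder table name
    df.write.mode("overwrite").parquet("/mnt/export/my_databricks_table")

From there, a Data Factory copy activity can load the Parquet files into the Azure SQL Database that Metabase reads.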

How to TRUNCATE and/or use wildcards with Databricks

Submitted by 空扰寡人 on 2019-12-17 21:14:54
Question: I'm trying to write a script in Databricks that will select a file based on certain characters in the file name, or just on the datestamp in it. For example, one of the files looks as follows:

    LCMS_MRD_Delta_LoyaltyAccount_1992_2018-12-22 06-07-31

I have created the following code in Databricks:

    import datetime
    now1 = datetime.datetime.now()
    now = now1.strftime("%Y-%m-%d")

Using the above code I tried to select the file using the following: LCMS_MRD_Delta_LoyaltyAccount_1992_%s.csv'
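
Continuing the idea in the excerpt, the formatted date can be interpolated into a glob pattern so the time portion of the file name does not have to be known in advance. A minimal sketch (the directory is a placeholder, and the trailing * absorbs the "06-07-31"-style time suffix):

    import datetime

    now = datetime.datetime.now().strftime("%Y-%m-%d")

    path = "/mnt/data/LCMS_MRD_Delta_LoyaltyAccount_1992_%s*" % now   # placeholder directory
    df = spark.read.option("header", True).csv(path)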

How to write / writeStream each row of a dataframe into a different delta table

Submitted by 南笙酒味 on 2019-12-13 18:38:14
Question: Each row of my dataframe contains CSV content. I am struggling to save each row into a different, specific table. I believe I need to use a foreach or a UDF to accomplish this, but it is simply not working. All the content I managed to find amounted to simple prints inside foreach calls, or code using .collect() (which I really don't want to use). I also found the repartition approach, but that doesn't let me choose where each row will go.

    rows = df.count()
    df.repartition(rows).write.csv(
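
Writes cannot be issued from the executors inside foreach; one pattern that avoids both foreach and collecting the data is to derive the destination table name as a column, then loop over the distinct names on the driver and write each filtered slice. Only the short list of names is collected; the rows themselves stay distributed. A hedged sketch, assuming a `table_name` column can be computed from the CSV content (names and paths are illustrative):

    target_tables = [r["table_name"] for r in df.select("table_name").distinct().collect()]

    for name in target_tables:
        # Each slice is written to its own Delta location.
        (df.filter(df["table_name"] == name)
           .write.format("delta")
           .mode("append")
           .save("/delta/tables/%s" % name))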

How to get an SQL database connection compatible with the DBI::dbGetQuery function when converting between an R script and a Databricks R notebook?

Submitted by 爷,独闯天下 on 2019-12-13 03:48:38
Question: I have an R script that uses odbc::dbConnect to connect to an SQL database (some databases are Azure, some are on-premises but connected to the Azure VPNs via the company's network, though I don't have any understanding of the network infrastructure itself) and then uses DBI::dbGetQuery to run a series of fairly complicated SQL queries and store the results as R dataframes, which can be manipulated and fed into my models. Because of insufficient memory on my local PC to run the script, I am
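
For reference, the Spark-native route to the same end result (push a heavy query down to the database and get a small local dataframe back for modelling) is a JDBC read with the query wrapped as a subquery, followed by a conversion to a local dataframe. A hedged PySpark sketch with placeholder connection details; sparklyr and DBI differ in syntax, but the underlying JDBC options are the same:

    jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"    # placeholder

    pushdown = "(SELECT <columns> FROM <table_name> WHERE <criteria>) AS q"           # placeholder query

    result = (spark.read
        .format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", pushdown)        # the subquery runs on the database side
        .option("user", "<username>")
        .option("password", "<password>")
        .load())

    local_df = result.toPandas()            # small results only; this pulls everything to the driver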

Append only new aggregates based on groupby keys

Submitted by Deadly on 2019-12-11 19:47:21
Question: I have to process some files which arrive daily. The information has a primary key of (date, client_id, operation_id). So I created a stream which appends only new data into a delta table:

    operations \
        .repartition('date') \
        .writeStream \
        .outputMode('append') \
        .trigger(once=True) \
        .option("checkpointLocation", "/mnt/sandbox/operations/_chk") \
        .format('delta') \
        .partitionBy('date') \
        .start('/mnt/sandbox/operations')

This is working fine, but I need to summarize this information grouped by (date
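
A second streaming query can maintain the summary. Plain append mode cannot revise aggregate rows that already exist, so the usual pattern is foreachBatch combined with a Delta MERGE keyed on the group-by columns. A hedged sketch, assuming the delta.tables Python API is available on the cluster and the summary table already exists (the summary path and aggregate columns are illustrative):

    from delta.tables import DeltaTable
    from pyspark.sql import functions as F

    def upsert_summary(batch_df, batch_id):
        # Aggregate the micro-batch, then merge it into the summary table on its key.
        agg = batch_df.groupBy("date", "client_id") \
                      .agg(F.count("operation_id").alias("operations"))
        summary = DeltaTable.forPath(spark, "/mnt/sandbox/operations_summary")
        (summary.alias("s")
            .merge(agg.alias("a"), "s.date = a.date AND s.client_id = a.client_id")
            .whenMatchedUpdate(set={"operations": "s.operations + a.operations"})
            .whenNotMatchedInsertAll()
            .execute())

    (spark.readStream.format("delta").load("/mnt/sandbox/operations")
        .writeStream
        .trigger(once=True)
        .option("checkpointLocation", "/mnt/sandbox/operations_summary/_chk")
        .foreachBatch(upsert_summary)
        .start())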

“No suitable driver” error when using sparklyr::spark_read_jdbc to query Azure SQL database from Azure Databricks

Submitted by 馋奶兔 on 2019-12-11 15:54:00
Question: I am trying to connect to an Azure SQL DB from a Databricks notebook using the sparklyr::spark_read_jdbc function. I am an analyst with no computer science background (beyond R and SQL) and no previous experience using Spark or JDBC (I have previously used local instances of R to connect to the same SQL database via ODBC), so I apologise if I've misunderstood something vital. My code is:

    sc <- spark_connect(method = "databricks")
    library(sparklyr)
    library(dplyr)
    config <- spark_config()
    db_tbl <-
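
The "No suitable driver" message usually means no JDBC driver class was matched to the URL; naming the SQL Server driver class explicitly in the connection options is the fix most answers converge on. The PySpark equivalent of that fix is sketched below (sparklyr takes the same driver entry in the options list passed to spark_read_jdbc; URL, table and credentials are placeholders):

    connection_properties = {
        "user": "<username>",
        "password": "<password>",
        # Naming the driver class explicitly is what clears "No suitable driver".
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }

    df = spark.read.jdbc(
        url="jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>",
        table="<table_name>",
        properties=connection_properties,
    )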