databricks

Copy file from dbfs in cluster-scoped init script

Submitted on 2019-12-25 01:40:03
Question: I want to try out cluster-scoped init scripts on an Azure Databricks cluster. I'm struggling to see which commands are available. Basically, I've got a file on DBFS that I want to copy to a local directory /tmp/config when the cluster spins up. So I created a very simple bash script: #!/bin/bash mkdir -p /tmp/config databricks fs cp dbfs:/path/to/myFile.conf /tmp/config Spinning up the cluster fails with "Cluster terminated. Reason: Init Script Failure". Looking at the log on DBFS, I see the
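A common cause of this failure is that the databricks CLI is not installed on the cluster nodes where init scripts run; the DBFS fuse mount (/dbfs/...) can be used instead. Below is a minimal sketch, run once from a notebook, that stores such an init script on DBFS; the script path and config file path are placeholders, not the asker's real locations.

# Hedged sketch: write a cluster-scoped init script to DBFS from a notebook.
# Inside the script, the /dbfs fuse mount replaces the databricks CLI,
# which is not available on cluster nodes. Paths are placeholders.
dbutils.fs.put(
    "dbfs:/databricks/init-scripts/copy-config.sh",
    """#!/bin/bash
set -e
mkdir -p /tmp/config
cp /dbfs/path/to/myFile.conf /tmp/config/
""",
    True,
)

The script would then be referenced as a cluster-scoped init script (dbfs:/databricks/init-scripts/copy-config.sh) in the cluster configuration.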

Connecting Spark Streaming to Tableau

Submitted on 2019-12-24 21:16:52
Question: I am streaming tweets from a Twitter app to Spark for analysis. I want to output the resulting Spark SQL table to Tableau for real-time analysis locally. I have already tried connecting to Databricks to run the program, but I haven't been able to connect the Twitter app to the Databricks notebook. My code for writing the stream looks like this: activityQuery = output.writeStream.trigger(processingTime='1 seconds').queryName("Places")\ .format("memory")\ .start() Source: https://stackoverflow.com
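For context, here is a minimal sketch of what the memory sink in the asker's code does and why Tableau cannot read it directly; the DataFrame `output` is assumed to exist as in the question.

# Sketch: the memory sink registers an in-memory table on the Spark driver.
activityQuery = (
    output.writeStream
    .trigger(processingTime="1 seconds")
    .queryName("Places")          # creates an in-memory table named "Places"
    .outputMode("append")
    .format("memory")
    .start()
)

# The in-memory table is only visible inside the Spark session; an external
# tool such as Tableau would need to connect over JDBC/ODBC or read a table
# persisted by a file/Delta sink instead.
spark.sql("SELECT * FROM Places").show()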

Stream data into Azure Databricks using Event Hubs

Submitted on 2019-12-24 19:29:53
Question: I want to send messages from a Twitter application to an Azure Event Hub. However, I am getting an error that says to use java.util.concurrent.ScheduledExecutorService instead of java.util.concurrent.ExecutorService. I do not know how to call EventHubClient.create now. Please help. I am referring to code from the link https://docs.microsoft.com/en-us/azure/azure-databricks/databricks-stream-from-eventhubs This is the error I am getting: notebook:15: error: type mismatch; found : java.util
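The type mismatch itself is on the Scala side: EventHubClient.create now expects a ScheduledExecutorService (for example one obtained from Executors.newScheduledThreadPool) rather than a plain ExecutorService. As a hedged alternative for the sending half of the scenario, here is a Python sketch using the azure-eventhub SDK (v5 API assumed; connection string and hub name are placeholders, and this is not the Scala code from the linked doc).

# Sketch: publish messages to an Azure Event Hub with the azure-eventhub SDK.
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hub-namespace-connection-string>",
    eventhub_name="<event-hub-name>",
)

batch = producer.create_batch()
batch.add(EventData("example tweet payload"))  # one event per message
producer.send_batch(batch)
producer.close()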

Databricks - Structured Streaming: Console Format Displaying Nothing

Submitted on 2019-12-24 15:42:59
Question: I am learning Structured Streaming with Databricks and I'm struggling with the DataStreamWriter console mode. My program: Simulates the streaming arrival of files to the folder "monitoring_dir" (one new file is transferred from "source_dir" every 10 seconds). Uses a DataStreamReader to populate the unbounded DataFrame "inputUDF" with the content of each new file. Uses a DataStreamWriter to output the new rows of "inputUDF" to a valid sink. Whereas the program works when choosing to use a File
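On Databricks the console sink writes to the driver's stdout (visible only in the driver logs), not to the notebook cell, which is why its output appears to show nothing. A minimal sketch of two common alternatives follows; `inputUDF` is taken from the question and the query name is a placeholder.

# Option 1: Databricks' display() renders a streaming DataFrame in the notebook.
display(inputUDF)

# Option 2: the memory sink keeps results in an in-memory table queryable via SQL.
query = (
    inputUDF.writeStream
    .outputMode("append")
    .format("memory")
    .queryName("monitoring_results")
    .start()
)
spark.sql("SELECT * FROM monitoring_results").show()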

spark_apply works on one dataset, but fails on another (both datasets are of same type and structure)

Submitted on 2019-12-24 12:21:51
Question: I am working with sparklyr on Databricks. The issue I am facing is that spark_apply() is throwing an error when I run it on one dataset, but works fine when it is run on another dataset (of the same structure and type). Am I missing something? The error message (reproduced) doesn't help much. Simple spark_apply function below: spark_apply(hr2, function(y) y*2) Schema and class of hr2: $LITRES $LITRES$name [1] "LITRES" $LITRES$type [1] "DoubleType" class(hr2) [1] "tbl_spark" "tbl_sql" "tbl_lazy"

How to display and download a pptx file from databricks?

Submitted on 2019-12-24 12:18:28
Question: I generated a PowerPoint deck with a utility script in Databricks using Python. I want to access the file now in the kernel, but due to the images in the deck, it shows strange symbols. How do I correct this statement which outputs the deck image? #access file dbutils.fs.head('file:/dbfs/user/test.pptx') Out: 'PK\x03\x04\x14\x00\x00\x00\x08\x00D�lOƯ�g�\x01\x00\x00�\x0c\x00\x00\x13\x00\x00\x00[Content_Types].xml͗�N�0\x10��<E�K\x0e�q�\x175��rb�\x04<�I����-ϴзg�.��R�\n_\x12�3���\'Q4霼�:\x1a�GeM�l�
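A .pptx file is a binary zip archive (hence the leading "PK" bytes), so dbutils.fs.head() will always show raw bytes rather than readable content. A common pattern, sketched below under the assumption that the workspace's /files/ URL is enabled, is to copy the deck under /FileStore so it can be downloaded through the browser; paths follow the question but the destination folder is a placeholder.

# Sketch: copy the generated deck to /FileStore so it can be downloaded over HTTPS.
dbutils.fs.cp("file:/dbfs/user/test.pptx", "dbfs:/FileStore/decks/test.pptx")

# The file should then be downloadable at (instance URL is an assumption):
#   https://<databricks-instance>/files/decks/test.pptx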

Write the results of the Google Api to a data lake with Databricks

Submitted on 2019-12-24 10:55:14
Question: I am getting back user usage data from the Google Admin Report User Usage API via the Python SDK on Databricks. The data size is around 100,000 records per day, which I fetch nightly via a batch process. The API returns a max page size of 1000, so I call it roughly 100 times to get the data I need for the day. This is working fine. My ultimate aim is to store the data in its raw format in a data lake (Azure Gen2, but irrelevant to this question). Later on, I will transform the data using Databricks
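One way to keep the raw format is to land each API page as-is in the lake, one JSON file per page. The sketch below assumes the google-api-python-client Admin Reports API, an already-built `creds` object, and a lake path mounted under /dbfs/mnt/; all of those names, the date, and the folder layout are placeholders rather than the asker's setup.

# Hedged sketch: page through the user usage report and land each raw page as JSON.
import json
import os
from googleapiclient.discovery import build  # google-api-python-client

service = build("admin", "reports_v1", credentials=creds)

out_dir = "/dbfs/mnt/datalake/raw/google_usage/2019-12-01"
os.makedirs(out_dir, exist_ok=True)

page_token = None
page_number = 0
while True:
    response = service.userUsageReport().get(
        userKey="all", date="2019-12-01", pageToken=page_token
    ).execute()

    # Store the page untouched so the lake keeps the raw API shape.
    with open(f"{out_dir}/page_{page_number:04d}.json", "w") as f:
        json.dump(response, f)

    page_token = response.get("nextPageToken")
    page_number += 1
    if not page_token:
        break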

Spark CosmosDB Sink: org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame

Submitted on 2019-12-24 10:11:38
Question: I am reading a data stream from Event Hub in Spark (using Databricks). My goal is to be able to write the streamed data to Cosmos DB. However, I get the following error: org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame. Is this scenario not supported? Spark versions: 2.2.0 and 2.3.0 Libraries used: json-20140107 rxnetty-0.4.20 azure-documentdb-1.14.0 azure-documentdb-rx-0.9.0-rc2 azure-cosmosdb-spark_2.2.0_2.11-1.0.0 rxjava-1.3.0 azure-eventhubs
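The error itself comes from calling .write on a streaming DataFrame; streaming output must go through .writeStream. A hedged sketch follows, where the sink provider class and option names are assumptions based on the azure-cosmosdb-spark connector documentation (check the version actually attached to the cluster), and `streaming_df`, the account endpoint, key, and checkpoint path are placeholders.

# Sketch: route a streaming DataFrame to Cosmos DB via writeStream, not write.
cosmos_config = {
    "Endpoint": "https://<account>.documents.azure.com:443/",
    "Masterkey": "<key>",
    "Database": "mydb",
    "Collection": "mycollection",
    "Upsert": "true",
}

query = (
    streaming_df.writeStream
    .format("com.microsoft.azure.cosmosdb.spark.streaming.CosmosDBSinkProvider")
    .outputMode("append")
    .options(**cosmos_config)
    .option("checkpointLocation", "/tmp/cosmos_checkpoint")
    .start()
)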

How to re-direct logs from Azure Databricks to another destination?

Submitted on 2019-12-24 03:23:49
Question: We could use some help on how to send Spark driver and worker logs to a destination outside Azure Databricks, e.g. Azure Blob storage or Elasticsearch using Elastic Beats. When configuring a new cluster, the only option offered for the log delivery destination is DBFS, see https://docs.azuredatabricks.net/user-guide/clusters/log-delivery.html. Any input much appreciated, thanks! Answer 1: Maybe the following could be helpful: First you specify a DBFS location for your Spark driver and worker
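Since cluster log delivery on Azure Databricks only targets DBFS, one common pattern is to deliver to DBFS and forward from there to external storage. The sketch below uses the cluster_log_conf field of the Clusters API for the delivery part; the cluster spec values, mount path, and forwarding schedule are placeholders, and the copy step is only one possible way to forward.

# Hedged sketch: deliver logs to DBFS via the Clusters API, then copy them onward.
cluster_spec = {
    "cluster_name": "logged-cluster",
    "spark_version": "5.5.x-scala2.11",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    # Driver and worker logs are delivered here periodically.
    "cluster_log_conf": {"dbfs": {"destination": "dbfs:/cluster-logs"}},
}

# A separate scheduled job could then copy the delivered logs to a mounted Blob
# storage / ADLS path, where a shipper such as Elastic Beats can pick them up.
dbutils.fs.cp("dbfs:/cluster-logs", "dbfs:/mnt/blob-logs", recurse=True)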

Parsing datetime from ISO 8601 using Spark SQL

Submitted on 2019-12-23 21:18:09
Question: Want to do this but the other way around. My dates are in the format YYYY-MM-DDThh:mm:ss, and I want two columns, YYYY-MM-DD and hh:mm, that I can concat, if I want to, for certain queries. I get an error when using convert(); I assume this is not supported currently with Spark SQL. When I use date(datetime) or timestamp(datetime), I get all null values returned. However, minute(datetime) and hour(datetime) work. Currently, using this concat(date,' ', hour,':', (case when minute < 10 then
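A minimal PySpark sketch of one way to split an ISO 8601 string into date and hh:mm columns follows, using to_timestamp and date_format (available in Spark 2.2+); `df` and the column name `datetime` are taken from the question, the output column names are placeholders.

# Sketch: parse the ISO 8601 string once, then format the pieces as needed.
from pyspark.sql import functions as F

parsed = df.withColumn("ts", F.to_timestamp("datetime", "yyyy-MM-dd'T'HH:mm:ss"))

result = (
    parsed
    .withColumn("date", F.date_format("ts", "yyyy-MM-dd"))
    .withColumn("time", F.date_format("ts", "HH:mm"))
    # Re-concatenate on demand for queries that need the combined value back.
    .withColumn("date_time", F.concat_ws(" ", "date", "time"))
)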