databricks

Copy file from dbfs in cluster-scoped init script

Submitted on 2019-12-25 01:40:03
Question: I want to try out cluster-scoped init scripts on an Azure Databricks cluster. I'm struggling to see which commands are available. Basically, I've got a file on DBFS that I want to copy to a local directory /tmp/config when the cluster spins up. So I created a very simple bash script: #!/bin/bash mkdir -p /tmp/config databricks fs cp dbfs:/path/to/myFile.conf /tmp/config Spinning up the cluster fails with "Cluster terminated. Reason: Init Script Failure". Looking at the log on DBFS, I see the
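A common cause of this failure is that the databricks CLI is not installed on the cluster nodes where init scripts run; the DBFS fuse mount (/dbfs/...) can be used instead. Below is a minimal sketch, run once from a notebook, that stores such an init script on DBFS; the script path and config file path are placeholders, not the asker's real locations.

# Hedged sketch: write a cluster-scoped init script to DBFS from a notebook.
# Inside the script, the /dbfs fuse mount replaces the databricks CLI,
# which is not available on cluster nodes. Paths are placeholders.
dbutils.fs.put(
    "dbfs:/databricks/init-scripts/copy-config.sh",
    """#!/bin/bash
set -e
mkdir -p /tmp/config
cp /dbfs/path/to/myFile.conf /tmp/config/
""",
    True,
)

The script would then be referenced as a cluster-scoped init script (dbfs:/databricks/init-scripts/copy-config.sh) in the cluster configuration.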

Connecting Spark Streaming to Tableau

Submitted on 2019-12-24 21:16:52
Question: I am streaming tweets from a Twitter app to Spark for analysis. I want to output the resulting Spark SQL table to Tableau for real-time analysis locally. I have already tried connecting to Databricks to run the program, but I haven't been able to connect the Twitter app to the Databricks notebook. My code for writing the stream looks like this: activityQuery = output.writeStream.trigger(processingTime='1 seconds').queryName("Places")\ .format("memory")\ .start() Source: https://stackoverflow.com
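For context, here is a minimal sketch of what the memory sink in the asker's code does and why Tableau cannot read it directly; the DataFrame `output` is assumed to exist as in the question.

# Sketch: the memory sink registers an in-memory table on the Spark driver.
activityQuery = (
    output.writeStream
    .trigger(processingTime="1 seconds")
    .queryName("Places")          # creates an in-memory table named "Places"
    .outputMode("append")
    .format("memory")
    .start()
)

# The in-memory table is only visible inside the Spark session; an external
# tool such as Tableau would need to connect over JDBC/ODBC or read a table
# persisted by a file/Delta sink instead.
spark.sql("SELECT * FROM Places").show()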

Stream data into Azure Databricks using Event Hubs

Submitted on 2019-12-24 19:29:53
Question: I want to send messages from a Twitter application to an Azure Event Hub. However, I am getting an error that says to use java.util.concurrent.ScheduledExecutorService instead of java.util.concurrent.ExecutorService. I do not know how to call EventHubClient.create now. Please help. I am referring to code from the link https://docs.microsoft.com/en-us/azure/azure-databricks/databricks-stream-from-eventhubs This is the error I am getting: notebook:15: error: type mismatch; found : java.util
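The type mismatch itself is on the Scala side: EventHubClient.create now expects a ScheduledExecutorService (for example one obtained from Executors.newScheduledThreadPool) rather than a plain ExecutorService. As a hedged alternative for the sending half of the scenario, here is a Python sketch using the azure-eventhub SDK (v5 API assumed; connection string and hub name are placeholders, and this is not the Scala code from the linked doc).

# Sketch: publish messages to an Azure Event Hub with the azure-eventhub SDK.
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hub-namespace-connection-string>",
    eventhub_name="<event-hub-name>",
)

batch = producer.create_batch()
batch.add(EventData("example tweet payload"))  # one event per message
producer.send_batch(batch)
producer.close()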

Databricks - Structured Streaming: Console Format Displaying Nothing

Submitted on 2019-12-24 15:42:59
Question: I am learning Structured Streaming with Databricks and I'm struggling with the DataStreamWriter console mode. My program: Simulates the streaming arrival of files to the folder "monitoring_dir" (one new file is transferred from "source_dir" every 10 seconds). Uses a DataStreamReader to populate the unbounded DataFrame "inputUDF" with the content of each new file. Uses a DataStreamWriter to output the new rows of "inputUDF" to a valid sink. Whereas the program works when choosing to use a File
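On Databricks the console sink writes to the driver's stdout (visible only in the driver logs), not to the notebook cell, which is why its output appears to show nothing. A minimal sketch of two common alternatives follows; `inputUDF` is taken from the question and the query name is a placeholder.

# Option 1: Databricks' display() renders a streaming DataFrame in the notebook.
display(inputUDF)

# Option 2: the memory sink keeps results in an in-memory table queryable via SQL.
query = (
    inputUDF.writeStream
    .outputMode("append")
    .format("memory")
    .queryName("monitoring_results")
    .start()
)
spark.sql("SELECT * FROM monitoring_results").show()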

spark_apply works on one dataset, but fails on another (both datasets are of same type and structure)

Submitted on 2019-12-24 12:21:51
Question: I am working with sparklyr on Databricks. The issue I am facing is that spark_apply() is throwing an error when I run it on one dataset, but works fine when it is run on another dataset (of the same structure and type). Am I missing something? The error message (reproduced) doesn't help much. Simple spark_apply function below: spark_apply(hr2, function(y) y*2) Schema and class of hr2: $LITRES $LITRES$name [1] "LITRES" $LITRES$type [1] "DoubleType" class(hr2) [1] "tbl_spark" "tbl_sql" "tbl_lazy"

How to display and download a pptx file from databricks?

Submitted on 2019-12-24 12:18:28
Question: I generated a PowerPoint deck with a utility script in Databricks using Python. I want to access the file now in the kernel, but due to the images in the deck, it shows strange symbols. How do I correct this statement which outputs the deck image? #access file dbutils.fs.head('file:/dbfs/user/test.pptx') Out: 'PK\x03\x04\x14\x00\x00\x00\x08\x00D�lOƯ�g�\x01\x00\x00�\x0c\x00\x00\x13\x00\x00\x00[Content_Types].xml͗�N�0\x10��<E�K\x0e�q�\x175��rb�\x04<�I����-ϴзg�.��R�\n_\x12�3���\'Q4霼�:\x1a�GeM�l�
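A .pptx file is a binary zip archive (hence the leading "PK" bytes), so dbutils.fs.head() will always show raw bytes rather than readable content. A common pattern, sketched below under the assumption that the workspace's /files/ URL is enabled, is to copy the deck under /FileStore so it can be downloaded through the browser; paths follow the question but the destination folder is a placeholder.

# Sketch: copy the generated deck to /FileStore so it can be downloaded over HTTPS.
dbutils.fs.cp("file:/dbfs/user/test.pptx", "dbfs:/FileStore/decks/test.pptx")

# The file should then be downloadable at (instance URL is an assumption):
#   https://<databricks-instance>/files/decks/test.pptx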

Write the results of the Google Api to a data lake with Databricks

Submitted on 2019-12-24 10:55:14
Question: I am getting back user usage data from the Google Admin Report User Usage API via the Python SDK on Databricks. The data size is around 100,000 records per day, which I fetch nightly via a batch process. The API returns a max page size of 1000, so I call it roughly 100 times to get the data I need for the day. This is working fine. My ultimate aim is to store the data in its raw format in a data lake (Azure Gen2, but irrelevant to this question). Later on, I will transform the data using Databricks
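One way to keep the raw format is to land each API page as-is in the lake, one JSON file per page. The sketch below assumes the google-api-python-client Admin Reports API, an already-built `creds` object, and a lake path mounted under /dbfs/mnt/; all of those names, the date, and the folder layout are placeholders rather than the asker's setup.

# Hedged sketch: page through the user usage report and land each raw page as JSON.
import json
import os
from googleapiclient.discovery import build  # google-api-python-client

service = build("admin", "reports_v1", credentials=creds)

out_dir = "/dbfs/mnt/datalake/raw/google_usage/2019-12-01"
os.makedirs(out_dir, exist_ok=True)

page_token = None
page_number = 0
while True:
    response = service.userUsageReport().get(
        userKey="all", date="2019-12-01", pageToken=page_token
    ).execute()

    # Store the page untouched so the lake keeps the raw API shape.
    with open(f"{out_dir}/page_{page_number:04d}.json", "w") as f:
        json.dump(response, f)

    page_token = response.get("nextPageToken")
    page_number += 1
    if not page_token:
        break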

Spark CosmosDB Sink: org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame

Submitted on 2019-12-24 10:11:38
Question: I am reading a data stream from Event Hub in Spark (using Databricks). My goal is to be able to write the streamed data to Cosmos DB. However, I get the following error: org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame. Is this scenario not supported? Spark versions: 2.2.0 and 2.3.0 Libraries used: json-20140107 rxnetty-0.4.20 azure-documentdb-1.14.0 azure-documentdb-rx-0.9.0-rc2 azure-cosmosdb-spark_2.2.0_2.11-1.0.0 rxjava-1.3.0 azure-eventhubs
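The error itself comes from calling .write on a streaming DataFrame; streaming output must go through .writeStream. A hedged sketch follows, where the sink provider class and option names are assumptions based on the azure-cosmosdb-spark connector documentation (check the version actually attached to the cluster), and `streaming_df`, the account endpoint, key, and checkpoint path are placeholders.

# Sketch: route a streaming DataFrame to Cosmos DB via writeStream, not write.
cosmos_config = {
    "Endpoint": "https://<account>.documents.azure.com:443/",
    "Masterkey": "<key>",
    "Database": "mydb",
    "Collection": "mycollection",
    "Upsert": "true",
}

query = (
    streaming_df.writeStream
    .format("com.microsoft.azure.cosmosdb.spark.streaming.CosmosDBSinkProvider")
    .outputMode("append")
    .options(**cosmos_config)
    .option("checkpointLocation", "/tmp/cosmos_checkpoint")
    .start()
)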

How to re-direct logs from Azure Databricks to another destination?

Submitted on 2019-12-24 03:23:49
Question: We could use some help on how to send Spark driver and worker logs to a destination outside Azure Databricks, e.g. Azure Blob storage or Elasticsearch using Elastic Beats. When configuring a new cluster, the only option offered for the log delivery destination is DBFS, see https://docs.azuredatabricks.net/user-guide/clusters/log-delivery.html. Any input much appreciated, thanks! Answer 1: Maybe the following could be helpful: First you specify a DBFS location for your Spark driver and worker
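Since cluster log delivery on Azure Databricks only targets DBFS, one common pattern is to deliver to DBFS and forward from there to external storage. The sketch below uses the cluster_log_conf field of the Clusters API for the delivery part; the cluster spec values, mount path, and forwarding schedule are placeholders, and the copy step is only one possible way to forward.

# Hedged sketch: deliver logs to DBFS via the Clusters API, then copy them onward.
cluster_spec = {
    "cluster_name": "logged-cluster",
    "spark_version": "5.5.x-scala2.11",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    # Driver and worker logs are delivered here periodically.
    "cluster_log_conf": {"dbfs": {"destination": "dbfs:/cluster-logs"}},
}

# A separate scheduled job could then copy the delivered logs to a mounted Blob
# storage / ADLS path, where a shipper such as Elastic Beats can pick them up.
dbutils.fs.cp("dbfs:/cluster-logs", "dbfs:/mnt/blob-logs", recurse=True)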

Parsing datetime from ISO 8601 using Spark SQL

Submitted on 2019-12-23 21:18:09
Question: Want to do this but the other way around. My dates are in the format YYYY-MM-DDThh:mm:ss, and I want two columns, YYYY-MM-DD and hh:mm, that I can concat, if I want to, for certain queries. I get an error when using convert(); I assume this is not supported currently with Spark SQL. When I use date(datetime) or timestamp(datetime), I get all null values returned. However, minute(datetime) and hour(datetime) work. Currently, using this concat(date,' ', hour,':', (case when minute < 10 then
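A minimal PySpark sketch of one way to split an ISO 8601 string into date and hh:mm columns follows, using to_timestamp and date_format (available in Spark 2.2+); `df` and the column name `datetime` are taken from the question, the output column names are placeholders.

# Sketch: parse the ISO 8601 string once, then format the pieces as needed.
from pyspark.sql import functions as F

parsed = df.withColumn("ts", F.to_timestamp("datetime", "yyyy-MM-dd'T'HH:mm:ss"))

result = (
    parsed
    .withColumn("date", F.date_format("ts", "yyyy-MM-dd"))
    .withColumn("time", F.date_format("ts", "HH:mm"))
    # Re-concatenate on demand for queries that need the combined value back.
    .withColumn("date_time", F.concat_ws(" ", "date", "time"))
)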