databricks

How to stop a notebook streaming job gracefully?

泪湿孤枕 submitted on 2020-04-16 13:51:46
Question: I have a streaming application running in a Databricks notebook job (https://docs.databricks.com/jobs.html). I would like to stop the streaming job gracefully using the stop() method of the StreamingQuery class, which is returned by the stream.start() method. That of course requires either having access to the streaming instance mentioned above or accessing the context of the running job itself. In the second case the code could look like this: spark.sqlContext.streams.get(
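A minimal sketch of the second approach, assuming the query was started with an explicit name so it can later be looked up through the StreamingQueryManager (spark.streams, the same manager the question reaches via spark.sqlContext.streams); the rate source and the name "my_stream" are illustrative, not from the question.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Start the stream with an explicit name so the running job can find it later.
query = (spark.readStream.format("rate").load()
         .writeStream.format("memory").queryName("my_stream").start())

# Later, from the job's Spark context, look the query up and stop it gracefully.
for q in spark.streams.active:
    if q.name == "my_stream":
        q.stop()  # StreamingQuery.stop(), as referenced in the question
```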

Where is the Delta table location stored?

北战南征 submitted on 2020-03-25 21:59:29
Question: We just migrated from Parquet to Databricks Delta using the Hive metastore. So far everything seems to work fine: when I print out the location of the new Delta table using DESCRIBE EXTENDED my_table, the location is correct, although it is different from the one found in the hiveMetastore database. When I access the hiveMetastore database I can successfully identify the target table (the provider is also correctly set to Delta). To retrieve the previous information I am executing a join between
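A minimal sketch of the DESCRIBE EXTENDED lookup mentioned above; my_table stands in for the asker's actual table name, and the filter assumes the usual col_name/data_type layout of that command's output.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# DESCRIBE EXTENDED returns rows of (col_name, data_type, comment);
# the "Location" row holds the path backing the Delta table.
details = spark.sql("DESCRIBE EXTENDED my_table")
details.filter("col_name = 'Location'").show(truncate=False)
```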

Saving Matplotlib Output to DBFS on Databricks

你。 submitted on 2020-03-01 04:41:16
Question: I'm writing Python code on Databricks to process some data and output graphs. I want to be able to save these graphs as a picture file (.png or similar; the exact format doesn't really matter) to DBFS. Code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'fruits': ['apple', 'banana'], 'count': [1, 2]})
plt.close()
df.set_index('fruits', inplace=True)
df.plot.bar()
# plt.show()
Things that I tried: plt.savefig("/FileStore/my-file.png"), which fails with [Errno 2] No such file or directory: '
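A minimal sketch of one common way to do this, assuming the cluster exposes DBFS through the local /dbfs FUSE mount so that local-file APIs such as plt.savefig can reach it; the file name is a placeholder.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'fruits': ['apple', 'banana'], 'count': [1, 2]})
df.set_index('fruits', inplace=True)
df.plot.bar()

# Local file APIs see DBFS under /dbfs/, so /FileStore/... becomes /dbfs/FileStore/...
plt.savefig("/dbfs/FileStore/my-file.png")
```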

How to extract a single (column/row) value from a dataframe using PySpark?

不羁的心 submitted on 2020-02-25 22:43:31
Question: Here's my Spark code. It works fine and returns 2517. All I want to do is print "2517 degrees"... but I'm not sure how to extract that 2517 into a variable. I can only display the dataframe, not extract values from it. Sounds super easy, but unfortunately I'm stuck! Any help will be appreciated. Thanks!
df = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").option("delimiter", "\t").load("dbfs:/databricks-datasets/power-plant/data")
df
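A minimal sketch of pulling a single value out of a PySpark dataframe into a Python variable, assuming sqlContext is available as in a Databricks notebook; the row/column picked below is illustrative, since the excerpt does not show which expression produces 2517.

```python
df = (sqlContext.read.format("csv")
      .option("header", "true").option("inferSchema", "true")
      .option("delimiter", "\t")
      .load("dbfs:/databricks-datasets/power-plant/data"))

# collect() returns plain Python Row objects on the driver;
# a Row can be indexed by position or by column name.
row = df.limit(1).collect()[0]   # first row; stands in for whatever produced 2517
value = row[0]                   # or row["AT"] to pick a column by name
print(f"{value} degrees")
```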

get datatype of column using pyspark

允我心安 submitted on 2020-02-17 05:51:08
Question: We are reading data from a MongoDB collection. A collection column can hold values of two different types (e.g. (bson.Int64, int) or (int, float)). I am trying to get a datatype using pyspark. My problem is that some columns have mixed datatypes. Assume quantity and weight are the columns:
quantity            weight
---------           --------
12300               656
123566000000        789.6767
1238                56.22
345                 23
345566677777789     21
Actually, we didn't define a data type for any column of the Mongo collection. When I query the count from the pyspark dataframe
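A minimal sketch of inspecting a column's Spark-side datatype; the dataframe construction is illustrative, since the excerpt does not show how the MongoDB collection is loaded.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative dataframe standing in for the one read from MongoDB.
df = spark.createDataFrame(
    [(12300, 656.0), (123566000000, 789.6767), (1238, 56.22)],
    ["quantity", "weight"],
)

# df.dtypes is a list of (column, type-string) pairs; df.schema exposes the
# same information as DataType objects.
print(dict(df.dtypes)["quantity"])     # e.g. 'bigint'
print(df.schema["weight"].dataType)    # e.g. DoubleType()
```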

Saving spark dataframe from azure databricks' notebook job to azure blob storage causes java.lang.NoSuchMethodError

a 夏天 submitted on 2020-02-06 10:14:06
Question: I have created a simple job using a notebook in Azure Databricks. I am trying to save a Spark dataframe from the notebook to Azure Blob Storage. Attaching the sample code:
import traceback
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
# Attached the spark-submit command used
# spark-submit --master local[1] --packages org.apache.hadoop:hadoop-azure:2.7.2,
# com.microsoft.azure:azure-storage:3.1.0 ./write_to_blob_from_spark.py
# Tried with com.microsoft.azure:azure
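A minimal sketch of the kind of write being attempted, assuming the storage account key is set on the Spark session and the wasbs:// URL points at an existing container; the account, container, key, and output path are placeholders, and the NoSuchMethodError itself is not addressed here.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Placeholder storage account, container, and key.
account = "mystorageaccount"
container = "mycontainer"
spark.conf.set(f"fs.azure.account.key.{account}.blob.core.windows.net",
               "<storage-account-key>")

# Illustrative dataframe standing in for the one in the job.
df = spark.createDataFrame(["a", "b", "c"], StringType()).toDF("value")

# Write through the wasbs:// scheme provided by hadoop-azure.
df.write.mode("overwrite").csv(
    f"wasbs://{container}@{account}.blob.core.windows.net/output/")
```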

Scala & DataBricks: Getting a list of Files

久未见 submitted on 2020-02-04 22:58:26
Question: I am trying to make a list of files in an S3 bucket on Databricks within Scala, and then filter them with a regex. I am very new to Scala. The Python equivalent would be:
all_files = map(lambda x: x.path, dbutils.fs.ls(folder))
filtered_files = filter(lambda name: True if pattern.match(name) else False, all_files)
but I want to do this in Scala. From https://alvinalexander.com/scala/how-to-list-files-in-directory-filter-names-scala:
import java.io.File
def getListOfFiles(dir: String): List[File] = {
val d
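A minimal sketch of a Scala equivalent, assuming it runs in a Databricks Scala notebook where dbutils is available (rather than java.io.File, which only sees the driver's local filesystem); the folder path and regex are placeholders.

```scala
// Placeholders for the asker's bucket path (or mount) and regex.
val folder = "dbfs:/mnt/my-bucket/some-prefix/"
val pattern = ".*\\.csv$".r

// dbutils.fs.ls returns FileInfo objects; keep the paths, as in the Python map.
val allFiles = dbutils.fs.ls(folder).map(_.path)

// Keep only the paths the regex matches, as in the Python filter.
val filteredFiles = allFiles.filter(p => pattern.findFirstIn(p).isDefined)
```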