databricks

Spark - Reading JSON from Partitioned Folders using Firehose

Submitted by 二次信任 on 2020-06-22 11:50:52
Question: Kinesis Firehose manages the persistence of files, in this case time-series JSON, into a folder hierarchy partitioned by YYYY/MM/DD/HH (down to the hour, in 24-hour numbering)... great. How, using Spark 2.0, can I read these nested subfolders and create a static DataFrame from all the leaf JSON files? Is there an 'option' for the DataFrame reader? My next goal is for this to be a streaming DataFrame, where new files persisted by Firehose into S3 naturally become part of the streaming DataFrame.
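
No answer is recorded for this entry. A minimal sketch of one common approach, assuming PySpark and a hypothetical Firehose prefix s3://my-bucket/firehose/: glob patterns expand the YYYY/MM/DD/HH folders for a static read, and the same prefix works as a Structured Streaming file source so that new objects written by Firehose are picked up as they arrive.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Static read: the glob expands the YYYY/MM/DD/HH partition folders down to the leaf files.
static_df = spark.read.json("s3://my-bucket/firehose/*/*/*/*/*.json")

# Streaming read over the same prefix; file-source streams require an explicit schema,
# so reuse the one inferred by the static read.
streaming_df = (spark.readStream
                .schema(static_df.schema)
                .json("s3://my-bucket/firehose/*/*/*/*/"))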

ModuleNotFoundError: No module named 'pyspark.dbutils'

Submitted by 夙愿已清 on 2020-06-17 09:59:11
Question: I am running PySpark from an Azure Machine Learning notebook and am trying to move a file using the dbutils module:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def get_dbutils(spark):
    try:
        from pyspark.dbutils import DBUtils
        dbutils = DBUtils(spark)
    except ImportError:
        import IPython
        dbutils = IPython.get_ipython().user_ns["dbutils"]
    return dbutils

dbutils = get_dbutils(spark)
dbutils.fs.cp("file:source", "dbfs:destination")

I got this error: ModuleNotFoundError: No module named 'pyspark.dbutils'

Databricks SQL Server connection across multiple notebooks

Submitted by 不羁岁月 on 2020-06-17 09:45:14
Question: I found some resources on how to pass variables across PySpark Databricks notebooks. I'm curious whether we can pass a SQL Server connection, for example having the host/database/port/user/password in Notebook A and calling the connection from Notebook B. Answer 1: Take a look at this part of the Databricks documentation: https://docs.databricks.com/notebooks/notebook-workflows.html#pass-structured-data. This way you can pass strings, one or several, across notebooks, but you'll have to create the connection in Notebook B.
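
A minimal sketch of what that linked pattern can look like, assuming hypothetical notebook names, table, and JDBC parameters (dbutils and spark are provided by the Databricks runtime):

# Notebook A: serialise the connection details and pass them as a string argument.
import json

conn = {
    "host": "myserver.database.windows.net",   # hypothetical values
    "port": 1433,
    "database": "mydb",
    "user": "etl_user",
    "password": dbutils.secrets.get("my-scope", "sqlserver-pw"),
}
dbutils.notebook.run("Notebook_B", 600, {"conn": json.dumps(conn)})

# Notebook B: read the argument back and build the connection locally.
import json

conn = json.loads(dbutils.widgets.get("conn"))
df = (spark.read.format("jdbc")
      .option("url", f"jdbc:sqlserver://{conn['host']}:{conn['port']};database={conn['database']}")
      .option("user", conn["user"])
      .option("password", conn["password"])
      .option("dbtable", "dbo.some_table")     # hypothetical table
      .load())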

How to explode an array without duplicate records

Submitted by 為{幸葍}努か on 2020-06-17 09:38:06
Question: This is a continuation of the question here on PySpark SQL: Add different Qtr start_date, end_date for exploded rows. Thanks. I have the following dataframe, which has an array list as a column:

+---------------+------------+----------+----------+---+---------+-------+---------+
|customer_number|sales_target|start_date|end_date  |noq|cf_values|new_sdt|new_edate|
+---------------+------------+----------+----------+---+---------+-------+---------+
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4
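
The entry is cut off before any answer is recorded. A minimal sketch of one common way to avoid duplicate rows, assuming the array columns cf_values, new_sdt and new_edate have equal length: zip them into a single array of structs and explode once, instead of exploding each array separately (which multiplies the rows).

from pyspark.sql import functions as F

exploded = (df
    .withColumn("z", F.explode(F.arrays_zip("cf_values", "new_sdt", "new_edate")))
    .select("customer_number", "sales_target", "start_date", "end_date", "noq",
            F.col("z.cf_values").alias("cf_value"),
            F.col("z.new_sdt").alias("qtr_start"),
            F.col("z.new_edate").alias("qtr_end")))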

Spark Read Json: how to read field that alternates between integer and struct

Submitted by 試著忘記壹切 on 2020-06-16 17:27:37
Question: I am trying to read multiple JSON files into a dataframe. Both files have a "Value" node, but the type of this node alternates between integer and struct.

File 1: { "Value": 123 }

File 2: { "Value": { "Value": "On", "ValueType": "State", "IsSystemValue": true } }

My goal is to read the files into a dataframe like this:

|------|-------|-----------|---------------|
| File | Value | ValueType | IsSystemValue |
|------|-------|-----------
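
No answer is recorded for this entry. A minimal sketch of one workaround, assuming PySpark and hypothetical file paths: declare "Value" as a plain string so both shapes load without a schema conflict (Spark keeps the raw JSON text when an object is read into a StringType field), then parse the struct variant with from_json and coalesce the two cases.

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, BooleanType

raw_schema = StructType([StructField("Value", StringType())])
df = (spark.read.schema(raw_schema)
      .json(["file1.json", "file2.json"])          # hypothetical paths
      .withColumn("File", F.input_file_name()))

value_struct = StructType([
    StructField("Value", StringType()),
    StructField("ValueType", StringType()),
    StructField("IsSystemValue", BooleanType()),
])

parsed = (df
    .withColumn("nested", F.from_json("Value", value_struct))
    .select("File",
            F.coalesce(F.col("nested.Value"), F.col("Value")).alias("Value"),
            F.col("nested.ValueType").alias("ValueType"),
            F.col("nested.IsSystemValue").alias("IsSystemValue")))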

Does Apache Spark SQL support MERGE clause?

Submitted by 柔情痞子 on 2020-06-16 04:34:21
Question: Does Apache Spark SQL support a MERGE clause similar to Oracle's MERGE SQL clause?

MERGE INTO <table>
USING (SELECT * FROM <table1>)
WHEN MATCHED THEN UPDATE ... DELETE WHERE ...
WHEN NOT MATCHED THEN INSERT ...

Answer 1: It does, with Delta Lake as the storage format:

df.write.format("delta").save("/data/events")

DeltaTable.forPath(spark, "/data/events/")
  .as("events")
  .merge(
    updatesDF.as("updates"),
    "events.eventId = updates.eventId")
  .whenMatched
  .updateExpr(
    Map("data" -> "updates.data"))
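
For completeness, a sketch of the same merge through the Delta Lake Python API, assuming a Delta-enabled Spark session and the same hypothetical /data/events path and updatesDF:

from delta.tables import DeltaTable

events = DeltaTable.forPath(spark, "/data/events/")

(events.alias("events")
    .merge(updatesDF.alias("updates"), "events.eventId = updates.eventId")
    .whenMatchedUpdateExpr({"data": "updates.data"})   # update matched rows
    .whenNotMatchedInsertAll()                         # insert rows with no match
    .execute())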

Data Governance solution for Databricks, Synapse and ADLS gen2

Submitted by 送分小仙女□ on 2020-06-10 06:45:31
Question: I'm new to data governance, so forgive me if the question lacks some information. Objective: we're building a data lake & enterprise data warehouse from scratch for a mid-size telecom company on the Azure platform. We're using ADLS gen2, Databricks and Synapse for our ETL processing, data science, ML & QA activities. We already have about a hundred input tables and 25 TB/year, and we expect more in the future. The business requirements lean strongly towards cloud-agnostic solutions. Still, they are okay
