databricks

Spark - Reading JSON from Partitioned Folders using Firehose

Submitted by 二次信任 on 2020-06-22 11:50:52
Question: Kinesis Firehose manages the persistence of files, in this case time-series JSON, into a folder hierarchy partitioned by YYYY/MM/DD/HH (down to the hour, in 24-hour numbering)... great. How, using Spark 2.0, can I read these nested subfolders and create a static DataFrame from all the leaf JSON files? Is there an 'option' for the DataFrame reader? My next goal is for this to be a streaming DataFrame, where new files persisted by Firehose into S3 naturally become part of the streaming DataFrame.
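
No answer is recorded for this entry. A minimal sketch of one common approach, assuming PySpark and a hypothetical Firehose prefix s3://my-bucket/firehose/: glob patterns expand the YYYY/MM/DD/HH folders for a static read, and the same prefix works as a Structured Streaming file source so that new objects written by Firehose are picked up as they arrive.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Static read: the glob expands the YYYY/MM/DD/HH partition folders down to the leaf files.
static_df = spark.read.json("s3://my-bucket/firehose/*/*/*/*/*.json")

# Streaming read over the same prefix; file-source streams require an explicit schema,
# so reuse the one inferred by the static read.
streaming_df = (spark.readStream
                .schema(static_df.schema)
                .json("s3://my-bucket/firehose/*/*/*/*/"))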

ModuleNotFoundError: No module named 'pyspark.dbutils'

Submitted by 夙愿已清 on 2020-06-17 09:59:11
Question: I am running PySpark from an Azure Machine Learning notebook and am trying to move a file using the dbutils module:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def get_dbutils(spark):
    try:
        from pyspark.dbutils import DBUtils
        dbutils = DBUtils(spark)
    except ImportError:
        import IPython
        dbutils = IPython.get_ipython().user_ns["dbutils"]
    return dbutils

dbutils = get_dbutils(spark)
dbutils.fs.cp("file:source", "dbfs:destination")

I got this error: ModuleNotFoundError: No module named 'pyspark.dbutils'

Databricks SQL Server connection across multiple notebooks

Submitted by 不羁岁月 on 2020-06-17 09:45:14
Question: I found some resources on how to pass variables across PySpark Databricks notebooks. I'm curious whether we can pass a SQL Server connection, for example having the host/database/port/user/password in Notebook A and calling the connection from Notebook B. Answer 1: Take a look at this part of the Databricks documentation: https://docs.databricks.com/notebooks/notebook-workflows.html#pass-structured-data. This way you can pass strings, one or several, across notebooks, but you'll have to create the connection in Notebook B.
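
A minimal sketch of what that linked pattern can look like, assuming hypothetical notebook names, table, and JDBC parameters (dbutils and spark are provided by the Databricks runtime):

# Notebook A: serialise the connection details and pass them as a string argument.
import json

conn = {
    "host": "myserver.database.windows.net",   # hypothetical values
    "port": 1433,
    "database": "mydb",
    "user": "etl_user",
    "password": dbutils.secrets.get("my-scope", "sqlserver-pw"),
}
dbutils.notebook.run("Notebook_B", 600, {"conn": json.dumps(conn)})

# Notebook B: read the argument back and build the connection locally.
import json

conn = json.loads(dbutils.widgets.get("conn"))
df = (spark.read.format("jdbc")
      .option("url", f"jdbc:sqlserver://{conn['host']}:{conn['port']};database={conn['database']}")
      .option("user", conn["user"])
      .option("password", conn["password"])
      .option("dbtable", "dbo.some_table")     # hypothetical table
      .load())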

How to explode an array without duplicate records

Submitted by 為{幸葍}努か on 2020-06-17 09:38:06
Question: This is a continuation of the question here on PySpark SQL: Add different Qtr start_date, end_date for exploded rows. Thanks. I have the following dataframe, which has an array list as a column:

+---------------+------------+----------+----------+---+---------+-------+---------+
|customer_number|sales_target|start_date|end_date  |noq|cf_values|new_sdt|new_edate|
+---------------+------------+----------+----------+---+---------+-------+---------+
|A011021        |15          |2020-01-01|2020-12-31|4  |[4,4
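
The entry is cut off before any answer is recorded. A minimal sketch of one common way to avoid duplicate rows, assuming the array columns cf_values, new_sdt and new_edate have equal length: zip them into a single array of structs and explode once, instead of exploding each array separately (which multiplies the rows).

from pyspark.sql import functions as F

exploded = (df
    .withColumn("z", F.explode(F.arrays_zip("cf_values", "new_sdt", "new_edate")))
    .select("customer_number", "sales_target", "start_date", "end_date", "noq",
            F.col("z.cf_values").alias("cf_value"),
            F.col("z.new_sdt").alias("qtr_start"),
            F.col("z.new_edate").alias("qtr_end")))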

Spark Read Json: how to read field that alternates between integer and struct

Submitted by 試著忘記壹切 on 2020-06-16 17:27:37
Question: I am trying to read multiple JSON files into a dataframe. Both files have a "Value" node, but the type of this node alternates between integer and struct.

File 1: { "Value": 123 }

File 2: { "Value": { "Value": "On", "ValueType": "State", "IsSystemValue": true } }

My goal is to read the files into a dataframe like this:

|------|-------|-----------|---------------|
| File | Value | ValueType | IsSystemValue |
|------|-------|-----------
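
No answer is recorded for this entry. A minimal sketch of one workaround, assuming PySpark and hypothetical file paths: declare "Value" as a plain string so both shapes load without a schema conflict (Spark keeps the raw JSON text when an object is read into a StringType field), then parse the struct variant with from_json and coalesce the two cases.

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, BooleanType

raw_schema = StructType([StructField("Value", StringType())])
df = (spark.read.schema(raw_schema)
      .json(["file1.json", "file2.json"])          # hypothetical paths
      .withColumn("File", F.input_file_name()))

value_struct = StructType([
    StructField("Value", StringType()),
    StructField("ValueType", StringType()),
    StructField("IsSystemValue", BooleanType()),
])

parsed = (df
    .withColumn("nested", F.from_json("Value", value_struct))
    .select("File",
            F.coalesce(F.col("nested.Value"), F.col("Value")).alias("Value"),
            F.col("nested.ValueType").alias("ValueType"),
            F.col("nested.IsSystemValue").alias("IsSystemValue")))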

Does Apache Spark SQL support MERGE clause?

Submitted by 柔情痞子 on 2020-06-16 04:34:21
Question: Does Apache Spark SQL support a MERGE clause similar to Oracle's MERGE SQL clause?

MERGE INTO <table>
USING (SELECT * FROM <table1>)
WHEN MATCHED THEN UPDATE ... DELETE WHERE ...
WHEN NOT MATCHED THEN INSERT ...

Answer 1: It does, with Delta Lake as the storage format:

df.write.format("delta").save("/data/events")

DeltaTable.forPath(spark, "/data/events/")
  .as("events")
  .merge(
    updatesDF.as("updates"),
    "events.eventId = updates.eventId")
  .whenMatched
  .updateExpr(
    Map("data" -> "updates.data"))
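
For completeness, a sketch of the same merge through the Delta Lake Python API, assuming a Delta-enabled Spark session and the same hypothetical /data/events path and updatesDF:

from delta.tables import DeltaTable

events = DeltaTable.forPath(spark, "/data/events/")

(events.alias("events")
    .merge(updatesDF.alias("updates"), "events.eventId = updates.eventId")
    .whenMatchedUpdateExpr({"data": "updates.data"})   # update matched rows
    .whenNotMatchedInsertAll()                         # insert rows with no match
    .execute())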

Data Governance solution for Databricks, Synapse and ADLS gen2

Submitted by 送分小仙女□ on 2020-06-10 06:45:31
Question: I'm new to data governance, so forgive me if the question lacks some information. Objective: we're building a data lake & enterprise data warehouse from scratch for a mid-size telecom company on the Azure platform. We're using ADLS gen2, Databricks and Synapse for our ETL processing, data science, ML & QA activities. We already have about a hundred input tables and 25 TB/year, and we expect more in the future. The business requirements lean strongly towards cloud-agnostic solutions. Still, they are okay
