databricks

How to use Databricks Job Spark Configuration spark_conf?

半世苍凉 submitted on 2020-06-09 05:49:08
Question: I have a sample piece of Spark code where I am trying to read the table-name values from the Spark configuration supplied through the spark_conf option, using a Typesafe application.conf together with the Spark conf in the Databricks UI. The code I am using is below. When I hit the Run button in the Databricks UI, the job finishes successfully, but the println call prints dummyValue instead of ThisIsTableAOne, ThisIsTableBOne... I can see from the Spark UI that the configurations for the table names are being…
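
One common way to reach values passed through spark_conf is to read them from the running session's configuration rather than from a packaged application.conf. A minimal sketch, assuming the job's cluster spark_conf defines custom keys such as spark.tableA.name (the key names below are illustrative, not from the question):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Values set under spark_conf in the job/cluster definition are visible through spark.conf;
    # the second argument is the fallback used when the key was not set.
    table_a = spark.conf.get("spark.tableA.name", "dummyValue")
    table_b = spark.conf.get("spark.tableB.name", "dummyValue")
    print(table_a, table_b)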

Import a GitHub repo into Databricks community edition

故事扮演 submitted on 2020-06-01 05:38:45
Question: I am trying to import some data from a public GitHub repo so that I can use it from my Databricks notebooks. So far I have tried to connect my Databricks account with my GitHub account as described here, but without results, since GitHub support apparently requires a non-community license. I get the following message when I try to set the GitHub token that the GitHub integration requires: The same question has been asked before on the official Databricks forum. What is the best way to…
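
For a public repository, one workaround that does not need the GitHub integration is to fetch the raw files over HTTPS from inside a notebook and then load them with Spark. A minimal sketch, assuming a hypothetical repository and file path (the URL and DBFS locations below are placeholders, not from the question):

    import urllib.request

    # Raw URL of a file in a public GitHub repo (placeholder values).
    raw_url = "https://raw.githubusercontent.com/some-user/some-repo/master/data/sample.csv"

    # Download to the driver's local disk, then copy into DBFS so Spark can read it.
    local_path = "/tmp/sample.csv"
    urllib.request.urlretrieve(raw_url, local_path)
    dbutils.fs.cp(f"file:{local_path}", "dbfs:/tmp/sample.csv")

    df = spark.read.csv("dbfs:/tmp/sample.csv", header=True, inferSchema=True)
    df.show(5)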

How can I read an XML file in Azure Databricks Spark

一个人想着一个人 submitted on 2020-05-27 05:19:02
Question: I was looking for some info on the MSDN forums but couldn't find a good forum. While reading on the Spark site I got the hint that I would have better chances here. So, bottom line: I want to read from Blob storage where there is a continuous feed of XML files, all small files, and finally we store these files in an Azure DW. Using Azure Databricks I can use Spark and Python, but I can't find a way to 'read' the XML type. Some sample scripts used the xml.etree.ElementTree library, but I can't get it…
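
The usual approach on Databricks is the spark-xml data source, which must be attached to the cluster as a library (Maven coordinate com.databricks:spark-xml_2.11 for Spark 2.x). A minimal sketch, assuming that library is installed and the Blob container is already mounted; the mount path and rowTag value are placeholders:

    # Read a folder of small XML files with the spark-xml data source.
    # "rowTag" must name the XML element that represents one record (placeholder here).
    df = (spark.read
              .format("com.databricks.spark.xml")
              .option("rowTag", "record")
              .load("/mnt/blob-feed/xml/"))

    df.printSchema()
    df.show(5)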

Drop partition columns when writing parquet in pyspark

≯℡__Kan透↙ submitted on 2020-05-17 07:07:14
Question: I have a dataframe with a date column that I have parsed into year, month, and day columns. I want to partition on these columns, but I do not want the columns to persist in the parquet files. Here is my approach to partitioning and writing the data:

    df = df.withColumn('year', f.year(f.col('date_col'))).withColumn('month', f.month(f.col('date_col'))).withColumn('day', f.dayofmonth(f.col('date_col')))
    df.write.partitionBy('year', 'month', 'day').parquet('/mnt/test/test.parquet')

This properly creates…
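
For context, when partitionBy is used, Spark moves the partition columns out of the data files and into the directory names (e.g. year=2020/month=5/day=17), so the parquet files themselves should not contain those columns; they reappear only when the whole partitioned directory is read back via partition discovery. A minimal sketch of one way to verify this, reusing the path from the question (the partition values are placeholders):

    # Reading a single leaf directory returns only the non-partition columns,
    # because year/month/day live in the folder names, not in the files.
    leaf = spark.read.parquet('/mnt/test/test.parquet/year=2020/month=5/day=17')
    leaf.printSchema()

    # Reading the root re-derives year/month/day from the directory structure.
    full = spark.read.parquet('/mnt/test/test.parquet')
    full.printSchema()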

Databricks spark_jar_task failed when submitted via API

非 Y 不嫁゛ submitted on 2020-05-17 05:55:47
Question: I am using the API to submit a sample spark_jar_task. My sample spark_jar_task request to calculate Pi:

    "libraries": [ { "jar": "dbfs:/mnt/test-prd-foundational-projects1/spark-examples_2.11-2.4.5.jar" } ],
    "spark_jar_task": { "main_class_name": "org.apache.spark.examples.SparkPi" }

Databricks sysout logs, where it prints the Pi value as expected:

    ....
    (This session will block until Rserve is shut down)
    Spark package found in SPARK_HOME: /databricks/spark
    DATABRICKS_STDOUT_END-19fc0fbc-b643-4801-b87c…
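
For reference, a complete one-off submission of such a task goes through the Jobs runs-submit endpoint. A minimal sketch using Python's requests, reusing the jar and main class from the question; the workspace URL, token, and new_cluster settings are placeholders:

    import requests

    host = "https://<your-workspace>.azuredatabricks.net"   # placeholder
    token = "<personal-access-token>"                        # placeholder

    payload = {
        "run_name": "sparkpi-sample",
        "new_cluster": {                                     # cluster spec values are illustrative
            "spark_version": "6.4.x-scala2.11",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 1,
        },
        "libraries": [
            {"jar": "dbfs:/mnt/test-prd-foundational-projects1/spark-examples_2.11-2.4.5.jar"}
        ],
        "spark_jar_task": {
            "main_class_name": "org.apache.spark.examples.SparkPi",
            "parameters": ["10"],
        },
    }

    resp = requests.post(f"{host}/api/2.0/jobs/runs/submit",
                         headers={"Authorization": f"Bearer {token}"},
                         json=payload)
    print(resp.json())   # contains the run_id, which can be polled with /api/2.0/jobs/runs/get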

Convert any JSON, multiply nested structure into the KEY and VALUE fields

微笑、不失礼 submitted on 2020-05-17 04:15:14
Question: I was asked to build an ETL pipeline in Azure. This pipeline should:
- read the ORC file submitted by the vendor to ADLS
- parse the PARAMS field in the ORC structure, where a JSON structure is stored, and add it as two new fields (KEY, VALUE) to the output
- write the output to the Azure SQL database
The problem is that different types of records use different JSON structures. I do not want to write a custom expression for each class of JSON struct (there…
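
One generic way to handle arbitrary nesting is to flatten each PARAMS JSON document into (key-path, value) pairs with a small recursive UDF and then explode the result. A minimal sketch, assuming PARAMS is a string column; the input path and the source DataFrame below are placeholders for illustration:

    import json
    from pyspark.sql import functions as F, types as T

    def flatten_json(s):
        """Flatten an arbitrary JSON string into a list of (key, value) pairs."""
        def walk(node, path):
            if isinstance(node, dict):
                for k, v in node.items():
                    yield from walk(v, f"{path}.{k}" if path else k)
            elif isinstance(node, list):
                for i, v in enumerate(node):
                    yield from walk(v, f"{path}[{i}]")
            else:
                yield (path, None if node is None else str(node))
        try:
            return list(walk(json.loads(s), "")) if s else []
        except (ValueError, TypeError):
            return []

    pair = T.StructType([T.StructField("KEY", T.StringType()),
                         T.StructField("VALUE", T.StringType())])
    flatten_udf = F.udf(flatten_json, T.ArrayType(pair))

    # Placeholder input path; one output row per (KEY, VALUE) pair found in PARAMS.
    df = spark.read.orc("/mnt/adls/vendor/input.orc")
    result = (df.withColumn("kv", F.explode(flatten_udf(F.col("PARAMS"))))
                .withColumn("KEY", F.col("kv.KEY"))
                .withColumn("VALUE", F.col("kv.VALUE"))
                .drop("kv"))
    result.show(truncate=False)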
