databricks

write a spark Dataset to json with all keys in the schema, including null columns

Submitted by 你说的曾经没有我的故事 on 2020-02-02 04:06:06
Question: I am writing a Dataset to JSON using: ds.coalesce(1).write.format("json").option("nullValue",null).save("project/src/test/resources") For records that have columns with null values, the JSON document does not write that key at all. Is there a way to force null-valued keys into the JSON output? This is needed because I use this JSON to read it back into another Dataset (in a test case) and cannot enforce a schema if some documents do not have all the keys in the case class (I am reading it by putting
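
For reference, a minimal PySpark sketch of two common workarounds; the writer option assumes Spark 3.0+, and the field names in the explicit schema are hypothetical, not the poster's case class:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Option 1 (Spark 3.0+): ask the JSON writer to keep null fields instead of dropping them.
df.coalesce(1).write.format("json") \
    .option("ignoreNullFields", "false") \
    .save("project/src/test/resources")

# Option 2: keep the default output but read it back with an explicit schema,
# so missing keys simply come back as nulls in the resulting DataFrame.
schema = StructType([
    StructField("id", IntegerType(), True),    # hypothetical field
    StructField("name", StringType(), True),   # hypothetical field
])
df_back = spark.read.schema(schema).json("project/src/test/resources")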

Possible to handle multi character delimiter in spark [duplicate]

Submitted by 有些话、适合烂在心里 on 2020-01-30 06:27:44
Question: This question already has answers here: Does spark-sql support multiple delimiters in the input data? (1 answer) How to split using multi-char separator with pipe? (1 answer) Closed 2 years ago. I have [~] as my delimiter for some CSV files I am reading: 1[~]a[~]b[~]dd[~][~]ww[~][~]4[~]4[~][~][~][~][~] I have tried this: val rddFile = sc.textFile("file.csv") val rddTransformed = rddFile.map(eachLine=>eachLine.split("[~]")) val df = rddTransformed.toDF() display(df) However, the issue with
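
A minimal PySpark sketch of the same idea (the question's code is Scala; the file path is the one from the question and the column names are illustrative):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read each line as a single string column, then split on the literal "[~]" token.
# F.split takes a regular expression, so the brackets are escaped.
lines = spark.read.text("file.csv")
parts = lines.select(F.split(F.col("value"), r"\[~\]").alias("fields"))

# Pull out individual fields (illustrative column names and positions).
df = parts.select(
    F.col("fields")[0].alias("c0"),
    F.col("fields")[1].alias("c1"),
    F.col("fields")[2].alias("c2"),
)

# On Spark 3.0+ the CSV reader is reported to accept a multi-character separator,
# which would avoid the manual split (verify against your Spark version):
# df = spark.read.option("sep", "[~]").csv("file.csv")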

How to install PYODBC in Databricks

Submitted by 限于喜欢 on 2020-01-28 12:31:31
Question: I have to install the pyodbc module in Databricks. I have tried using this command ( pip install pyodbc ) but it failed with the error below. Error message Answer 1: I had some problems a while back with connecting using pyodbc; details of my fix are here: https://datathirst.net/blog/2018/10/12/executing-sql-server-stored-procedures-on-databricks-pyspark I think the problem stems from PYTHONPATH on the Databricks clusters being set to the Python 2 install. I suspect the lines: %sh apt-get -y install
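
Once pyodbc and the system-level ODBC driver are in place on the cluster, a minimal Python check like the sketch below can confirm the install works end to end; the connection string values are hypothetical placeholders, not the original poster's settings:

import pyodbc  # the import fails until both the package and the unixODBC system libraries are installed

# List the ODBC drivers the cluster can see (an empty list means the system driver is still missing).
print(pyodbc.drivers())

# Hypothetical SQL Server connection string; server, database and credentials are placeholders.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;"
    "DATABASE=mydb;UID=myuser;PWD=mypassword"
)
print(conn.getinfo(pyodbc.SQL_DRIVER_NAME))
conn.close()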

Databricks Job timed out with error : Lost executor 0 on [IP]. Remote RPC client disassociated

Submitted by 戏子无情 on 2020-01-25 10:13:09
Question: Complete error: Databricks Job timed out with error: Lost executor 0 on [IP]. Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. We are running jobs using Jobs API 2.0 on an Azure Databricks subscription, using the Pools interface for shorter spawn time, with Standard_DS12_v2 as the worker/driver type. We have a job (JAR main) with just one SQL procedure call. This call takes more than 1.2 hours to complete.
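
One common experiment for long, mostly idle calls like this is to raise the executor heartbeat and network timeouts on the job cluster. Below is a minimal sketch (not the original poster's fix) of passing such settings as spark_conf in a Jobs API 2.0 cluster spec; the workspace URL, token, pool id, jar path, class name and timeout values are all hypothetical:

import requests

job_spec = {
    "name": "sql-procedure-call",
    "new_cluster": {
        "instance_pool_id": "<pool-id>",      # placeholder
        "num_workers": 1,
        "spark_conf": {
            # Longer timeouts so an executor that sits idle during the long
            # procedure call is not declared lost (values are illustrative).
            "spark.executor.heartbeatInterval": "60s",
            "spark.network.timeout": "600s",
        },
    },
    "libraries": [{"jar": "dbfs:/path/to/job.jar"}],              # placeholder path
    "spark_jar_task": {"main_class_name": "com.example.Main"},    # placeholder class
}

resp = requests.post(
    "https://<workspace-url>/api/2.0/jobs/create",   # placeholder workspace URL
    headers={"Authorization": "Bearer <token>"},     # placeholder token
    json=job_spec,
)
print(resp.json())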

Azure Data-bricks : How to read part files and save it as one file to blob?

Submitted by 左心房为你撑大大i on 2020-01-25 08:34:48
Question: I am using PySpark to write a DataFrame to a folder in blob storage, which gets saved as part files: df.write.format("json").save("/mnt/path/DataModel") Files are saved as: I am using the following code to merge them into one file: #Read part files path = glob.glob("/dbfs/mnt/path/DataModel/part-000*.json") #Move file to FinalData folder in blob for file in path: shutil.move(file,"/dbfs/mnt/path/FinalData/FinalData.json") But FinalData.json only has the last part file's data and not the data of all part
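
The loop above moves every part file onto the same destination name, so each move replaces the previous one and only the last part survives. Two minimal sketches using the paths from the question: either write a single part file and rename it, or concatenate the parts into one output file.

import glob
import shutil

# Sketch 1: have Spark produce a single part file, then rename it.
df.coalesce(1).write.format("json").mode("overwrite").save("/mnt/path/DataModel")
single_part = glob.glob("/dbfs/mnt/path/DataModel/part-000*.json")[0]
shutil.move(single_part, "/dbfs/mnt/path/FinalData/FinalData.json")

# Sketch 2: keep multiple part files and append them into one file
# (Spark writes JSON Lines, so simple concatenation preserves every record).
with open("/dbfs/mnt/path/FinalData/FinalData.json", "w") as merged:
    for part in sorted(glob.glob("/dbfs/mnt/path/DataModel/part-000*.json")):
        with open(part) as src:
            shutil.copyfileobj(src, merged)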

recursive cte in spark SQL

Submitted by 大憨熊 on 2020-01-25 05:22:24
Question: ; WITH Hierarchy as ( select distinct PersonnelNumber , Email , ManagerEmail from dimstage union all select e.PersonnelNumber , e.Email , e.ManagerEmail from dimstage e join Hierarchy as h on e.Email = h.ManagerEmail ) select * from Hierarchy Can you help achieve the same in Spark SQL? Answer 1: This is not possible using Spark SQL. The WITH clause exists, but not for CONNECT BY as in, say, Oracle, or recursion as in DB2. Source: https://stackoverflow.com/questions/52562607/recursive-cte-in-spark-sql
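
A common workaround, sketched below in PySpark, is to emulate the recursive CTE with an iterative self-join that repeats until no new rows are produced; it assumes a DataFrame named dimstage with the three columns from the query and is not a feature of Spark SQL itself:

from pyspark.sql import functions as F

cols = ["PersonnelNumber", "Email", "ManagerEmail"]

# Anchor member of the CTE: the distinct rows of dimstage.
hierarchy = dimstage.select(*cols).distinct()
frontier = hierarchy

# Recursive member: join dimstage against the rows found in the previous step,
# stopping once an iteration adds nothing new (a fixed point).
while True:
    step = (
        dimstage.alias("e")
        .join(frontier.alias("h"), F.col("e.Email") == F.col("h.ManagerEmail"))
        .select(*[F.col("e." + c) for c in cols])
    )
    new_rows = step.subtract(hierarchy)
    if new_rows.rdd.isEmpty():
        break
    hierarchy = hierarchy.union(new_rows)
    frontier = new_rows

hierarchy.show()

For deep hierarchies it may help to cache or checkpoint hierarchy between iterations, since the query plan grows with each union.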

NameError: name 'dbutils' is not defined in pyspark

Submitted by 时光毁灭记忆、已成空白 on 2020-01-24 10:48:47
Question: I am running a PySpark job in Databricks cloud. I need to write some CSV files to the Databricks filesystem (DBFS) as part of this job, and I also need to use some of the native dbutils commands, like: #mount azure blob to dbfs location dbutils.fs.mount(source="...",mount_point="/mnt/...",extra_configs="{key:value}") I am also trying to unmount once the files have been written to the mount directory. But when I use dbutils directly in the PySpark job, it fails with NameError: name
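
A minimal sketch of the usual workaround: in a Python job (unlike a notebook, where dbutils is injected as a global), construct the handle from pyspark.dbutils on the Databricks cluster. The mount arguments keep the question's placeholders rather than real values.

from pyspark.sql import SparkSession

def get_dbutils(spark):
    # On a Databricks cluster this module is available; the fallback covers
    # notebook contexts where dbutils already exists as an injected global.
    try:
        from pyspark.dbutils import DBUtils
        return DBUtils(spark)
    except ImportError:
        import IPython
        return IPython.get_ipython().user_ns["dbutils"]

spark = SparkSession.builder.getOrCreate()
dbutils = get_dbutils(spark)

# Mount, write, then unmount, mirroring the steps described in the question
# (source, mount point and extra configs are the question's placeholders).
dbutils.fs.mount(source="...", mount_point="/mnt/...", extra_configs={"key": "value"})
# ... write the CSV files under /mnt/... here ...
dbutils.fs.unmount("/mnt/...")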
