databricks

How to safely restart Airflow and kill a long-running task?

风流意气都作罢 submitted on 2021-01-07 06:20:15
Question: I have Airflow running in Kubernetes using the CeleryExecutor. Airflow submits and monitors Spark jobs using the DatabricksOperator. My streaming Spark jobs have a very long runtime (they run forever unless they fail or are cancelled). When Airflow worker pods are killed while a streaming job is running, the following happens:

- The associated task becomes a zombie (running state, but no process with a heartbeat)
- The task is marked as failed when Airflow reaps zombies
- The Spark streaming job continues
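
As the question describes, nothing cancels the Databricks run when the zombie task is reaped, so the streaming job keeps running. One possible workaround, sketched below as an assumption rather than an established pattern, is to record the Databricks run_id in XCom as soon as the run is submitted and cancel it from an on_failure_callback. Only DatabricksSubmitRunOperator, DatabricksHook, and its submit_run / get_run_state / cancel_run methods come from the Databricks provider (Airflow 2.x import paths; the 1.10 contrib module is analogous); the subclass, callback, connection id, and DAG wiring are illustrative, and whether failure callbacks fire for zombie-reaped tasks depends on your Airflow version.

    import time
    from datetime import datetime

    from airflow import DAG
    from airflow.exceptions import AirflowException
    from airflow.providers.databricks.hooks.databricks import DatabricksHook
    from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator


    class StreamingDatabricksSubmitRunOperator(DatabricksSubmitRunOperator):
        """Hypothetical variant that pushes the run_id to XCom *before* waiting,
        so a failure callback can still cancel the run if this worker dies."""

        def execute(self, context):
            hook = DatabricksHook(databricks_conn_id=self.databricks_conn_id)
            self.run_id = hook.submit_run(self.json)
            context["ti"].xcom_push(key="databricks_run_id", value=self.run_id)
            # Re-implements the operator's wait loop so the XCom push happens first.
            while True:
                state = hook.get_run_state(self.run_id)
                if state.is_terminal:
                    if not state.is_successful:
                        raise AirflowException(f"Databricks run ended in state {state}")
                    return
                time.sleep(30)


    def cancel_databricks_run(context):
        """on_failure_callback: cancel the streaming run when the task fails,
        including when the zombie reaper marks it failed after a worker pod dies."""
        run_id = context["ti"].xcom_pull(task_ids="submit_stream", key="databricks_run_id")
        if run_id:
            DatabricksHook(databricks_conn_id="databricks_default").cancel_run(run_id)


    with DAG(
        dag_id="databricks_streaming",        # illustrative DAG/task/connection names
        start_date=datetime(2021, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        submit_stream = StreamingDatabricksSubmitRunOperator(
            task_id="submit_stream",
            databricks_conn_id="databricks_default",
            json={"existing_cluster_id": "...", "spark_jar_task": {"main_class_name": "..."}},
            on_failure_callback=cancel_databricks_run,
        )

With this wiring, a safe restart amounts to letting the failed/zombie task's callback cancel the run and then clearing the task so it resubmits the streaming job.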

Converting XML string to Spark Dataframe in Databricks

主宰稳场 submitted on 2021-01-07 02:01:29
Question: How can I build a Spark DataFrame from a string that contains XML? I can easily do it if the XML is saved in a file:

    dfXml = (sqlContext.read.format("xml")
             .options(rowTag='my_row_tag')
             .load(xml_file_name))

However, as said, I have to build the DataFrame from a string that contains regular XML. Thank you, Mauro

Answer 1: In Scala, the XmlReader class can be used to convert an RDD[String] to a DataFrame:

    val result = new XmlReader().xmlRdd(spark, rdd)

If you have a DataFrame as input, it can be …
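
On the Python side, spark-xml's reader only loads paths, so one simple workaround is to persist the string to a scratch location and reuse the exact reader from the question. This is a sketch under the assumptions that the notebook runs on Databricks (so dbutils is available) and that the spark-xml library is attached to the cluster; xml_string, the sample rows, and the /tmp path are illustrative, while the rowTag name is taken from the question.

    # Persist the XML string, then read it back with spark-xml.
    xml_string = """<rows>
      <my_row_tag><id>1</id><name>foo</name></my_row_tag>
      <my_row_tag><id>2</id><name>bar</name></my_row_tag>
    </rows>"""

    tmp_path = "dbfs:/tmp/xml_string_input.xml"   # hypothetical scratch location
    dbutils.fs.put(tmp_path, xml_string, True)    # overwrite if it already exists

    dfXml = (spark.read.format("xml")
             .option("rowTag", "my_row_tag")
             .load(tmp_path))
    dfXml.show()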

scala api for delta lake optimize command

我只是一个虾纸丫 submitted on 2021-01-06 07:38:54
Question: The Databricks docs say that you can change the Z-ordering of a Delta table by doing:

    spark.read.table(connRandom)
      .write.format("delta").saveAsTable(connZorder)

    sql(s"OPTIMIZE $connZorder ZORDER BY (src_ip, src_port, dst_ip, dst_port)")

The problem with this is the switching between the Scala and SQL APIs, which is gross. What I want to be able to do is:

    spark.read.table(connRandom)
      .write.format("delta").saveAsTable(connZorder)
      .optimize.zorderBy("src_ip", "src_port", "dst_ip", "dst_port")

but I …
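
The excerpt is cut off, but the ask is clearly a single fluent call instead of dropping into SQL. Later Delta Lake releases (2.0 and up, and the Databricks runtimes that bundle them) expose an optimize() builder on DeltaTable with executeZOrderBy, in both Scala and Python; availability depends on your runtime, so treat the sketch below as a version assumption rather than something promised at the time the question was asked. It is shown in Python to match the other examples on this page, and the Scala chain on io.delta.tables.DeltaTable is analogous; connZorder and the column names come from the question.

    from delta.tables import DeltaTable

    connZorder = "my_db.conn_zorder"   # hypothetical table name, bound elsewhere in the question

    (DeltaTable.forName(spark, connZorder)
        .optimize()
        .executeZOrderBy("src_ip", "src_port", "dst_ip", "dst_port"))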

Reading csv files from microsoft Azure using R

こ雲淡風輕ζ submitted on 2020-12-31 11:48:09
Question: I have recently started working with Databricks and Azure, and I use Microsoft Azure Storage Explorer. I ran a JAR program on Databricks that writes many CSV files to Azure storage, visible in Storage Explorer, under the path ..../myfolder/subfolder/output/old/p/. The usual thing I do is go to the folder p, download all the CSV files to my local drive by right-clicking the p folder and clicking download, and then read those CSV files in R to do my analysis. My issue is that sometimes my runs can generate more than 10000 …
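
The question is truncated, but the pain point is the manual download of thousands of CSV files before reading them into R. One alternative, sketched below as an assumption about the setup rather than a prescription, is to read the whole folder directly from a Databricks notebook over the storage mount. The example is in Python for consistency with the rest of this page (SparkR or sparklyr can issue an equivalent read); the mount point and folder names are placeholders, since the real path is elided in the question.

    # Read every CSV under the output folder in one pass, assuming the storage
    # account is mounted (or reachable via a wasbs:// / abfss:// URI).
    csv_folder = "/mnt/mycontainer/myfolder/subfolder/output/old/p/"

    df = (spark.read
          .option("header", "true")        # adjust if the files have no header row
          .csv(csv_folder + "*.csv"))

    # Optionally consolidate to a single CSV for downstream analysis in R:
    (df.coalesce(1)
       .write.mode("overwrite")
       .option("header", "true")
       .csv("/mnt/mycontainer/myfolder/subfolder/output/combined/"))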