databricks

How to safely restart Airflow and kill a long-running task?

风流意气都作罢 submitted on 2021-01-07 06:20:15
Question: I have Airflow running in Kubernetes using the CeleryExecutor. Airflow submits and monitors Spark jobs using the DatabricksOperator. My streaming Spark jobs have a very long runtime (they run forever unless they fail or are cancelled). When Airflow worker pods are killed while a streaming job is running, the following happens:

- The associated task becomes a zombie (running state, but no process with a heartbeat)
- The task is marked as failed when Airflow reaps zombies
- The Spark streaming job continues
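
As the question describes, nothing cancels the Databricks run when the zombie task is reaped, so the streaming job keeps running. One possible workaround, sketched below as an assumption rather than an established pattern, is to record the Databricks run_id in XCom as soon as the run is submitted and cancel it from an on_failure_callback. Only DatabricksSubmitRunOperator, DatabricksHook, and its submit_run / get_run_state / cancel_run methods come from the Databricks provider (Airflow 2.x import paths; the 1.10 contrib module is analogous); the subclass, callback, connection id, and DAG wiring are illustrative, and whether failure callbacks fire for zombie-reaped tasks depends on your Airflow version.

    import time
    from datetime import datetime

    from airflow import DAG
    from airflow.exceptions import AirflowException
    from airflow.providers.databricks.hooks.databricks import DatabricksHook
    from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator


    class StreamingDatabricksSubmitRunOperator(DatabricksSubmitRunOperator):
        """Hypothetical variant that pushes the run_id to XCom *before* waiting,
        so a failure callback can still cancel the run if this worker dies."""

        def execute(self, context):
            hook = DatabricksHook(databricks_conn_id=self.databricks_conn_id)
            self.run_id = hook.submit_run(self.json)
            context["ti"].xcom_push(key="databricks_run_id", value=self.run_id)
            # Re-implements the operator's wait loop so the XCom push happens first.
            while True:
                state = hook.get_run_state(self.run_id)
                if state.is_terminal:
                    if not state.is_successful:
                        raise AirflowException(f"Databricks run ended in state {state}")
                    return
                time.sleep(30)


    def cancel_databricks_run(context):
        """on_failure_callback: cancel the streaming run when the task fails,
        including when the zombie reaper marks it failed after a worker pod dies."""
        run_id = context["ti"].xcom_pull(task_ids="submit_stream", key="databricks_run_id")
        if run_id:
            DatabricksHook(databricks_conn_id="databricks_default").cancel_run(run_id)


    with DAG(
        dag_id="databricks_streaming",        # illustrative DAG/task/connection names
        start_date=datetime(2021, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        submit_stream = StreamingDatabricksSubmitRunOperator(
            task_id="submit_stream",
            databricks_conn_id="databricks_default",
            json={"existing_cluster_id": "...", "spark_jar_task": {"main_class_name": "..."}},
            on_failure_callback=cancel_databricks_run,
        )

With this wiring, a safe restart amounts to letting the failed/zombie task's callback cancel the run and then clearing the task so it resubmits the streaming job.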

Converting XML string to Spark Dataframe in Databricks

主宰稳场 submitted on 2021-01-07 02:01:29
Question: How can I build a Spark DataFrame from a string that contains XML? I can easily do it if the XML is saved in a file:

    dfXml = (sqlContext.read.format("xml")
             .options(rowTag='my_row_tag')
             .load(xml_file_name))

However, as said, I have to build the DataFrame from a string that contains regular XML. Thank you, Mauro

Answer 1: In Scala, the XmlReader class can be used to convert an RDD[String] to a DataFrame:

    val result = new XmlReader().xmlRdd(spark, rdd)

If you have a DataFrame as input, it can be …
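
On the Python side, spark-xml's reader only loads paths, so one simple workaround is to persist the string to a scratch location and reuse the exact reader from the question. This is a sketch under the assumptions that the notebook runs on Databricks (so dbutils is available) and that the spark-xml library is attached to the cluster; xml_string, the sample rows, and the /tmp path are illustrative, while the rowTag name is taken from the question.

    # Persist the XML string, then read it back with spark-xml.
    xml_string = """<rows>
      <my_row_tag><id>1</id><name>foo</name></my_row_tag>
      <my_row_tag><id>2</id><name>bar</name></my_row_tag>
    </rows>"""

    tmp_path = "dbfs:/tmp/xml_string_input.xml"   # hypothetical scratch location
    dbutils.fs.put(tmp_path, xml_string, True)    # overwrite if it already exists

    dfXml = (spark.read.format("xml")
             .option("rowTag", "my_row_tag")
             .load(tmp_path))
    dfXml.show()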

scala api for delta lake optimize command

我只是一个虾纸丫 submitted on 2021-01-06 07:38:54
Question: The Databricks docs say that you can change the Z-ordering of a Delta table by doing:

    spark.read.table(connRandom)
      .write.format("delta").saveAsTable(connZorder)

    sql(s"OPTIMIZE $connZorder ZORDER BY (src_ip, src_port, dst_ip, dst_port)")

The problem with this is the switching between the Scala and SQL APIs, which is gross. What I want to be able to do is:

    spark.read.table(connRandom)
      .write.format("delta").saveAsTable(connZorder)
      .optimize.zorderBy("src_ip", "src_port", "dst_ip", "dst_port")

but I …
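
The excerpt is cut off, but the ask is clearly a single fluent call instead of dropping into SQL. Later Delta Lake releases (2.0 and up, and the Databricks runtimes that bundle them) expose an optimize() builder on DeltaTable with executeZOrderBy, in both Scala and Python; availability depends on your runtime, so treat the sketch below as a version assumption rather than something promised at the time the question was asked. It is shown in Python to match the other examples on this page, and the Scala chain on io.delta.tables.DeltaTable is analogous; connZorder and the column names come from the question.

    from delta.tables import DeltaTable

    connZorder = "my_db.conn_zorder"   # hypothetical table name, bound elsewhere in the question

    (DeltaTable.forName(spark, connZorder)
        .optimize()
        .executeZOrderBy("src_ip", "src_port", "dst_ip", "dst_port"))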

Reading csv files from microsoft Azure using R

こ雲淡風輕ζ submitted on 2020-12-31 11:48:09
Question: I have recently started working with Databricks and Azure, and I use Microsoft Azure Storage Explorer. I ran a JAR program on Databricks that writes many CSV files to Azure storage, visible in Storage Explorer, under the path ..../myfolder/subfolder/output/old/p/. The usual thing I do is go to the folder p, download all the CSV files to my local drive by right-clicking the p folder and clicking download, and then read those CSV files in R to do my analysis. My issue is that sometimes my runs can generate more than 10000 …
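
The question is truncated, but the pain point is the manual download of thousands of CSV files before reading them into R. One alternative, sketched below as an assumption about the setup rather than a prescription, is to read the whole folder directly from a Databricks notebook over the storage mount. The example is in Python for consistency with the rest of this page (SparkR or sparklyr can issue an equivalent read); the mount point and folder names are placeholders, since the real path is elided in the question.

    # Read every CSV under the output folder in one pass, assuming the storage
    # account is mounted (or reachable via a wasbs:// / abfss:// URI).
    csv_folder = "/mnt/mycontainer/myfolder/subfolder/output/old/p/"

    df = (spark.read
          .option("header", "true")        # adjust if the files have no header row
          .csv(csv_folder + "*.csv"))

    # Optionally consolidate to a single CSV for downstream analysis in R:
    (df.coalesce(1)
       .write.mode("overwrite")
       .option("header", "true")
       .csv("/mnt/mycontainer/myfolder/subfolder/output/combined/"))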