spark-checkpoint

Does the state also get removed on event timeout with Spark Structured Streaming?

Submitted by 試著忘記壹切 on 2021-02-05 09:26:35
Question: Does the state get timed out and removed at the same time, or does only the timeout fire while the state itself remains, for both ProcessingTimeTimeout and EventTimeTimeout? I was experimenting with mapGroupsWithState/flatMapGroupsWithState and got confused about state timeout. Suppose I maintain state with a watermark of 10 seconds and apply a timeout based on event time, say:

    ds.withWatermark("timestamp", "10 seconds")
      .groupByKey(...)
      .mapGroupsWithState(
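For what the API actually does here: when a timeout fires, Spark invokes the state function one more time with state.hasTimedOut == true, and the state is removed only if that invocation calls state.remove(). A minimal Scala sketch of an event-time timeout follows; the Event/SessionState/SessionOutput case classes and the rate-source stand-in are hypothetical scaffolding, not taken from the question.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

// hypothetical types for illustration
case class Event(key: String, timestamp: java.sql.Timestamp)
case class SessionState(count: Long)
case class SessionOutput(key: String, count: Long, expired: Boolean)

def updateState(key: String, events: Iterator[Event],
                state: GroupState[SessionState]): SessionOutput = {
  if (state.hasTimedOut) {
    // The timeout only *invokes* this function with hasTimedOut == true;
    // the state is dropped only because we explicitly call state.remove().
    val finalCount = state.get.count
    state.remove()
    SessionOutput(key, finalCount, expired = true)
  } else {
    val updated = SessionState(state.getOption.map(_.count).getOrElse(0L) + events.size)
    state.update(updated)
    // Expire this key once the event-time watermark passes this timestamp.
    state.setTimeoutTimestamp(state.getCurrentWatermarkMs() + 10000L)
    SessionOutput(key, updated.count, expired = false)
  }
}

val spark = SparkSession.builder().appName("timeout-demo").getOrCreate()
import spark.implicits._

// stand-in streaming source, just to make the sketch self-contained
val ds: Dataset[Event] = spark.readStream.format("rate").load()
  .selectExpr("CAST(value AS STRING) AS key", "timestamp").as[Event]

val sessions = ds
  .withWatermark("timestamp", "10 seconds")
  .groupByKey(_.key)
  .mapGroupsWithState(GroupStateTimeout.EventTimeTimeout)(updateState)
// when started, a mapGroupsWithState query must run in Update output mode
```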

Spark not able to find checkpointed data in HDFS after executor fails

Submitted by 泪湿孤枕 on 2020-05-28 03:29:15
Question: I am streaming data from Kafka as below:

    final JavaPairDStream<String, Row> transformedMessages = rtStream
        .mapToPair(record -> new Tuple2<String, GenericDataModel>(record.key(), record.value()))
        .mapWithState(StateSpec.function(updateDataFunc).numPartitions(32))
        .stateSnapshots()
        .foreachRDD(rdd -> {
            // logic goes here
        });

I have four worker threads and multiple executors for this application, and I am trying to test Spark's fault tolerance. Since we are using mapWithState, Spark is
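The part of mapWithState fault tolerance that is easy to miss: recovering state after a failure requires the StreamingContext itself to be rebuilt from the checkpoint via StreamingContext.getOrCreate; a freshly constructed context ignores the checkpointed data in HDFS. A minimal Scala sketch of that recovery pattern, with a hypothetical HDFS path:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///user/app/stream-checkpoints" // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("stateful-kafka-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  // mapWithState requires checkpointing; metadata and state snapshots land here
  ssc.checkpoint(checkpointDir)
  // ... build the Kafka DStream and the mapWithState pipeline here ...
  ssc
}

// On a clean start this calls createContext(); after a failure it rebuilds the
// context (including mapWithState state) from whatever is in checkpointDir.
val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
ssc.start()
ssc.awaitTermination()
```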

Iterative caching vs checkpointing in Spark

Submitted by 筅森魡賤 on 2019-12-11 04:38:13
Question: I have an iterative application running on Spark that I simplified to the following code:

    var anRDD: org.apache.spark.rdd.RDD[Int] = sc.parallelize((0 to 1000))
    var c: Long = Int.MaxValue
    var iteration: Int = 0
    while (c > 0) {
      iteration += 1
      // Manipulate the RDD and cache the new RDD
      anRDD = anRDD.zipWithIndex.filter(t => t._2 % 2 == 1).map(_._1).cache() //.localCheckpoint()
      // Actually compute the RDD and spawn a new job
      c = anRDD.count()
      println(s"Iteration: $iteration, Values: $c")
    }

What
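The crux of caching vs. checkpointing in a loop like this: cache() persists the data but never truncates the lineage, so the DAG (and task serialization cost) grows with every iteration; a reliable or local checkpoint cuts it. A runnable Scala sketch of one common remedy; the checkpoint directory and the every-10-iterations cadence are illustrative choices, not from the question:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("iterative-demo").setMaster("local[*]"))
sc.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical dir; use an HDFS path on a cluster

var anRDD = sc.parallelize(0 to 1000)
var c: Long = Int.MaxValue
var iteration = 0
while (c > 0) {
  iteration += 1
  anRDD = anRDD.zipWithIndex.filter(_._2 % 2 == 1).map(_._1).cache()
  if (iteration % 10 == 0) {
    // Periodically cut the lineage. The checkpoint is materialized by the next
    // action (the count below); because the RDD is cached, writing the
    // checkpoint does not recompute the whole chain from scratch.
    anRDD.checkpoint()
  }
  c = anRDD.count()
  println(s"Iteration: $iteration, Values: $c")
}
```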

reading from hive table and updating same table in pyspark - using checkpoint

Submitted by ≡放荡痞女 on 2019-12-01 00:24:22
Question: I am using Spark version 2.3 and trying to read a Hive table in Spark as:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import *
    df = spark.table("emp.emptable")

Here I am adding a new column, with the current system date, to the existing DataFrame:

    import pyspark.sql.functions as F
    newdf = df.withColumn('LOAD_DATE', F.current_date())

Now I am facing an issue when I try to write this DataFrame back as a Hive table:

    newdf.write.mode("overwrite").saveAsTable("emp.emptable")
    pyspark.sql.utils.AnalysisException: u'Cannot overwrite table emp.emptable that is also being read from;'

so I
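The usual way out of this AnalysisException is to break the plan's dependency on emp.emptable before overwriting it, e.g. by materializing the DataFrame with an eager checkpoint. A Scala sketch of the idea, with a hypothetical checkpoint directory (PySpark's DataFrame.checkpoint and SparkContext.setCheckpointDir work analogously):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.current_date

val spark = SparkSession.builder()
  .appName("hive-self-overwrite")
  .enableHiveSupport()
  .getOrCreate()

// checkpoint() needs a checkpoint directory; this path is hypothetical
spark.sparkContext.setCheckpointDir("hdfs:///tmp/emp-checkpoints")

val df = spark.table("emp.emptable")
val newdf = df.withColumn("LOAD_DATE", current_date())

// checkpoint() is eager by default: it computes the result and persists it to
// the checkpoint directory, so the resulting plan no longer reads
// emp.emptable and the overwrite is no longer a self-read-and-overwrite.
val materialized = newdf.checkpoint()
materialized.write.mode("overwrite").saveAsTable("emp.emptable")
```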