checkpoint

How to set .libPaths (checkpoint) on workers when running parallel computation in R

不羁岁月 submitted on 2019-11-28 06:33:31
Question: I use the checkpoint package for reproducible data analysis. Some of the computations take a long time, so I want to run them in parallel. When run in parallel, however, the checkpoint is not set on the workers, so I get the error message "there is no package called xy" (because the package is not installed in my default library directory). How can I make sure that each worker uses the package versions in the checkpoint folder? I tried to set .libPaths in the foreach code but this does not
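One hedged sketch of a possible approach, not necessarily the accepted answer: propagate the master's .libPaths() (which point into the checkpoint folder) to every worker before any package is loaded there, e.g. via parallel::clusterCall. The snapshot date and the data.table package below are illustrative assumptions.

    library(checkpoint)
    checkpoint("2018-01-01")            # assumed snapshot date; adjust to yours

    library(parallel)
    library(doParallel)
    library(foreach)

    cl <- makeCluster(2)
    # Push the master's library paths to every worker *before*
    # any package is loaded on them:
    clusterCall(cl, function(paths) .libPaths(paths), .libPaths())
    registerDoParallel(cl)

    res <- foreach(i = 1:2, .packages = "data.table") %dopar% {
      data.table::as.data.table(mtcars[i, ])   # resolved from the checkpoint library
    }
    stopCluster(cl)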

Flink Principles (5): Fault Tolerance

老子叫甜甜 submitted on 2019-11-28 01:57:11
This post was written after reading the official Flink documentation and the book "Flink基础教程", combined with my own understanding; if anything is put badly, feel free to point it out in the comments.

1. Introduction
Streaming computation is either stateful or stateless, where "state" means the intermediate values produced during the computation. A stateless computation observes each event independently and produces output based on the last event alone. What does that mean? A plain-language example: a streaming system receives a series of numbers and emits one as soon as it exceeds N; the values or running sum of the numbers seen before it do not matter at all, only the final number greater than N does. That is stateless computation. What, then, is stateful computation? Computing the sum or the average of all numbers seen in the past minute requires keeping intermediate results, and that is stateful computation. Once stateful computation enters a distributed system, consistency problems become unavoidable. Why? Because if a failure occurs mid-computation, what happens to the intermediate data? If it is not saved, the only option is to recompute from scratch, since otherwise the correctness of the result cannot be guaranteed. This is what it means for the system to be fault tolerant.

2. Consistency
Fault tolerance cannot be discussed without the notion of consistency. Consistency means: how correct is the result obtained after successfully handling a failure and recovering, compared with the result that would have been produced had no failure occurred at all? In plain words: does the failure affect the result? In stream processing, consistency comes in three levels [1]:
at-most-once: after a failure the computed results may be lost, i.e. correctness cannot be guaranteed;
at-least-once: the computed result may exceed the correct value, but it will never fall below it
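For reference, the mechanism behind these guarantees in Flink is periodic checkpointing, and enabling it takes only a few lines in the DataStream API. A minimal sketch against the Flink 1.x API (the interval and the job body are placeholders):

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointedJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(10_000L);  // take a checkpoint every 10 seconds (assumed interval)
            env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
            // ... define sources, transformations and sinks here ...
            env.execute("checkpointed job");
        }
    }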

Flink Topics (2): The Checkpoint and Savepoint Mechanisms

|▌冷眼眸甩不掉的悲伤 submitted on 2019-11-28 00:08:19
CheckPoint

1. Checkpoint retention policy
By default checkpoints are not retained: they are deleted when the job is cancelled. You can, however, configure periodic checkpoints to be retained; with that configuration they are not cleaned up automatically when the job fails or is cancelled.

Java:
    CheckpointConfig config = env.getCheckpointConfig();
    config.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

The ExternalizedCheckpointCleanup options are:
ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION: retain the checkpoint when the job is cancelled. Note that in this case you must clean up the checkpoint state manually after cancellation.
ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION: delete the checkpoint when the job is cancelled. The checkpoint state is only available if the job fails.

2. Checkpoint configuration
As with a savepoint, a checkpoint consists of a metadata file plus some data files. By default only the most recent checkpoint's data is kept; if you need to restore from checkpoint data
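Building on the snippet above, a hedged sketch of a fuller checkpoint configuration (all values are illustrative, not recommendations; the number of retained checkpoints is governed by the state.checkpoints.num-retained entry in flink-conf.yaml, and a retained checkpoint can be resumed with flink run -s <checkpointMetaDataPath>):

    import org.apache.flink.streaming.api.environment.CheckpointConfig;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.enableCheckpointing(60_000L);             // illustrative checkpoint interval
    CheckpointConfig config = env.getCheckpointConfig();
    config.setMinPauseBetweenCheckpoints(500L);   // breathing room between two checkpoints
    config.setCheckpointTimeout(60_000L);         // abort a checkpoint stuck for over 60 s
    config.setMaxConcurrentCheckpoints(1);        // at most one checkpoint in flight
    config.enableExternalizedCheckpoints(
        CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);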

Checkpoint and Savepoint in Flink

流过昼夜 submitted on 2019-11-28 00:08:00
General background:
https://info.lightbend.com/rs/558-NCX-702/images/preview-apache-flink.pdf
https://www.microsoft.com/en-us/research/uploads/prod/2016/12/Determining-Global-States-of-a-Distributed-System.pdf
https://arxiv.org/pdf/1506.08603.pdf

Savepoints:
https://data-artisans.com/blog/turning-back-time-savepoints

Checkpoints:
https://ci.apache.org/projects/flink/flink-docs-master/internals/stream_checkpointing.html
https://data-artisans.com/blog/end-to-end-exactly-once-processing-apache-flink-apache-kafka

Apache Flink fault tolerance source-code analysis:
https://yq.aliyun.com/articles/259147
https://yq.aliyun.com/articles/259146

Checkpoints in Spark Streaming

人走茶凉 submitted on 2019-11-27 10:14:04
I. Overview
A streaming application must run 24/7, so it must tolerate failures unrelated to the application logic (e.g. system failures or JVM crashes). To make this possible, Spark Streaming needs to save enough information to a fault-tolerant storage system that it can recover from failures.

There are two types of checkpoint.
1. Metadata checkpointing
Saves the information defining the streaming computation to fault-tolerant storage (such as HDFS). It is used to recover from a failure of the node on which the streaming application runs. The metadata includes:
1. Configuration: the configuration used to create the streaming application.
2. DStream operations: the set of DStream operations that define the streaming application.
3. Incomplete batches: batches that are queued but not yet finished.
2. Data checkpointing
Saves the generated RDDs to reliable storage. This is required by some stateful transformations that combine data across multiple batches. In such transformations the generated RDDs depend on RDDs from previous batches, so the length of the dependency chain keeps growing over time. To avoid this unbounded growth of recovery time (proportional to the dependency chain), the intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage to cut the chain.

In short, metadata checkpointing is mainly for recovering from node failures, while data/RDD checkpointing is necessary, even for basic functionality, whenever stateful transformations are used; a minimal driver sketch follows below.

II. When a checkpoint must be configured
1. Use of stateful transformations
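To make the recovery path concrete, here is a hedged driver sketch using the Java API's JavaStreamingContext.getOrCreate, which rebuilds the context from the checkpoint directory if one exists and otherwise creates it fresh (the HDFS path and batch interval are assumptions):

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class CheckpointedDriver {
        public static void main(String[] args) throws Exception {
            // Assumed checkpoint directory; any fault-tolerant FS (HDFS, S3, ...) works.
            final String checkpointDir = "hdfs:///tmp/streaming-checkpoint";
            JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(checkpointDir, () -> {
                SparkConf conf = new SparkConf().setAppName("CheckpointedDriver");
                JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));
                ssc.checkpoint(checkpointDir);  // enables metadata (and allows RDD) checkpointing
                // ... define the DStream operations here, *inside* the factory,
                // so they can be restored from checkpointed metadata ...
                return ssc;
            });
            jssc.start();
            jssc.awaitTermination();
        }
    }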

12c OGG: the OGG-00868 / ORA-01291 "missing logfile" error that burned us and our teammates

瘦欲@ submitted on 2019-11-27 09:58:42
A colleague performed a routine operation and stopped one OGG process; the database is 12c, and OGG is of course the 12c version as well. It was all a perfectly normal operation, yet it blew up on us: roughly four people spent close to 3 hours on it. Let's look at the detailed error below.

GGSCI (dwdb1) 1> info all

Program   Status    Group      Lag at Chkpt   Time Since Chkpt
MANAGER   RUNNING
EXTRACT   RUNNING   EK_ZW1     00:00:03       00:00:05
EXTRACT   STOPPED   EXT_KAF1   00:00:04       00:20:06
EXTRACT   RUNNING   PMP_KAF1   00:00:00       00:00:03
EXTRACT   RUNNING   PM_ZW1     00:00:00       00:00:08
REPLICAT  RUNNING   REP_EWM1   00:00:00       00:00:02
REPLICAT  RUNNING   REP_HX4    00:00:04       00:00:00

Running report on this process, the error we get is:

2019-08-13 21:24:07 ERROR OGG-00868 Error code 1291, error message: ORA-01291: missing logfile
(Missing Log File

Maximum number of WAL files in the pg_xlog directory (2)

荒凉一梦 submitted on 2019-11-27 05:07:31
Jeff Janes: Hi, As part of our monitoring work for our customers, we stumbled upon an issue with our customers' servers that have a wal_keep_segments setting higher than 0. We have a monitoring script that checks the number of WAL files in the pg_xlog directory, according to the settings of three parameters (checkpoint_completion_target, checkpoint_segments, and wal_keep_segments). We usually add a percentage to the usual formula:
greatest( (2 + checkpoint_completion_target) * checkpoint_segments + 1, checkpoint_segments + wal_keep_segments + 1 )
I think the first bug is even having this
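To make the formula concrete, a worked example under assumed settings (checkpoint_completion_target = 0.9, checkpoint_segments = 32, wal_keep_segments = 64, 16 MB per segment):

    greatest( (2 + 0.9) * 32 + 1 , 32 + 64 + 1 )
      = greatest( 93.8 , 97 )
      = 97 WAL segments  ≈  97 × 16 MB  ≈  1.5 GB in pg_xlog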

Maximum number of WAL files in the pg_xlog directory (1)

跟風遠走 submitted on 2019-11-27 05:07:30
Guillaume Lelarge: Hi, As part of our monitoring work for our customers, we stumbled upon an issue with our customers' servers that have a wal_keep_segments setting higher than 0. We have a monitoring script that checks the number of WAL files in the pg_xlog directory, according to the settings of three parameters (checkpoint_completion_target, checkpoint_segments, and wal_keep_segments). We usually add a percentage to the usual formula:
greatest( (2 + checkpoint_completion_target) * checkpoint_segments + 1, checkpoint_segments + wal_keep_segments + 1 )
And we have lots of alerts from the

PgSQL · Getting to the Root Cause · Unexpected Growth of WAL Space

泪湿孤枕 submitted on 2019-11-27 05:07:29
The problem appears
During a routine inspection of our production systems we found that the pg_xlog directory of one instance had grown to 4 GB, which was puzzling. Our first suspicion was that WAL archiving was too slow, so log files were piling up in pg_xlog without being removed. We therefore checked the files in the archive status directory, shown below, but found that the most recently written log files had all been archived successfully (i.e. each has a corresponding xxx.done file in pg_xlog/archive_status).

ls -lrt pg_xlog
...
-rw------- 1 xxxx xxxx 16777216 Jun 14 18:39 0000000100000035000000DE
-rw------- 1 xxxx xxxx 16777216 Jun 14 18:39 0000000100000035000000DF
drwx------ 2 xxxx xxxx    73728 Jun 14 18:39 archive_status
-rw------- 1 xxxx xxxx 16777216 Jun 14 18:39 0000000100000035000000E0

ls -lrt pg_xlog/archive_status
...
-rw------- 1 xxxx xxxx 0 Jun 14 18:39 0000000100000035000000DE.done
-rw------- 1 xxxx xxxx 0 Jun 14
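As a hedged cross-check that archiving really is keeping up, and assuming PostgreSQL 9.4 or later (where the pg_stat_archiver view exists; the version in this post is not stated):

    -- Counters of successful vs. failed archive attempts, and the last WAL handled
    SELECT archived_count, last_archived_wal,
           failed_count,   last_failed_wal
    FROM   pg_stat_archiver;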

Deleting backup_label on restore will corrupt your database!

孤街浪徒 submitted on 2019-11-27 05:06:59
The quick summary of this issue is that the backup_label file is an integral part of your database cluster binary backup, and removing it to allow the recovery to proceed without error is very likely to corrupt your database. Don't do that. Note that this post does not attempt to provide complete instructions for how to restore from a binary backup -- the documentation has all that, and it is of no benefit to duplicate it here; this is to warn people about a common error in the process that can corrupt databases when people try to take short-cuts rather than following the steps described in