distributed

When do I use a consensus algorithm like Paxos vs. something like a vector clock?

这一生的挚爱 posted on 2020-04-08 19:01:32
Question: I've been reading a lot about different strategies to guarantee consistency between nodes in distributed systems, but I'm having a bit of trouble figuring out when to use which algorithm. With what kind of system would I use something like a vector clock? Which system is ideal for using something like Paxos? Are the two mutually exclusive? Answer 1: There's a distributed system of 2 nodes that store data. The data is replicated to both nodes so that if one node dies, the data is not lost
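For the two-node replicated store in the answer above, a vector clock can be sketched roughly as follows. This is a minimal illustration and not the answer's own code: the node names `node_a` and `node_b` and the helper names `tick`, `merge`, and `compare` are hypothetical. It only shows how concurrent writes are detected; it does not choose a winner, which is the kind of decision a consensus protocol such as Paxos is used for.

```python
from typing import Dict

VectorClock = Dict[str, int]


def tick(clock: VectorClock, node: str) -> VectorClock:
    """Return a copy of `clock` with `node`'s counter incremented (a local write)."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Element-wise maximum, applied when one replica receives another's clock."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def compare(a: VectorClock, b: VectorClock) -> str:
    """Classify `a` relative to `b`: 'before', 'after', 'equal', or 'concurrent'."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


# Both (hypothetical) nodes write to the same key without seeing each other.
start: VectorClock = {}
on_a = tick(start, "node_a")  # {"node_a": 1}
on_b = tick(start, "node_b")  # {"node_b": 1}
print(compare(on_a, on_b))    # neither version descends from the other

# node_a later receives node_b's version, merges the clocks, and writes again.
synced = tick(merge(on_a, on_b), "node_a")  # {"node_a": 2, "node_b": 1}
print(compare(on_b, synced))
```

When `compare` reports `concurrent`, the system knows there is a conflict but still has to resolve it, for example by keeping both versions for the application to reconcile.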

Finding out deployment machine you are on in code (Rails)

☆樱花仙子☆ posted on 2020-01-25 21:52:50
Question: My Rails app is deployed to several machines. I need each machine to run different cron jobs (it will be a disaster if they all run the job). How do I tell my script which machine it is currently on? I am using the whenever gem, and I am thinking of adding the condition in the schedule.rb Example: My deploy/production.rb role :memcache, "123.compute-1.amazonaws.com" role :web, "456.compute-1.amazonaws.com" role :db, "789.amazonaws.com" role :misc, "789.amazonaws.com" What I need to do: if

Distributed Tensorflow 1.0 Supervisor stuck if logdir is in HDFS

谁说胖子不能爱 posted on 2020-01-25 21:41:05
Question: I built the TF 1.0 binary on CentOS 8 for CPU. My distributed training code for MNIST data works fine if the Supervisor's logdir is on local disk. But if I change the Supervisor's logdir to HDFS, the code gets stuck at the Supervisor's initialization: sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0), logdir='hdfs://cdh-2:8020/tmp/example', global_step=global_step, init_op=init_op) I used gdb and found the C stack trace. It seems to have problems in _wrap_RecursivelyCreateDir() #0

Using Paxos to synchronize a large file across nodes

爱⌒轻易说出口 posted on 2020-01-24 17:29:09
Question: I'm trying to use Paxos to maintain consensus between nodes on a file that is around 50MB in size and is constantly being modified at individual nodes. I'm running into issues of practicality. Requirements:
- Sync a 50MB+ file across hundreds of nodes
- Have changes to this file, which can be made from any node and aren't likely to directly compete with each other, propagated across the network in a few seconds at most
- New nodes that join the network can within a few minutes (<1 hour) build up the

Managing a Large Number of Log Files Distributed Over Many Machines

前提是你 posted on 2020-01-22 20:51:50
Question: We have started using a third-party platform (GigaSpaces) that helps us with distributed computing. One of the major problems we are trying to solve now is how to manage our log files in this distributed environment. We have the following setup currently. Our platform is distributed over 8 machines. On each machine we have 12-15 processes that log to separate log files using java.util.logging. On top of this platform we have our own applications that use log4j and log to separate files. We

Managing a Large Number of Log Files Distributed Over Many Machines

a 夏天 posted on 2020-01-22 20:51:05
Question: We have started using a third-party platform (GigaSpaces) that helps us with distributed computing. One of the major problems we are trying to solve now is how to manage our log files in this distributed environment. We have the following setup currently. Our platform is distributed over 8 machines. On each machine we have 12-15 processes that log to separate log files using java.util.logging. On top of this platform we have our own applications that use log4j and log to separate files. We

Condor job using DAG with some jobs needing to run the same host

こ雲淡風輕ζ posted on 2020-01-15 04:24:06
Question: I have a computation task which is split into several individual program executions, with dependencies. I'm using Condor 7 as task scheduler (with the Vanilla Universe, due to constraints on the programs beyond my reach, so no checkpointing is involved), so DAG looks like a natural solution. However, some of the programs need to run on the same host. I could not find a reference on how to do this in the Condor manuals. Example DAG file:
JOB A A.condor
JOB B B.condor
JOB C C.condor
JOB D D.condor

Distributed primary key - UUID, simple auto increment or custom sequential values?

泪湿孤枕 posted on 2020-01-14 09:49:07
Question: I know this type of question has been asked before, but I could not find one that compared the options I have in mind. So I am going to post them here; please post links if there are duplicates. This has ended up a rather long post; if you have time, please read it through, as the question is at the end. EDIT2: I've accepted an answer as I think that will be the best solution for now. But I thought I would link to two other questions that answer my query about concatenating numbers. They can be

Distributed Erlang, networking, echo server

偶尔善良 posted on 2020-01-14 03:03:08
Question: So my year-long project involves doing some pretty nasty stuff with Erlang. I am just starting to get to grips with it. I have very limited knowledge of networking and need to ask how/why something doesn't work. I followed the code from this guide, http://20bits.com/article/network-programming-in-erlang, which provides Erlang code for a server that echoes back text that is fed into it. The instructions to run it are to start the Erlang code (he recommends port 8888) and then telnet to the localhost

What is the TrueTime API in Google's Spanner?

一曲冷凌霜 posted on 2020-01-13 10:04:16
Question: I tried to read the document multiple times but failed to understand it. Can someone explain it in layman's terms? Answer 1: TrueTime is an API available at Google that directly exposes clock uncertainty. Compared to standard datetime libraries, instead of a particular timestamp, TrueTime's now() function returns an interval of time [earliest, latest]. It also provides two functions: after(t) returns true if t has definitely passed, e.g. t < now().earliest. before(t) returns true if t has
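The interval idea in this answer can be illustrated with a toy model. This is only a sketch under stated assumptions, not Google's implementation: the class name `ToyTrueTime`, the fixed uncertainty `epsilon_s`, and the injectable `clock` are all hypothetical. The answer's text is cut off at `before(t)`, so the definition used here (true when `t` is later than `now().latest`, meaning `t` has definitely not arrived yet) is taken from the Spanner paper and not from the excerpt.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TTInterval:
    """Bounds on the current time: the true time lies in [earliest, latest]."""
    earliest: float
    latest: float


class ToyTrueTime:
    """Toy model of an interval clock; epsilon_s is an assumed uncertainty bound."""

    def __init__(self, epsilon_s: float = 0.005,
                 clock: Callable[[], float] = time.time) -> None:
        self.epsilon_s = epsilon_s
        self._clock = clock

    def now(self) -> TTInterval:
        # Widen the local reading by the assumed uncertainty on both sides.
        t = self._clock()
        return TTInterval(t - self.epsilon_s, t + self.epsilon_s)

    def after(self, t: float) -> bool:
        # True only if t has definitely passed.
        return t < self.now().earliest

    def before(self, t: float) -> bool:
        # True only if t has definitely not arrived (per the Spanner paper).
        return t > self.now().latest


# A fixed fake clock keeps the example deterministic.
tt = ToyTrueTime(epsilon_s=0.5, clock=lambda: 100.0)
print(tt.now())
print(tt.after(99.0), tt.after(100.0))
print(tt.before(100.0), tt.before(101.0))
```

Any timestamp inside the interval makes both `after` and `before` return false: the clock cannot say whether that moment has passed, and a caller that needs certainty has to wait until the interval moves beyond it.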