distributed

How to keep data generated on a remote worker across iterations with in-graph replication in distributed TensorFlow?

Submitted by 若如初见. on 2019-12-11 06:44:45
Question: I use the in-graph replication of TensorFlow to do distributed training. To reduce communication cost, I need to hold some generated data (such as the cell states of an LSTM) on a remote worker from one training iteration to the next, but I found that I cannot achieve this. If I use the fetch option of the session.run interface to retrieve the data generated on a remote worker and then feed that data back to the same worker in the next training iteration, the unnecessary network costs…
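
A minimal sketch of one way to keep such state resident on the worker, assuming a TF 1.x graph-mode program and an already-configured cluster; the device string, shapes, and the tanh stand-in for the real LSTM step are placeholders, not the asker's code:

    import numpy as np
    import tensorflow as tf  # TF 1.x graph-mode API

    BATCH, UNITS = 32, 128

    # Keep the recurrent state on the remote worker as a non-trainable variable,
    # so it persists between session.run calls without being fetched or fed.
    with tf.device("/job:worker/task:0"):
        inputs = tf.placeholder(tf.float32, [BATCH, UNITS])
        state = tf.get_variable("cell_state", [BATCH, UNITS],
                                initializer=tf.zeros_initializer(),
                                trainable=False)
        new_state = tf.tanh(inputs + state)       # stand-in for the real LSTM step
        keep_state = tf.assign(state, new_state)  # update runs on the worker

    with tf.Session("grpc://worker0.example.com:2222") as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(10):
            # Run the update as an Operation (keep_state.op) so the new state
            # is not fetched back to the client.
            sess.run(keep_state.op,
                     feed_dict={inputs: np.zeros((BATCH, UNITS), np.float32)})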

How to deploy a distributed H2O Flow cluster with Docker?

Submitted by 南楼画角 on 2019-12-11 05:08:46
Question: I'm able to deploy an H2O cluster on EC2 instances by putting the private IPs in the flatfile. Doing the same with Docker works, but I can't figure out what to enter into the flatfile so the containers can form the cluster. The private IP the container is running on is not working. Answer 1: Can the containers ping each other's IPs? When launching H2O, are you forcing the interface to use the container IP? java -jar h2o.jar -flatfile flatfile -ip -port Are these Docker containers, when run, exposing port 54321…
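
For reference, the flatfile itself is just one ip:port entry per line; a sketch using assumed Docker bridge addresses (the 172.17.0.x values are placeholders for whatever IPs the containers are actually given):

    172.17.0.2:54321
    172.17.0.3:54321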

Does tf.train.SyncReplicasOptimizer apply the complete parameter update from aggregated gradients to the variable values multiple times?

Submitted by 我只是一个虾纸丫 on 2019-12-11 04:39:37
Question: In /model/inception/inception/inception_distributed_training.py, apply_gradients is called for each worker:

    apply_gradients_op = opt.apply_gradients(grads, global_step=global_step)

which goes into SyncReplicasOptimizer.py:

    285 # sync_op will be assigned to the same device as the global step.
    286 with ops.device(global_step.device), ops.name_scope(""):
    287     update_op = self._opt.apply_gradients(aggregated_grads_and_vars,
    288                                           global_step)
    289

Line 287 will be executed by each worker process at…
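
For context, a minimal sketch of the usual SyncReplicasOptimizer wiring in TF 1.x; the loss, learning rate, and replica counts here are made-up placeholders rather than the Inception code:

    import tensorflow as tf  # TF 1.x API

    NUM_WORKERS = 4
    is_chief = True  # normally derived from the task index

    # A made-up scalar loss so the sketch is self-contained.
    w = tf.get_variable("w", initializer=1.0)
    loss = tf.square(w)
    global_step = tf.train.get_or_create_global_step()

    base_opt = tf.train.GradientDescentOptimizer(0.1)
    opt = tf.train.SyncReplicasOptimizer(base_opt,
                                         replicas_to_aggregate=NUM_WORKERS,
                                         total_num_replicas=NUM_WORKERS)

    # Every worker builds this op; the actual variable update behind it is
    # gated by the gradient accumulators and token queue the optimizer creates.
    grads = opt.compute_gradients(loss)
    apply_gradients_op = opt.apply_gradients(grads, global_step=global_step)

    # The hook wires up the chief's queue runners and the sync token logic.
    sync_hook = opt.make_session_run_hook(is_chief)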

Simple Distributed Erlang

Submitted by 大城市里の小女人 on 2019-12-11 00:40:19
Question: I've got a simple module:

    -module(dist).
    -compile([add/3]).
    add(From,X,Y) -> From ! X+Y.

And I'm starting two nodes, one with erl -sname foo and another with erl -sname bar. On the bar node I'm doing:

    > c(dist).
    {ok,dist}
    > self().
    <0.37.0>
    > spawn('foo@unknown-00-23-6c-83-af-bd', dist, add, [self(), 3, 5]).

But the response I get is:

    Error in process <0.48.0> on node 'foo@unknown-00-23-6c-83-af-bd' with exit value: {undef,[{dist,add,[<8965.37.0>,3,5]}]}

What does this error mean? I wondered if…

Majordomo broker: handling a large number of connections

Submitted by 空扰寡人 on 2019-12-10 20:47:39
Question: I am using the Majordomo code found here (https://github.com/zeromq/majordomo) in the following manner: instead of using a single broker to process the requests and replies, I start two brokers such that one of them handles all the requests and the other handles all the replies. I did some testing to see how many connections the Majordomo broker can handle:

    requests per client    requests handled without packet loss
    1                      614   (614 clients)
    10                     6000  (600 clients)
    100                    35500 (355 clients)
    1000                   …

How to build a large distributed [sparse] matrix in Apache Spark 1.0?

Submitted by 大憨熊 on 2019-12-10 18:14:38
Question: I have an RDD like this:

    byUserHour: org.apache.spark.rdd.RDD[(String, String, Int)]

I would like to create a sparse matrix of the data for calculations like median, mean, etc. The RDD contains the row_id, column_id and value. I have two Arrays containing the row_id and column_id strings for lookup. Here is my attempt:

    import breeze.linalg._
    val builder = new CSCMatrix.Builder[Int](rows=BCnUsers.value.toInt, cols=broadcastTimes.value.size)
    byUserHour.foreach{x =>
      val row = userids.indexOf(x._1)…
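
One way to keep such a matrix distributed rather than building it locally is Spark's CoordinateMatrix; below is a sketch in PySpark (the Python wrappers for distributed matrices arrived in releases after Spark 1.0), with made-up sample triples standing in for byUserHour:

    from pyspark import SparkContext
    from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

    sc = SparkContext(appName="sparse-matrix-sketch")

    # Stand-in for byUserHour: (row_id, column_id, value) triples.
    triples = sc.parallelize([("alice", "2014-01-01T10", 3),
                              ("bob",   "2014-01-01T11", 7),
                              ("alice", "2014-01-01T11", 2)])

    # Assign dense integer indices to the string ids on the cluster,
    # instead of driver-side indexOf lookups inside foreach.
    row_index = triples.map(lambda t: t[0]).distinct().zipWithIndex().collectAsMap()
    col_index = triples.map(lambda t: t[1]).distinct().zipWithIndex().collectAsMap()
    rows_bc, cols_bc = sc.broadcast(row_index), sc.broadcast(col_index)

    entries = triples.map(lambda t: MatrixEntry(rows_bc.value[t[0]],
                                                cols_bc.value[t[1]],
                                                float(t[2])))
    mat = CoordinateMatrix(entries)   # distributed sparse matrix of the triples
    print(mat.numRows(), mat.numCols())

Note that a foreach over an RDD only mutates executor-side copies of a driver-side builder, which is one reason the distributed representation tends to be the safer starting point.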

distributed MAKE

Submitted by 試著忘記壹切 on 2019-12-10 16:06:07
Question: I had a make compilation that previously took around 1 hour to complete. I used the -j option and was able to reduce it to 40 minutes. What I observed is that CPU utilization was high, and my mentor suggested that I distribute the jobs across the different servers or machines available in our organization. I read about distcc, but it can be used for C code only, and we have a mix of C and Java code. Kindly suggest an appropriate tool to look at, and which is the easiest to install and deploy…

Architectural design for data consistency in a distributed analytics system

Submitted by 我的未来我决定 on 2019-12-10 14:55:04
Question: I am refactoring an analytics system that will do a lot of calculation, and I need some ideas on possible architectural designs for a data consistency issue I am facing. Current architecture: I have a queue-based system in which different requesting applications create messages that are eventually consumed by workers. Each "Requesting App" breaks down a large calculation into smaller pieces that are sent to the queue and processed by the workers. When all the pieces are finished, the…

How scalable are automatic secondary indexes in Cassandra 0.7?

Submitted by 徘徊边缘 on 2019-12-10 12:54:47
Question: As far as I understand, automatic secondary indexes are generated for node-local data. In that case, a query by secondary index involves all nodes that store part of the column family in order to get results(?), so (if I am right) if the data is spread across 50 nodes, then 50 nodes are involved in a single query? How far can this scale? Is this more scalable than manual secondary indexes (an inverted-index column family)? A few nodes, or a hundred nodes? Answer 1: See Stu's answer from the mailing list: http://www.mail-archive.com/user…

Multi-tier vs Distributed?

Submitted by 强颜欢笑 on 2019-12-10 11:02:29
Question: Do multi-tier and/or distributed apps mean the same thing? When we talk about layers in these apps, do we mean physical layers (database, browser, web server, ...) or logical layers (data access layer, business layer, ...)? Answer 1: Maybe these two sentences convey the distinction between distributed and multi-tier intuitively: Distributed: you replicate the processing among nodes. Multi-tier: you split the processing among tiers. In one case, the same processing is replicated over…