What is a distributed system?
multiple cooperating computers
storage for big web sites, MapReduce, peer-to-peer sharing, &c
lots of critical infrastructure is distributed
Why do people build distributed systems?
to increase capacity via parallelism
to tolerate faults via replication
to place computing physically close to external entities
to achieve security via isolation
But:
many concurrent parts, complex interactions
must cope with partial failure
tricky to realize performance potential
MAIN TOPICS
This is a course about infrastructure for applications.
* Storage.
* Communication.
* Computation.
The big goal: abstractions that hide the complexity of distribution.
A couple of topics will come up repeatedly in our search.
Topic: implementation
RPC, threads, concurrency control.
The labs...
Topic: performance
The goal: scalable throughput
Nx servers -> Nx total throughput via parallel CPU, disk, net.
[diagram: users, application servers, storage servers]
So handling more load only requires buying more computers.
Rather than re-design by expensive programmers.
Effective when you can divide work w/o much interaction.
Scaling gets harder as N grows:
Load imbalance, stragglers, slowest-of-N latency.
Non-parallelizable code: initialization, interaction.
Bottlenecks from shared resources, e.g. network.
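One standard way to put a number on the cost of non-parallelizable code is Amdahl's law. The notes do not name it; it is brought in here purely as an illustration. The sketch assumes a hypothetical job in which a fixed fraction of the work (`serial`) cannot be spread over servers.

```go
package main

import "fmt"

// speedup returns the Amdahl's-law speedup on n servers when a fraction
// `serial` (between 0 and 1) of the work cannot be parallelized.
// The model and the 5% figure below are illustrative assumptions.
func speedup(serial float64, n int) float64 {
	return 1 / (serial + (1-serial)/float64(n))
}

func main() {
	for _, n := range []int{1, 10, 100, 1000} {
		fmt.Printf("N=%4d speedup=%.1f\n", n, speedup(0.05, n))
	}
}
```

With 5% serial work the speedup can never reach 20x, however many servers are added, which is why initialization and interaction costs matter more as N grows.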
Some performance problems aren't easily solved by scaling
e.g. quick response time for a single user request
e.g. all users want to update the same data
often requires better design rather than just more computers
Lab 4
Topic: fault tolerance
1000s of servers, big network -> always something broken
We'd like to hide these failures from the application.
We often want:
Availability -- app can make progress despite failures
Recoverability -- app will come back to life when failures are repaired
Big idea: replicated servers.
If one server crashes, can proceed using the other(s).
Labs 1, 2 and 3
Topic: consistency
General-purpose infrastructure needs well-defined behavior.
E.g. "Get(k) yields the value from the most recent Put(k,v)."
Achieving good behavior is hard!
"Replica" servers are hard to keep identical.
Clients may crash midway through multi-step update.
Servers may crash, e.g. after executing but before replying.
Network partition may make live servers look dead; risk of "split brain".
Consistency and performance are enemies.
Strong consistency requires communication,
e.g. Get() must check for a recent Put().
Many designs provide only weak consistency, to gain speed.
e.g. Get() does *not* yield the latest Put()!
Painful for application programmers but may be a good trade-off.
CASE STUDY: MapReduce
Let's talk about MapReduce (MR) as a case study
a good illustration of 6.824's main topics
hugely influential
the focus of Lab 1
MapReduce overview
context: multi-hour computations on multi-terabyte data-sets
e.g. build search index, or sort, or analyze structure of web
only practical with 1000s of computers
applications not written by distributed systems experts
overall goal: easy for non-specialist programmers
programmer just defines Map and Reduce functions
often fairly simple sequential code
MR takes care of, and hides, all aspects of distribution!
Abstract view of a MapReduce job
input is (already) split into M files
  Input1 -> Map -> a,1 b,1
  Input2 -> Map ->     b,1
  Input3 -> Map -> a,1     c,1
                    |   |   |
                    |   |   -> Reduce -> c,1
                    |   -----> Reduce -> b,2
                    ---------> Reduce -> a,2
MR calls Map() for each input file, produces set of k2,v2
"intermediate" data
each Map() call is a "task"
MR gathers all intermediate v2's for a given k2,
and passes each key + values to a Reduce call
final output is set of <k2,v3> pairs from Reduce()s
Example: word count
input is thousands of text files
  Map(k, v)
    split v into words
    for each word w
      emit(w, "1")
  Reduce(k, v)
    emit(len(v))
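The pseudocode above can be sketched in Go roughly as follows. The `KeyValue` type and the `Map`/`Reduce` signatures mirror the style used later in the lab text; `runSequential` is a hypothetical single-process driver added only so the example runs on its own, and it stands in for everything the real MR library does between Map and Reduce.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
	"unicode"
)

// KeyValue is one intermediate pair emitted by Map.
type KeyValue struct {
	Key   string
	Value string
}

// Map splits one input file's contents into words and emits (word, "1").
func Map(filename string, contents string) []KeyValue {
	isSep := func(r rune) bool { return !unicode.IsLetter(r) }
	var kva []KeyValue
	for _, w := range strings.FieldsFunc(contents, isSep) {
		kva = append(kva, KeyValue{Key: w, Value: "1"})
	}
	return kva
}

// Reduce receives every value emitted for one key and returns their count.
func Reduce(key string, values []string) string {
	return strconv.Itoa(len(values))
}

// runSequential (hypothetical) runs Map on each input, groups the
// intermediate values by key, and calls Reduce once per key.
func runSequential(inputs map[string]string) map[string]string {
	groups := map[string][]string{}
	for name, contents := range inputs {
		for _, kv := range Map(name, contents) {
			groups[kv.Key] = append(groups[kv.Key], kv.Value)
		}
	}
	out := map[string]string{}
	for k, vs := range groups {
		out[k] = Reduce(k, vs)
	}
	return out
}

func main() {
	out := runSequential(map[string]string{
		"Input1": "a b",
		"Input2": "b",
		"Input3": "a c",
	})
	keys := make([]string, 0, len(out))
	for k := range out {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%v %v\n", k, out[k]) // a 2, b 2, c 1
	}
}
```

The three tiny inputs are the ones from the abstract-view diagram, so the counts match the a,2 / b,2 / c,1 shown there.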
MapReduce scales well:
N "worker" computers get you Nx throughput.
Map()s can run in parallel, since they don't interact.
Same for Reduce()s.
So you can get more throughput by buying more computers.
MapReduce hides many details:
sending app code to servers
tracking which tasks are done
moving data from Maps to Reduces
balancing load over servers
recovering from failures
However, MapReduce limits what apps can do:
No interaction or state (other than via intermediate output).
No iteration, no multi-stage pipelines.
No real-time or streaming processing.
Input and output are stored on the GFS cluster file system
MR needs huge parallel input and output throughput.
GFS splits files over many servers, in 64 MB chunks
Maps read in parallel
Reduces write in parallel
GFS also replicates each file on 2 or 3 servers
Having GFS is a big win for MapReduce
What will likely limit the performance?
We care since that's the thing to optimize.
CPU? memory? disk? network?
In 2004 authors were limited by network capacity.
What does MR send over the network?
Maps read input from GFS.
Reduces read Map output.
Can be as large as input, e.g. for sorting.
Reduces write output files to GFS.
[diagram: servers, tree of network switches]
In MR's all-to-all shuffle, half of traffic goes through root switch.
Paper's root switch: 100 to 200 gigabits/second, total
1800 machines, so 55 megabits/second/machine.
55 is small, e.g. much less than disk or RAM speed.
Today: networks and root switches are much faster relative to CPU/disk.
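The per-machine figure above is just the root switch capacity divided evenly over the machines. A small sketch of that arithmetic, using the notes' numbers (the function name is made up for illustration):

```go
package main

import "fmt"

// perMachineMbps divides a root switch's total capacity (gigabits/second)
// evenly among the machines and returns megabits/second per machine.
func perMachineMbps(rootGbps float64, machines int) float64 {
	return rootGbps * 1000 / float64(machines)
}

func main() {
	// Low and high ends of the 100 to 200 gigabit/second range.
	fmt.Printf("%.1f Mbit/s per machine\n", perMachineMbps(100, 1800))
	fmt.Printf("%.1f Mbit/s per machine\n", perMachineMbps(200, 1800))
}
```

The quoted 55 megabits/second corresponds to the low end of the range; even the high end is only about twice that.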
Some details (paper's Figure 1):
one master, that hands out tasks to workers and remembers progress.
1. master gives Map tasks to workers until all Maps complete
Maps write output (intermediate data) to local disk
Maps split output, by hash, into one file per Reduce task
2. after all Maps have finished, master hands out Reduce tasks
each Reduce fetches its intermediate output from (all) Map workers
each Reduce task writes a separate output file on GFS
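The "split output, by hash, into one file per Reduce task" step can be sketched as below. The `ihash` body is an assumption (an FNV-1a hash masked to be non-negative) and may differ from the helper the lab provides; the buckets here are in-memory slices rather than files, to keep the example short.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// KeyValue is one intermediate pair emitted by Map.
type KeyValue struct {
	Key   string
	Value string
}

// ihash maps a key to a non-negative int. Assumed implementation.
func ihash(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() & 0x7fffffff)
}

// partition splits one Map task's output into nReduce buckets, so that
// every pair with the same key lands in the same bucket.
func partition(kva []KeyValue, nReduce int) [][]KeyValue {
	buckets := make([][]KeyValue, nReduce)
	for _, kv := range kva {
		r := ihash(kv.Key) % nReduce
		buckets[r] = append(buckets[r], kv)
	}
	return buckets
}

func main() {
	kva := []KeyValue{{"a", "1"}, {"b", "1"}, {"a", "1"}, {"c", "1"}}
	for r, b := range partition(kva, 3) {
		fmt.Printf("reduce task %d gets %d pairs\n", r, len(b))
	}
}
```

Because the bucket depends only on the key, every Map task sends a given key to the same Reduce task, which is what lets each Reduce see all values for its keys.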
How does MR minimize network use?
Master tries to run each Map task on GFS server that stores its input.
All computers run both GFS and MR workers
So input is read from local disk (via GFS), not over network.
Intermediate data goes over network just once.
Map worker writes to local disk.
Reduce workers read directly from Map workers, not via GFS.
Intermediate data partitioned into files holding many keys.
R is much smaller than the number of keys.
Big network transfers are more efficient.
How does MR get good load balance?
Wasteful and slow if N-1 servers have to wait for 1 slow server to finish.
But some tasks likely take longer than others.
Solution: many more tasks than workers.
Master hands out new tasks to workers who finish previous tasks.
So no task is so big it dominates completion time (hopefully).
So faster servers do more tasks than slower ones, finish about the same time.
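A minimal sketch of the "many more tasks than workers" idea, with goroutines standing in for worker machines and an unbuffered channel standing in for the master. The per-task times are invented for the demonstration.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runTasks hands nTasks tasks to workers as each one becomes free.
// msPerTask[i] is the (hypothetical) time worker i needs per task.
// It returns how many tasks each worker completed.
func runTasks(nTasks int, msPerTask []int) []int {
	tasks := make(chan int) // unbuffered: a task goes to whoever asks next
	done := make([]int, len(msPerTask))
	var wg sync.WaitGroup
	for w, ms := range msPerTask {
		wg.Add(1)
		go func(w, ms int) {
			defer wg.Done()
			for range tasks {
				time.Sleep(time.Duration(ms) * time.Millisecond)
				done[w]++ // only worker w writes done[w]
			}
		}(w, ms)
	}
	for t := 0; t < nTasks; t++ {
		tasks <- t
	}
	close(tasks)
	wg.Wait()
	return done
}

func main() {
	// One fast worker and one worker that is ten times slower.
	done := runTasks(40, []int{1, 10})
	fmt.Println("tasks per worker:", done)
}
```

With small tasks the slow worker simply ends up with fewer of them; if the 40 tasks were instead split evenly in advance, the fast worker would sit idle while the slow one finished its half.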
What about fault tolerance?
I.e. what if a worker crashes during a MR job?
We want to completely hide failures from the application programmer!
Does MR have to re-run the whole job from the beginning?
Why not?
MR re-runs just the failed Map()s and Reduce()s.
Suppose MR runs a Map twice, one Reduce sees first run's output,
another Reduce sees the second run's output?
Correctness requires re-execution to yield exactly the same output.
So Map and Reduce must be pure deterministic functions:
they are only allowed to look at their arguments.
no state, no file I/O, no interaction, no external communication.
What if you wanted to allow non-functional Map or Reduce?
Worker failure would require whole job to be re-executed,
or you'd need to create synchronized global checkpoints.
Details of worker crash recovery:
* Map worker crashes:
master notices worker no longer responds to pings
master knows which Map tasks it ran on that worker
those tasks' intermediate output is now lost, must be re-created
master tells other workers to run those tasks
can omit re-running if every Reduce has already fetched the intermediate data
* Reduce worker crashes.
finished tasks are OK -- stored in GFS, with replicas.
master re-starts worker's unfinished tasks on other workers.
Other failures/problems:
* What if the master gives two workers the same Map() task?
perhaps the master incorrectly thinks one worker died.
it will tell Reduce workers about only one of them.
* What if the master gives two workers the same Reduce() task?
they will both try to write the same output file on GFS!
atomic GFS rename prevents mixing; one complete file will be visible.
* What if a single worker is very slow -- a "straggler"?
perhaps due to flaky hardware.
master starts a second copy of last few tasks.
* What if a worker computes incorrect output, due to broken h/w or s/w?
too bad! MR assumes "fail-stop" CPUs and software.
* What if the master crashes?
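The atomic-rename defense against duplicate Reduce tasks, mentioned above, can be imitated on a local file system. This is a sketch under the assumption that `os.Rename` within one directory is atomic (true on POSIX systems); it does not use GFS, and the helper names are made up.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeAtomically writes data to a temporary file in the same directory
// as name, then renames it into place, so readers see either no file or
// one complete file, never a partial mix of two writers.
func writeAtomically(name string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(name), "mr-tmp-*")
	if err != nil {
		return err
	}
	// Cleans up on failure; after a successful rename nothing is left.
	defer os.Remove(tmp.Name())
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), name)
}

// demoDuplicateWriters simulates two workers finishing the same Reduce
// task and returns the final contents of the output file.
func demoDuplicateWriters() (string, error) {
	dir, err := os.MkdirTemp("", "mr-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	out := filepath.Join(dir, "mr-out-0")
	for i := 0; i < 2; i++ { // both "workers" produce identical output
		if err := writeAtomically(out, []byte("a 2\n")); err != nil {
			return "", err
		}
	}
	b, err := os.ReadFile(out)
	return string(b), err
}

func main() {
	s, err := demoDuplicateWriters()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(s)
}
```

Since Reduce is deterministic, it does not matter which writer's rename lands last: the visible file is complete either way.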
Current status?
Hugely influential (Hadoop, Spark, &c).
Probably no longer in use at Google.
Replaced by Flume / FlumeJava (see paper by Chambers et al).
GFS replaced by Colossus (no good description), and BigTable.
Conclusion
MapReduce single-handedly made big cluster computation popular.
- Not the most efficient or flexible.
+ Scales well.
+ Easy to program -- failures and data movement are hidden.
These were good trade-offs in practice.
We'll see some more advanced successors later in the course.
Have fun with the lab!
Your Job
Your job is to implement a distributed MapReduce, consisting of two programs, the master and the worker. There will be just one master process, and one or more worker processes executing in parallel. In a real system the workers would run on a bunch of different machines, but for this lab you'll run them all on a single machine. The workers will talk to the master via RPC. Each worker process will ask the master for a task, read the task's input from one or more files, execute the task, and write the task's output to one or more files. The master should notice if a worker hasn't completed its task in a reasonable amount of time (for this lab, use ten seconds), and give the same task to a different worker.
We have given you a little code to start you off. The "main" routines for the master and worker are in main/mrmaster.go and main/mrworker.go; don't change these files. You should put your implementation in mr/master.go, mr/worker.go, and mr/rpc.go.
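To make the "worker asks the master for a task via RPC" interaction concrete, here is a self-contained sketch using Go's net/rpc. The type and method names (`TaskArgs`, `TaskReply`, `GetTask`) are hypothetical, not the ones in the provided skeleton, and the example uses an in-memory `net.Pipe` connection rather than whatever transport the lab code sets up.

```go
package main

import (
	"fmt"
	"net"
	"net/rpc"
	"sync"
)

// TaskArgs and TaskReply are hypothetical RPC message types.
type TaskArgs struct {
	WorkerID int
}

type TaskReply struct {
	Kind string // "map" or "exit"
	File string
}

// Master hands out input files one at a time.
type Master struct {
	mu    sync.Mutex // RPC handlers run concurrently
	files []string
}

// GetTask gives the caller the next unassigned file, or tells it to exit.
func (m *Master) GetTask(args *TaskArgs, reply *TaskReply) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if len(m.files) == 0 {
		reply.Kind = "exit"
		return nil
	}
	reply.Kind, reply.File = "map", m.files[0]
	m.files = m.files[1:]
	return nil
}

// askAll plays the worker: it keeps requesting tasks until told to exit
// and returns the files it was given, in order.
func askAll(files []string) ([]string, error) {
	srv := rpc.NewServer()
	if err := srv.Register(&Master{files: files}); err != nil {
		return nil, err
	}
	serverEnd, clientEnd := net.Pipe()
	go srv.ServeConn(serverEnd)
	client := rpc.NewClient(clientEnd)
	defer client.Close()

	var got []string
	for {
		var reply TaskReply // fresh reply for every call
		err := client.Call("Master.GetTask", &TaskArgs{WorkerID: 1}, &reply)
		if err != nil {
			return nil, err
		}
		if reply.Kind == "exit" {
			return got, nil
		}
		got = append(got, reply.File)
	}
}

func main() {
	files, err := askAll([]string{"pg-a.txt", "pg-b.txt"})
	if err != nil {
		fmt.Println("rpc failed:", err)
		return
	}
	fmt.Println(files)
}
```

A real worker would run the Map function on each file before asking again; here it only records the file name.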
Here's how to run your code on the word-count MapReduce application. First, make sure the word-count plugin is freshly built:
$ go build -buildmode=plugin ../mrapps/wc.go
In the main directory, run the master.
$ rm mr-out*
$ go run mrmaster.go pg-*.txt
The pg-*.txt arguments to mrmaster.go are the input files; each file corresponds to one "split", and is the input to one Map task.
In one or more other windows, run some workers:
$ go run mrworker.go wc.so
When the workers and master have finished, look at the output in mr-out-*. When you've completed the lab, the sorted union of the output files should match the sequential output, like this:
$ cat mr-out-* | sort | more
A 509
ABOUT 2
ACT 8
...
We supply you with a test script in main/test-mr.sh. The tests check that the wc and indexer MapReduce applications produce the correct output when given the pg-xxx.txt files as input. The tests also check that your implementation runs the Map and Reduce tasks in parallel, and that your implementation recovers from workers that crash while running tasks.
If you run the test script now, it will hang because the master never finishes:
$ cd ~/6.824/src/main
$ sh test-mr.sh
*** Starting wc test.
You can change ret := false to true in the Done function in mr/master.go so that the master exits immediately. Then:
$ sh ./test-mr.sh
*** Starting wc test.
sort: No such file or directory
cmp: EOF on mr-wc-all
--- wc output is not the same as mr-correct-wc.txt
--- wc test: FAIL
$
The test script expects to see output in files named mr-out-X, one for each reduce task. The empty implementations of mr/master.go and mr/worker.go don't produce those files (or do much of anything else), so the test fails.
When you've finished, the test script output should look like this:
$ sh ./test-mr.sh
*** Starting wc test.
--- wc test: PASS
*** Starting indexer test.
--- indexer test: PASS
*** Starting map parallelism test.
--- map parallelism test: PASS
*** Starting reduce parallelism test.
--- reduce parallelism test: PASS
*** Starting crash test.
--- crash test: PASS
*** PASSED ALL TESTS
$
You'll also see some errors from the Go RPC package that look like
2019/12/16 13:27:09 rpc.Register: method "Done" has 1 input parameters; needs exactly three
Ignore these messages.
- The map phase should divide the intermediate keys into buckets for nReduce reduce tasks, where nReduce is the argument that main/mrmaster.go passes to MakeMaster().
- The worker implementation should put the output of the X'th reduce task in the file mr-out-X.
- A mr-out-X file should contain one line per Reduce function output. The line should be generated with the Go "%v %v" format, called with the key and value. Have a look in main/mrsequential.go for the line commented "this is the correct format". The test script will fail if your implementation deviates too much from this format.
- You can modify mr/worker.go, mr/master.go, and mr/rpc.go. You can temporarily modify other files for testing, but make sure your code works with the original versions; we'll test with the original versions.
- The worker should put intermediate Map output in files in the current directory, where your worker can later read them as input to Reduce tasks.
- main/mrmaster.go expects mr/master.go to implement a Done() method that returns true when the MapReduce job is completely finished; at that point, mrmaster.go will exit.
- When the job is completely finished, the worker processes should exit. A simple way to implement this is to use the return value from call(): if the worker fails to contact the master, it can assume that the master has exited because the job is done, and so the worker can terminate too. Depending on your design, you might also find it helpful to have a "please exit" pseudo-task that the master can give to workers.
Hints
- One way to get started is to modify mr/worker.go's Worker() to send an RPC to the master asking for a task. Then modify the master to respond with the file name of an as-yet-unstarted map task. Then modify the worker to read that file and call the application Map function, as in mrsequential.go.
- The application Map and Reduce functions are loaded at run-time using the Go plugin package, from files whose names end in .so.
- If you change anything in the mr/ directory, you will probably have to re-build any MapReduce plugins you use, with something like go build -buildmode=plugin ../mrapps/wc.go
- This lab relies on the workers sharing a file system. That's straightforward when all workers run on the same machine, but would require a global filesystem like GFS if the workers ran on different machines.
- A reasonable naming convention for intermediate files is mr-X-Y, where X is the Map task number, and Y is the reduce task number.
- The worker's map task code will need a way to store intermediate key/value pairs in files in a way that can be correctly read back during reduce tasks. One possibility is to use Go's encoding/json package. To write key/value pairs to a JSON file:
  enc := json.NewEncoder(file)
  for _, kv := ... {
    if err := enc.Encode(&kv); err != nil {
      log.Fatalf("cannot encode %v: %v", kv, err)
    }
  }
and to read such a file back:
  dec := json.NewDecoder(file)
  for {
    var kv KeyValue
    if err := dec.Decode(&kv); err != nil {
      break
    }
    kva = append(kva, kv)
  }
- The map part of your worker can use the ihash(key) function (in worker.go) to pick the reduce task for a given key.
- You can steal some code from mrsequential.go for reading Map input files, for sorting intermediate key/value pairs between the Map and Reduce, and for storing Reduce output in files.
- The master, as an RPC server, will be concurrent; don't forget to lock shared data.
- Use Go's race detector, with go build -race and go run -race. test-mr.sh has a comment that shows you how to enable the race detector for the tests.
- Workers will sometimes need to wait, e.g. reduces can't start until the last map has finished. One possibility is for workers to periodically ask the master for work, sleeping with time.Sleep() between each request. Another possibility is for the relevant RPC handler in the master to have a loop that waits, either with time.Sleep() or sync.Cond. Go runs the handler for each RPC in its own thread, so the fact that one handler is waiting won't prevent the master from processing other RPCs.
- The master can't reliably distinguish between crashed workers, workers that are alive but have stalled for some reason, and workers that are executing but too slowly to be useful. The best you can do is have the master wait for some amount of time, and then give up and re-issue the task to a different worker. For this lab, have the master wait for ten seconds; after that the master should assume the worker has died (of course, it might not have).
- test-mr.sh runs all the processes in the sub-directory mr-tmp, so if something goes wrong and you want to look at intermediate or output files, look there.
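The locking and ten-second re-issue hints above can be combined into one small bookkeeping structure. This is only one possible design, not the lab's reference solution; `Tracker` and its methods are hypothetical names, and the current time is passed in as a parameter so the timeout can be exercised without actually waiting.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type taskState int

const (
	idle taskState = iota
	inProgress
	completed
)

type task struct {
	state   taskState
	started time.Time
}

// Tracker records the state of each task; the mutex protects it from
// concurrent RPC handlers.
type Tracker struct {
	mu      sync.Mutex
	tasks   []task
	timeout time.Duration
}

func NewTracker(n int, timeout time.Duration) *Tracker {
	return &Tracker{tasks: make([]task, n), timeout: timeout}
}

// Assign returns a task index to run, or -1 if nothing is available now.
// Idle tasks and in-progress tasks older than the timeout are eligible.
func (t *Tracker) Assign(now time.Time) int {
	t.mu.Lock()
	defer t.mu.Unlock()
	for i := range t.tasks {
		tk := &t.tasks[i]
		expired := tk.state == inProgress && now.Sub(tk.started) >= t.timeout
		if tk.state == idle || expired {
			tk.state, tk.started = inProgress, now
			return i
		}
	}
	return -1
}

// Complete marks task i finished; a late duplicate report is harmless.
func (t *Tracker) Complete(i int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.tasks[i].state = completed
}

// Done reports whether every task has completed.
func (t *Tracker) Done() bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, tk := range t.tasks {
		if tk.state != completed {
			return false
		}
	}
	return true
}

// demo walks through: two assignments, nothing left, one worker stalls,
// and its task is re-issued once more than ten seconds have passed.
func demo() (first, second, none, reissued int, done bool) {
	tr := NewTracker(2, 10*time.Second)
	start := time.Now()
	first = tr.Assign(start)
	second = tr.Assign(start)
	none = tr.Assign(start)
	tr.Complete(second)
	reissued = tr.Assign(start.Add(11 * time.Second))
	tr.Complete(reissued)
	return first, second, none, reissued, tr.Done()
}

func main() {
	fmt.Println(demo()) // 0 1 -1 0 true
}
```

The master never learns whether the first worker on task 0 crashed or was merely slow; it just stops waiting and hands the task to someone else.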
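Thoughts after reading the above:
Implement the worker's Map and Reduce over RPC; the algorithm is described above.
How does the master assign Map tasks and Reduce tasks?
Map tasks: assign them by polling.
Reduce tasks: Reduce can only start after all Maps have finished, so at that point simply hand out the NReduce reduce tasks.
The master's task assignment is a state machine.
Other details
1. Master concurrency (locking is sufficient).
2. Crashes: write to a tmp file first, and let the master be responsible for committing the writes in one place.
3. Worker disconnection.
Core: the worker's Map/Reduce algorithm and the master's state-machine scheduling algorithm.
Source: oschina
Link: https://my.oschina.net/u/4409755/blog/4840230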