distributed-system

How to get Filename/File Contents as key/value input for MAP when running a Hadoop MapReduce Job?

£可爱£侵袭症+ submitted on 2019-11-29 07:34:32
I am creating a program to analyze PDF, DOC and DOCX files. These files are stored in HDFS. When I start my MapReduce job, I want the map function to receive the filename as the key and the binary contents as the value. I then want to create a stream reader which I can pass to the PDF parser library. How can I make the key/value pair for the map phase be filename/file contents? I am using Hadoop 0.20.2. This is older code that starts a job:

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(PdfReader.class);
        conf.setJobName("pdfreader");
        conf.setOutputKeyClass(Text
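
The "stream reader" step mentioned in the question can be sketched with the standard library alone. This is a minimal sketch, not the asker's code: it assumes the map function already receives the file contents as a byte buffer plus a valid length (in a real Hadoop job this would typically come from a custom whole-file input format, which is not shown here). The class `FileValueStreams` and its methods `toStream` and `countBytes` are hypothetical names, and `countBytes` merely stands in for a call into a parser library.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: turns a map value (raw file bytes) into a stream.
class FileValueStreams {

    // Wrap only the valid prefix of the buffer. A reused buffer can be
    // longer than the file it currently holds, so the length matters.
    static InputStream toStream(byte[] buffer, int validLength) {
        return new ByteArrayInputStream(buffer, 0, validLength);
    }

    // Stand-in for a parser call: counts the bytes the stream yields.
    static int countBytes(InputStream in) {
        try {
            int total = 0;
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                total += n;
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Simulated key/value pair: filename and an oversized buffer.
        String filename = "example.pdf"; // assumed sample name
        byte[] content = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        byte[] buffer = new byte[16];
        System.arraycopy(content, 0, buffer, 0, content.length);

        InputStream in = toStream(buffer, content.length);
        System.out.println(filename + ": " + countBytes(in) + " bytes");
    }
}
```

Passing the valid length explicitly guards against handing trailing padding bytes to the parser when the underlying buffer is larger than the file.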

Paxos vs Raft for leader election

我怕爱的太早我们不能终老 submitted on 2019-11-28 21:59:32
After reading the Paxos and Raft papers, I have the following confusion: the Paxos paper only describes consensus on a single log entry, which is equivalent to the leader election part of the Raft algorithm. What is the advantage of Paxos's approach over the simple random timeout approach in Raft's leader election? It is a common misconception that the original Paxos papers don't use a stable leader. In Paxos Made Simple, on page 6 in the section entitled “The Implementation”, Lamport wrote: "The algorithm chooses a leader, which plays the roles of the distinguished proposer and the distinguished learner." This is
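
The "simple random timeout approach" the question refers to can be illustrated with a small, heavily simplified sketch. It assumes a 150 to 300 ms timeout range (the example range given in the Raft paper) and treats the node whose timer fires first as the sole candidate. Real Raft also depends on vote requests, terms and log comparison, none of which are modeled here. The class `ElectionTimeoutSketch` and its methods are hypothetical names.

```java
import java.util.Random;

// Simplified model of randomized election timeouts (not a Raft implementation).
class ElectionTimeoutSketch {
    static final int MIN_TIMEOUT_MS = 150; // assumed lower bound
    static final int MAX_TIMEOUT_MS = 300; // assumed upper bound (exclusive)

    // Each node independently picks a timeout in [MIN, MAX).
    static int randomTimeout(Random rng) {
        return MIN_TIMEOUT_MS + rng.nextInt(MAX_TIMEOUT_MS - MIN_TIMEOUT_MS);
    }

    // Index of the node whose timer fires first, or -1 when two or more
    // nodes share the earliest timeout (modeled here as a retry).
    static int firstCandidate(int[] timeouts) {
        int best = -1;
        boolean tie = false;
        for (int i = 0; i < timeouts.length; i++) {
            if (best == -1 || timeouts[i] < timeouts[best]) {
                best = i;
                tie = false;
            } else if (timeouts[i] == timeouts[best]) {
                tie = true;
            }
        }
        return tie ? -1 : best;
    }

    public static void main(String[] args) {
        Random rng = new Random();
        int[] timeouts = new int[5];
        int candidate;
        int rounds = 0;
        do {
            // On a tie, every node draws a fresh timeout and tries again.
            rounds++;
            for (int i = 0; i < timeouts.length; i++) {
                timeouts[i] = randomTimeout(rng);
            }
            candidate = firstCandidate(timeouts);
        } while (candidate == -1);
        System.out.println("node " + candidate + " times out first after "
                + rounds + " round(s)");
    }
}
```

The point of the randomization is that ties become unlikely, so one node usually starts an election well before the others do.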

NoSQL and eventual consistency - real world examples [closed]

為{幸葍}努か submitted on 2019-11-28 19:17:36
Question: I'm looking for good examples of NoSQL apps that show how to work without the transactionality we know from relational databases. I'm mostly interested in write-intensive code, since the task is much easier for mostly read-only code. I've read a number of things about NoSQL in general, about the CAP theorem, eventual consistency, etc. However, those things tend to concentrate on the database architecture for its own sake and not on the design patterns to use with it. I do understand that it