Consolidate MapReduce logs

Submitted by 江枫思渺然 on 2019-12-11 06:24:53

Question


Debugging Hadoop map-reduce jobs is a pain. I can print to stdout, but the logs end up scattered across all of the machines the MR job ran on. I can go to the jobtracker, find my job, and click through to each individual mapper's task log, but this is extremely cumbersome when you have 20+ mappers/reducers.

I was thinking that I might have to write a script that would scrape through the job tracker to figure out which machine each of the mappers/reducers ran on and then scp the logs back to one central location where they could be cat'ed together. Before I waste my time doing this, does someone know of a better way to get one consolidated stdout log for a job's mappers and reducers?


Answer 1:


I do this the following way:

For general debugging (i.e. testing that the job works), I run Hadoop in standalone mode on my local machine with a small sample of the data. That way Hadoop runs like any other Java app and shows the stdout of the mappers and reducers in the console.
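Before going anywhere near a cluster, you can also exercise the map/reduce logic as plain functions in an ordinary script, so prints and tracebacks land right in your console. Here is a minimal sketch of that idea; the word-count mapper/reducer are hypothetical stand-ins for your own job's logic:

```python
# Run map/reduce logic locally on a small sample, like any ordinary script.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    return word, sum(counts)

def run_local(lines):
    # Emulate the shuffle phase: sort the map output by key, then group.
    map_out = sorted(kv for line in lines for kv in mapper(line))
    return [reducer(k, (c for _, c in g))
            for k, g in groupby(map_out, key=itemgetter(0))]

sample = ["the cat sat", "the cat"]
print(run_local(sample))  # -> [('cat', 2), ('sat', 1), ('the', 2)]
```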

For specific bugs (i.e. the job runs fine on my local machine but dies in production), I just tweak the code to emit as the job's output what I would normally send to stdout when debugging. That way you can check the job's result for debugging insights. This is not pretty, but it works fine.
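In a Hadoop Streaming-style Python mapper, that trick amounts to routing would-be debug prints into the output stream as tagged records, so they end up in the job's result files rather than in scattered task logs. A hedged sketch (the `__DEBUG__` tag and record layout are my own invention):

```python
import sys

DEBUG_TAG = "__DEBUG__"  # hypothetical marker; grep it out of the job output

def mapper(lines, out=sys.stdout):
    """Streaming-style mapper that emits debug info as tagged output records."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            # Would normally be a print for debugging; emit it as an output
            # record instead so it survives in the job's result files.
            out.write("%s\tbad record: %r\n" % (DEBUG_TAG, line))
            continue
        out.write("%s\t%s\n" % (fields[0], fields[1]))
```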

Another option is to check the node logs in the jobtracker; they contain all the stdout and stderr. However, for several reasons I have found this to be much more complicated than the solutions described above (logs are deleted after some time, there are several nodes to look through, etc.).




Answer 2:


So I ended up just creating a Python script to do this. It wasn't horrible. Here's the script in case anyone else wants to use it. Obviously it needs more error checking, the URLs shouldn't be hard-coded, etc., but you get the idea. Note that you need to install Beautiful Soup.

#!/usr/bin/python
import sys
from bs4 import BeautifulSoup as BS
from urllib2 import urlopen
import re

TRACKER_BASE_URL = 'http://my.tracker.com:50030/'
trackerURLformat = TRACKER_BASE_URL + 'jobtasks.jsp?jobid=%s&type=%s&pagenum=1' # use map or reduce for the type

def findLogs(url):
    finalLog = ""

    print "Looking for Job: " + url
    html = urlopen(url).read()
    trackerSoup = BS(html)
    taskURLs = [h.get('href') for h in trackerSoup.find_all(href=re.compile('taskdetails'))]

    # Now that we know where all the tasks are, go find their logs
    logURLs = []
    for taskURL in taskURLs:
        taskHTML = urlopen(TRACKER_BASE_URL + taskURL).read()
        taskSoup = BS(taskHTML)
        allLogURL = taskSoup.find(href=re.compile('all=true')).get('href')
        logURLs.append(allLogURL)

    # Now fetch the stdout log from each
    for logURL in logURLs:
        logHTML = urlopen(logURL).read()
        logSoup = BS(logHTML)
        stdoutText = logSoup.body.pre.text.lstrip()
        finalLog += stdoutText

    return finalLog


def main(argv):
    with open(argv[1] + "-map-stdout.log", "w") as f:
        f.write(findLogs(trackerURLformat % (argv[1], "map")))
        print "Wrote mappers stdouts to " + f.name

    with open(argv[1] + "-reduce-stdout.log", "w") as f:
        f.write(findLogs(trackerURLformat % (argv[1], "reduce")))
        print "Wrote reducer stdouts to " + f.name

if __name__ == "__main__":
    main(sys.argv)



Answer 3:


My experience is that you don't need to click through 20+ map/reduce output links when you know exactly which map/reduce attempt caused the problem you want to examine in the logs. That's why I always call Context.setStatus("Warn message here") when I throw an exception, or increment a counter that is there to raise suspicion.
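setStatus is part of the Java API, but Hadoop Streaming exposes the same mechanism to Python tasks: a task can update its status or increment a counter by writing specially formatted `reporter:` lines to stderr, which then show up against that task in the jobtracker UI. A small sketch (the helper names are my own):

```python
import sys

def set_status(message, err=sys.stderr):
    # Hadoop Streaming parses "reporter:status:<message>" lines from a
    # task's stderr and displays them as that task's status in the UI.
    err.write("reporter:status:%s\n" % message)

def incr_counter(group, counter, amount=1, err=sys.stderr):
    # Same mechanism for counters: "reporter:counter:<group>,<name>,<amount>"
    err.write("reporter:counter:%s,%s,%d\n" % (group, counter, amount))
```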

More on setStatus: http://hadoop.apache.org/docs/r1.1.1/api/org/apache/hadoop/mapreduce/TaskInputOutputContext.html#setStatus(java.lang.String)

https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-5/running-on-a-cluster (Section Debugging a Job)



Source: https://stackoverflow.com/questions/18518983/consolidate-mapreduce-logs
