Pig Udf in displaying result

Submitted by 允我心安 on 2019-12-13 12:16:20

Question


I am new to Pig. I have written a UDF in Java and included a System.out.println statement in it. I need to know where the output of this statement gets printed when the UDF runs in Pig.


Answer 1:


If you register and use this UDF in your Pig script, the output is stored in a Pig log file such as the stdout logs.




Answer 2:


Assuming your UDF extends EvalFunc, you can use the Logger returned from EvalFunc.getLogger(). The log output should be visible in the logs of the associated Map / Reduce task that Pig executes (if the job executes in more than one stage, you'll have to pick through them to find the associated log entries).

The logs will end up in the MapReduce task log file. I advise debugging your UDF in local mode before deploying it on a cluster, so that you can debug it from an IDE such as Eclipse.

By default, errors (e.g. script parsing errors) are logged to pig.logfile, which can be set in $PIG_HOME/conf/pig.properties. If you want to log status messages too, prepare a valid log4j.properties file and set it in the log4jconf property.

When using Pig v0.10.0 (r1328203) I found that a successful Pig task doesn't write the job's history logs to the output directory on HDFS. (hadoop.job.history.user.location=${mapred.output.dir}/_logs/history/)

If you need these history logs anyway, set mapred.output.dir in your Pig script like this:

set mapred.output.dir '/user/hadoop/test/output';

Note: Pig uses Apache's log4j module for logging. However, it can be daunting to figure out why you are not able to use log4j.properties with Pig, as sometimes you might get an NPE with a custom root logger.

Pig has a command line option -4 (yes, not as intuitive as one would like, although it can be related to log4j) for passing a log4j configuration file.

Here is a sample invocation, followed by a sample log4j.properties. The -l option names the log file, and mysample.pig is the script to run.

pig -l /tmp/some.pig.log -4 log4j.properties -x local mysample.pig

cat log4j.properties

# Root logger option (only appenders defined in this file may be listed here)
log4j.rootLogger=INFO, file
# Per-package log levels
log4j.logger.org.apache.pig=DEBUG
log4j.logger.org.apache.hadoop=INFO
# Direct log messages to a log file
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=${pig.logfile}
log4j.appender.file.MaxFileSize=1MB
log4j.appender.file.MaxBackupIndex=1
log4j.appender.file.layout=org.apache.log4j.PatternLayout
#log4j.appender.file.layout.ConversionPattern=%d{ABSOLUTE} %5p %c{1}:%L - %m%n
log4j.appender.file.layout.ConversionPattern=%d{ABSOLUTE} %5p [%t] (%F:%L) - %m%n

#another example line below for a different format of output log line
# log4j.appender.file.layout.ConversionPattern="%d [%t] %-5p %c - %m%n"

The output of the above pig command is stored in the file /tmp/some.pig.log in a typical apache log4j format.

Please see the Apache log4j documentation for the available appenders (file, console, and so on) and their respective output formats. Or let me know if you are looking for a specific format or redirect option.




Answer 3:


If you are running Pig on a single machine, say your local computer, then the System.out.println output is displayed along with everything else that is printed on the terminal. But if the Pig script is run on a cluster, you won't see the printed messages. Bizarre, hmm?

If you think a little deeper, every task runs on a separate machine, so the printed messages stay on the individual machines of the cluster, and hence you won't see them on your machine.

Now, what is the solution? The process is a little bit tedious, so bear with me.
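To make that concrete: System.out.println simply writes to whatever stream the current JVM has as its standard output. The sketch below is plain Java rather than Pig code. `fakeUdf` is a hypothetical stand-in for the body of a UDF's exec method, and the in-memory buffer plays the role of the stdout file that a cluster task writes on its own node.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.util.Locale;

final class StdoutDemo {
    // Hypothetical stand-in for a UDF's exec body (not a real Pig class).
    static String fakeUdf(String input) {
        System.out.println("processing: " + input);
        return input.toUpperCase(Locale.ROOT);
    }

    // Runs fakeUdf with System.out redirected to a buffer and returns what was printed.
    static String captureStdout(String input) {
        PrintStream original = System.out;
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream redirected = new PrintStream(buffer, true);
        System.setOut(redirected);
        try {
            fakeUdf(input);
        } finally {
            // Always restore the real stdout, even if the call throws.
            System.setOut(original);
            redirected.close();
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        String printed = captureStdout("pig");
        System.out.print("captured: " + printed);
    }
}
```

The message ends up wherever the hosting process points its stdout, which is why on a cluster it lands in a file on the worker node instead of on your terminal.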

The url to track the job: http://ip-172-31-29-193.us-west-2.compute.internal:20888/proxy/application_1443585172695_0019/

Open it in a browser. It will fail to open because the hostname is a private one. Say you are using an EMR cluster: get the master's public DNS name, which in my case is

Master public DNS:ec2-52-89-98-140.us-west-2.compute.amazonaws.com

Now replace the private hostname in the URL above with the public DNS name, which gives

ec2-52-89-98-140.us-west-2.compute.amazonaws.com:20888/proxy/application_1443585172695_0019/

After opening this you will notice that the URL has changed to a private hostname again, this time pointing at the job history server:

http://ip-172-31-29-193.us-west-2.compute.internal:19888/jobhistory/job/job_1443585172695_0019/

Again, replace the private hostname with the public one:

ec2-52-89-98-140.us-west-2.compute.amazonaws.com:19888/jobhistory/job/job_1443585172695_0019/
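The two host swaps above can also be done programmatically. This is a minimal sketch using only java.net.URI; the class and method names (`TrackingUrl`, `withHost`) are made up for illustration, and the URLs are the ones from this answer.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class TrackingUrl {
    // Returns the same URL with only the host swapped; scheme, port, and path are kept.
    static String withHost(String url, String publicHost) {
        URI u = URI.create(url);
        try {
            return new URI(u.getScheme(), u.getUserInfo(), publicHost, u.getPort(),
                    u.getPath(), u.getQuery(), u.getFragment()).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("cannot rebuild URL with host " + publicHost, e);
        }
    }

    public static void main(String[] args) {
        String masterPublicDns = "ec2-52-89-98-140.us-west-2.compute.amazonaws.com";
        System.out.println(withHost(
                "http://ip-172-31-29-193.us-west-2.compute.internal:20888/proxy/application_1443585172695_0019/",
                masterPublicDns));
        System.out.println(withHost(
                "http://ip-172-31-29-193.us-west-2.compute.internal:19888/jobhistory/job/job_1443585172695_0019/",
                masterPublicDns));
    }
}
```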

By now you should have reached the job's page on the job history server.

Now determine whether your task (the point where the UDF is called) is executed in the mapper or the reducer phase (before or after the GROUP BY) and click on the corresponding link.

Now go to the terminal where the logs are. Find the step where your variable is computed and get the job ID from there.

My job ID is job_1443585172695_0021

Now, following the previous step, let's say your variable lies in the reduce phase. Click on that and, on the screen that opens, get the private IP, which is 172-31-28-99 in my case.

Now go to the EMR page, click on Hardware Instances, and then click on View EC2 Instances.

You will get a list of the cluster's instances. Find the public IP corresponding to the private IP; in my case it is 52.25.196.219

Now open the URL publicip:8042

i.e. 52.25.196.219:8042, to reach that node's web page. Click on Tools on the left side and then click Local logs.

Bear with me a little longer, we are almost there.

You will get another page. Now navigate as follows:

click on Container --> YOUR JOB ID (which we found earlier; in my case it was application_1443585172695_0021/ 4096 bytes Sep 30, 2015 5:28:53 AM) --> there will be many entries with container as prefix. Open one and you will find stdout; open it to see the System.out.println messages.
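Note that the directory is named after the YARN application ID, not the job ID; the two share the same numeric part (job_1443585172695_0021 corresponds to application_1443585172695_0021). A tiny sketch of that mapping, with a hypothetical helper name:

```java
final class YarnIds {
    private static final String JOB_PREFIX = "job_";

    // Maps a MapReduce job ID to the matching YARN application ID by swapping the prefix.
    static String toApplicationId(String jobId) {
        if (jobId == null || !jobId.startsWith(JOB_PREFIX)) {
            throw new IllegalArgumentException("not a job ID: " + jobId);
        }
        return "application_" + jobId.substring(JOB_PREFIX.length());
    }

    public static void main(String[] args) {
        System.out.println(toApplicationId("job_1443585172695_0021"));
    }
}
```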

Well, here you have your logs. Phew. That was some troublesome work. Do it a couple of times and you will be a pro at it.

A couple of things to remember: 1) Test the UDF on your local machine. 2) Learn to write unit tests; they help a lot in debugging.

These two things will save you all the trouble of finding the logs.

There is a method to find the actual container number, but I forgot it. If anyone knows, please do let me know.

PS: Sorry if the answer was too long. I wanted to explain it properly, and pardon my English.
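One way to act on both points, sketched under assumptions: keep the UDF's actual transformation in a plain static method with no Pig types, so it can be checked on a laptop before the job ever reaches a cluster. `UdfLogic` and `normalize` are hypothetical names; a real UDF's exec method would extract the field from its input tuple and delegate to such a method.

```java
import java.util.Locale;

final class UdfLogic {
    // Hypothetical transformation: trims and lower-cases a field, passing null through.
    static String normalize(String field) {
        if (field == null) {
            return null;
        }
        return field.trim().toLowerCase(Locale.ROOT);
    }

    // Minimal check helper so the example needs no test framework.
    static void check(boolean ok, String what) {
        if (!ok) {
            throw new AssertionError("failed: " + what);
        }
    }

    public static void main(String[] args) {
        check("pig".equals(normalize("  Pig ")), "trim and lower-case");
        check(normalize(null) == null, "null passes through");
        check("".equals(normalize("   ")), "blank becomes empty");
        System.out.println("all checks passed");
    }
}
```

In a real project the same checks would normally live in a unit test framework; plain checks are used here only to keep the sketch free of dependencies.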



Source: https://stackoverflow.com/questions/24795301/pig-udf-in-displaying-result
