How do you use Python UDFs with Pig in Elastic MapReduce?

深忆病人 2020-12-11 08:20

I really want to take advantage of Python UDFs in Pig on our AWS Elastic MapReduce cluster, but I can't quite get things to work properly. No matter what I try, my pig job

4 Answers
  • 2020-12-11 08:25

    After quite a few wrong turns, I found that, at least on the Elastic MapReduce implementation of Hadoop, Pig seems to ignore the CLASSPATH environment variable. I found that I could instead control the class path with the HADOOP_CLASSPATH variable.

    Once I made that realization, it was fairly easy to get things set up to use Python UDFs:

    • Install Jython
      • sudo apt-get install jython -y -qq
    • Set the HADOOP_CLASSPATH environment variable.
      • export HADOOP_CLASSPATH=/usr/share/java/jython.jar:/usr/share/maven-repo/org/antlr/antlr-runtime/3.2/antlr-runtime-3.2.jar
        • jython.jar ensures that Hadoop can find the PyException class
        • antlr-runtime-3.2.jar ensures that Hadoop can find the CharStream class
    • Create the cache directory for Jython (this is documented in the Jython FAQ)
      • sudo mkdir /usr/share/java/cachedir/
      • sudo chmod a+rw /usr/share/java/cachedir

    I should point out that this seems to directly contradict other advice I found while searching for solutions to this problem:

    • Setting the CLASSPATH and PIG_CLASSPATH environment variables doesn't seem to do anything.
    • The .py file containing the UDF does not need to be included in the HADOOP_CLASSPATH environment variable.
    • The path to the .py file used in the Pig register statement may be relative or absolute; it doesn't seem to matter.
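
    To make the above concrete, here is a minimal sketch of what such a UDF file might contain. The function name, schema, and field names are hypothetical and not from the original answer; the outputSchema decorator is supplied by Pig's Jython script engine, so no import is needed for it.

    # udfs.py: a hypothetical example UDF
    @outputSchema('word:chararray')
    def to_upper(s):
        # Return None for null input so Pig sees a null rather than an error
        if s is None:
            return None
        return s.upper()

    It would then be registered and invoked from a Pig script roughly as register 'udfs.py' using jython as myfuncs; followed by something like b = FOREACH a GENERATE myfuncs.to_upper(name);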
  • 2020-12-11 08:29

    As of today, using Pig 0.9.1 on EMR, I found the following is sufficient:

    env HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/path/to/jython.jar pig -f script.pig
    

    where script.pig registers the Python script, but not jython.jar:

    register Pig-UDFs/udfs.py using jython as mynamespace;
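
    For illustration, a hypothetical udfs.py behind that register statement might look like the following (the tokenize function, its schema, and the relation names in the comment are assumptions, not from the original answer); Jython converts a returned list of tuples into a Pig bag:

    # udfs.py: illustrative contents for the file registered above
    @outputSchema('words:bag{t:tuple(word:chararray)}')
    def tokenize(line):
        # Split a line of text into a bag of single-field tuples
        if line is None:
            return None
        return [(w,) for w in line.split()]

    # Called from the Pig script as, for example:
    #   tokens = FOREACH lines GENERATE FLATTEN(mynamespace.tokenize(text));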
    
  • 2020-12-11 08:34

    I faced the same problem recently. Your answer can be simplified: you don't need to install Jython at all or create the cache directory, but you do need to include the Jython jar in an EMR bootstrap script (or do something similar). I wrote an EMR bootstrap script with the following lines. One could simplify this even further by not using s3cmd at all and instead using your job flow to place the files in a certain directory. Getting the UDF via s3cmd is definitely inconvenient; however, I was unable to register a UDF file on S3 when using the EMR version of Pig.

    If you are using CharStream, you have to include that jar in the piglib path as well. Depending on the framework you use, you can pass these bootstrap scripts as options to your job; EMR supports this via its elastic-mapreduce Ruby client. A simple option is to place the bootstrap scripts on S3.

    If you are using s3cmd in the bootstrap script, you need another bootstrap script that does something like the following, and it should be placed before the other one in bootstrap order. I am moving away from using s3cmd, but in my successful attempt, s3cmd did the trick. Also, the s3cmd executable is already installed in the Pig image for Amazon (e.g. AMI version 2.0 with Hadoop version 0.20.205).

    Script #1 (Seeding s3cmd)

    #!/bin/bash
    cat <<-OUTPUT > /home/hadoop/.s3cfg
    [default]
    access_key = YOUR KEY
    bucket_location = US
    cloudfront_host = cloudfront.amazonaws.com
    cloudfront_resource = /2010-07-15/distribution
    default_mime_type = binary/octet-stream
    delete_removed = False
    dry_run = False
    encoding = UTF-8
    encrypt = False
    follow_symlinks = False
    force = False
    get_continue = False
    gpg_command = /usr/local/bin/gpg
    gpg_decrypt = %(gpg_command)s -d --verbose --no-use-agent --batch --yes --passphrase-fd %(passphrase_fd)s -o %(output_file)s %(input_file)s
    gpg_encrypt = %(gpg_command)s -c --verbose --no-use-agent --batch --yes --passphrase-fd %(passphrase_fd)s -o %(output_file)s %(input_file)s
    gpg_passphrase = YOUR PASSPHRASE
    guess_mime_type = True
    host_base = s3.amazonaws.com
    host_bucket = %(bucket)s.s3.amazonaws.com
    human_readable_sizes = False
    list_md5 = False
    log_target_prefix =
    preserve_attrs = True
    progress_meter = True
    proxy_host =
    proxy_port = 0
    recursive = False
    recv_chunk = 4096
    reduced_redundancy = False
    secret_key = YOUR SECRET
    send_chunk = 4096
    simpledb_host = sdb.amazonaws.com
    skip_existing = False
    socket_timeout = 10
    urlencoding_mode = normal
    use_https = False
    verbosity = WARNING
    OUTPUT
    

    Script #2 (seeding jython jars)

    #!/bin/bash
    set -e
    
    s3cmd get <jython.jar>
    # Very useful for extra libraries not available in the jython jar. I got these libraries from the 
    # jython site and created a jar archive.
    s3cmd get <jython_extra_libs.jar>
    s3cmd get <UDF>
    
    PIG_LIB_PATH=/home/hadoop/piglibs
    
    mkdir -p $PIG_LIB_PATH
    
    mv <jython.jar> $PIG_LIB_PATH
    mv <jython_extra_libs.jar> $PIG_LIB_PATH
    mv <UDF> $PIG_LIB_PATH
    
    # Change hadoop classpath as well.
    echo "HADOOP_CLASSPATH=$PIG_LIB_PATH/<jython.jar>:$PIG_LIB_PATH/<jython_extra_libs.jar>" >>    /home/hadoop/conf/hadoop-user-env.sh
    
  • 2020-12-11 08:41

    Hmm... to clarify some of what I just read here: at this point, using a Python UDF stored on S3 with Pig running on EMR is as simple as this line in your Pig script:

    REGISTER 's3://path/to/bucket/udfs.py' using jython as mynamespace

    That is, no classpath modifications are necessary. I'm using this in production right now, though with the caveat that I'm not pulling in any additional Python modules in my UDF. I think that may affect what you need to do to make it work.
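
    As a sketch of that caveat (the function and module choices below are assumptions, not from the original answer): a UDF that relies only on modules shipped with Jython should work when registered straight from S3 like this, while anything beyond that brings back the classpath/bootstrap work described in the earlier answers.

    # udfs.py: hypothetical UDF that uses only the Jython standard library
    import re  # works as long as Jython's standard library is available on the cluster

    @outputSchema('clean:chararray')
    def strip_punctuation(s):
        if s is None:
            return None
        return re.sub(r'[^\w\s]', '', s)

    # CPython extension modules (numpy, etc.) will not import under Jython at all,
    # and even pure-Python third-party modules have to be shipped to the cluster,
    # for example via the jython_extra_libs.jar approach in the earlier answer.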
