Reading / writing files from HDFS using Python with subprocess, PIPE, Popen gives error


Question


I am trying to read (open) and write files in HDFS from inside a Python script, but I am getting an error. Can someone tell me what is wrong here?

Code (full): sample.py

#!/usr/bin/python

from subprocess import Popen, PIPE

print "Before Loop"

cat = Popen(["hadoop", "fs", "-cat", "./sample.txt"],
            stdout=PIPE)

print "After Loop 1"
put = Popen(["hadoop", "fs", "-put", "-", "./modifiedfile.txt"],
            stdin=PIPE)

print "After Loop 2"
for line in cat.stdout:
    line += "Blah"
    print line
    print "Inside Loop"
    put.stdin.write(line)

cat.stdout.close()
cat.wait()
put.stdin.close()
put.wait()

When I execute:

hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar -file ./sample.py -mapper './sample.py' -input sample.txt -output fileRead

it executes properly, but I couldn't find modifiedfile.txt, which was supposed to be created in HDFS.

And when I execute:

 hadoop fs -getmerge ./fileRead/ file.txt

inside file.txt I got:

Before Loop 
Before Loop 
After Loop 1    
After Loop 1    
After Loop 2    
After Loop 2

Can someone please tell me what I am doing wrong here? I don't think it reads from sample.txt at all.


Answer 1:


Try changing your put subprocess to read cat's stdout directly, by changing this:

put = Popen(["hadoop", "fs", "-put", "-", "./modifiedfile.txt"],
            stdin=PIPE)

into this

put = Popen(["hadoop", "fs", "-put", "-", "./modifiedfile.txt"],
            stdin=cat.stdout)

Full script:

#!/usr/bin/python

from subprocess import Popen, PIPE

print "Before Loop"

# cat streams the HDFS file into a pipe...
cat = Popen(["hadoop", "fs", "-cat", "./sample.txt"],
            stdout=PIPE)

print "After Loop 1"
# ...and put reads that pipe directly as its stdin, so the data
# flows from one hadoop process to the other without passing
# through Python at all
put = Popen(["hadoop", "fs", "-put", "-", "./modifiedfile.txt"],
            stdin=cat.stdout)
put.communicate()  # wait until the upload to HDFS has finished
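
Note that wiring the two processes together like this copies sample.txt to modifiedfile.txt unchanged. If you still want to append "Blah" to every line, a minimal standalone sketch (run directly with python sample.py, not as a streaming mapper; same HDFS paths as in the question) keeps Python in the middle:

#!/usr/bin/python
from subprocess import Popen, PIPE

# stream the HDFS file out...
cat = Popen(["hadoop", "fs", "-cat", "./sample.txt"], stdout=PIPE)
# ...and stream the edited lines back into a new HDFS file
put = Popen(["hadoop", "fs", "-put", "-", "./modifiedfile.txt"], stdin=PIPE)

for line in cat.stdout:
    # each line still carries its trailing newline; strip it so
    # "Blah" is appended to the same line rather than the next one
    put.stdin.write(line.rstrip("\n") + "Blah\n")

cat.stdout.close()
cat.wait()
put.stdin.close()  # closing stdin signals EOF, so -put can finish
put.wait()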



Answer 2:


Can someone please tell me what I am doing wrong here?

Your sample.py might not be a proper mapper. A streaming mapper accepts its input on stdin and writes its result to stdout, e.g., blah.py:

#!/usr/bin/env python
import sys

for line in sys.stdin:  # equivalent: print("Blah\n".join(sys.stdin) + "Blah\n")
    line += "Blah"      # line still ends with '\n', so "Blah" prints on its own line
    print(line)

Usage:

$ hadoop ... -file ./blah.py -mapper './blah.py' -input sample.txt -output fileRead
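
To sanity-check such a mapper without Hadoop, you can pipe a local copy of the input through it on the command line, since streaming feeds the mapper through stdin in exactly the same way:

$ cat sample.txt | ./blah.py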


Source: https://stackoverflow.com/questions/28139406/reading-writing-files-from-hdfs-using-python-with-subprocess-pipe-popen-give
