Outputting to a file in HDFS using a subprocess


Question


I have a script that reads text line by line, modifies each line slightly, and then writes the line out to a file in HDFS. I can read the text in fine; the problem is that I cannot write the modified text out. Here is my code.

cat = subprocess.Popen(["hadoop", "fs", "-cat", "/user/test/myfile.txt"], stdout=subprocess.PIPE)
for line in cat.stdout:
    line = line+"Blah";
    subprocess.Popen(["hadoop", "fs", "-put", "/user/test/moddedfile.txt"], stdin=line)

This is the error I am getting.

AttributeError: 'str' object has no attribute 'fileno'
cat: Unable to write to output stream.

Answer 1:


A quick-and-dirty way to make your code work:

import subprocess
from tempfile import NamedTemporaryFile

cat = subprocess.Popen(["hadoop", "fs", "-cat", "/user/test/myfile.txt"],
                       stdout=subprocess.PIPE)

# The temporary file is opened in binary mode and the pipe yields bytes,
# so append a bytes literal (this also keeps the code working on Python 3).
with NamedTemporaryFile() as f:
    for line in cat.stdout:
        f.write(line + b"Blah")

    f.flush()
    cat.wait()

    # -put reads the local temporary file by name and copies it to HDFS.
    put = subprocess.Popen(["hadoop", "fs", "-put", f.name, "/user/test/moddedfile.txt"])
    put.wait()

But I suggest you look at the hdfs / WebHDFS Python libraries instead.

For example, pywebhdfs.
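A minimal sketch of the same read-modify-write flow with pywebhdfs (not from the original answer): the NameNode host, port, and user name are placeholders, and the method names and signatures (PyWebHdfsClient, read_file, create_file) are assumptions based on the pywebhdfs documentation, so verify them against the version you install.

from pywebhdfs.webhdfs import PyWebHdfsClient

# Host, port, and user below are placeholders; 50070 is the classic WebHDFS
# port (Hadoop 3 clusters typically use 9870 instead).
hdfs = PyWebHdfsClient(host='namenode.example.com', port='50070', user_name='hdfs')

# read_file returns the file contents as bytes; pywebhdfs paths are usually
# given without a leading slash.
data = hdfs.read_file('user/test/myfile.txt')

# Modify each line and rebuild the payload.
modified = b"".join(line + b"Blah\n" for line in data.splitlines())

# create_file uploads the modified contents as a new HDFS file.
hdfs.create_file('user/test/moddedfile.txt', modified)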




Answer 2:


The stdin argument doesn't accept a string. It should be PIPE, None, an existing file object (something with a valid .fileno()), or an integer file descriptor.

from subprocess import Popen, PIPE

cat = Popen(["hadoop", "fs", "-cat", "/user/test/myfile.txt"],
            stdout=PIPE, bufsize=-1)
# "-put -" tells hadoop fs to read the file contents from stdin.
put = Popen(["hadoop", "fs", "-put", "-", "/user/test/moddedfile.txt"],
            stdin=PIPE, bufsize=-1)
for line in cat.stdout:
    line += b"Blah"          # the pipe yields bytes, so append a bytes literal
    put.stdin.write(line)

cat.stdout.close()  # let cat receive SIGPIPE if put exits early
cat.wait()
put.stdin.close()   # signal EOF so -put can finish writing the HDFS file
put.wait()
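If the modified output is small enough to hold in memory, a variation on the same approach (not from the original answer) is to build the whole payload first and let communicate() write it to -put's stdin and close the pipe for you:

from subprocess import Popen, PIPE

cat = Popen(["hadoop", "fs", "-cat", "/user/test/myfile.txt"], stdout=PIPE)
# Collect the modified lines in memory (only sensible for small files).
payload = b"".join(line + b"Blah" for line in cat.stdout)
cat.wait()

put = Popen(["hadoop", "fs", "-put", "-", "/user/test/moddedfile.txt"], stdin=PIPE)
# communicate() writes the payload to stdin, closes it, and waits for -put to exit.
put.communicate(payload)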


Source: https://stackoverflow.com/questions/22349733/outputting-to-a-file-in-hdfs-using-a-subprocess
