Outputting to a file in HDFS using a subprocess


Question


I have a script that reads text line by line, modifies each line slightly, and then writes the line out to a file in HDFS. I can read the text in fine; the problem is that I cannot write the modified text out. Here is my code.

cat = subprocess.Popen(["hadoop", "fs", "-cat", "/user/test/myfile.txt"], stdout=subprocess.PIPE)
for line in cat.stdout:
    line = line+"Blah";
    subprocess.Popen(["hadoop", "fs", "-put", "/user/test/moddedfile.txt"], stdin=line)

This is the error I am getting.

AttributeError: 'str' object has no attribute 'fileno'
cat: Unable to write to output stream.

Answer 1:


A quick-and-dirty way to make your code work:

import subprocess
from tempfile import NamedTemporaryFile

cat = subprocess.Popen(["hadoop", "fs", "-cat", "/user/test/myfile.txt"],
                       stdout=subprocess.PIPE)

# The temporary file is opened in binary mode and the pipe yields bytes,
# so append a bytes literal (this also keeps the code working on Python 3).
with NamedTemporaryFile() as f:
    for line in cat.stdout:
        f.write(line + b"Blah")

    f.flush()
    cat.wait()

    # -put reads the local temporary file by name and copies it to HDFS.
    put = subprocess.Popen(["hadoop", "fs", "-put", f.name, "/user/test/moddedfile.txt"])
    put.wait()

But I suggest you look at the hdfs / WebHDFS Python libraries instead.

For example, pywebhdfs.
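A minimal sketch of the same read-modify-write flow with pywebhdfs (not from the original answer): the NameNode host, port, and user name are placeholders, and the method names and signatures (PyWebHdfsClient, read_file, create_file) are assumptions based on the pywebhdfs documentation, so verify them against the version you install.

from pywebhdfs.webhdfs import PyWebHdfsClient

# Host, port, and user below are placeholders; 50070 is the classic WebHDFS
# port (Hadoop 3 clusters typically use 9870 instead).
hdfs = PyWebHdfsClient(host='namenode.example.com', port='50070', user_name='hdfs')

# read_file returns the file contents as bytes; pywebhdfs paths are usually
# given without a leading slash.
data = hdfs.read_file('user/test/myfile.txt')

# Modify each line and rebuild the payload.
modified = b"".join(line + b"Blah\n" for line in data.splitlines())

# create_file uploads the modified contents as a new HDFS file.
hdfs.create_file('user/test/moddedfile.txt', modified)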




Answer 2:


The stdin argument doesn't accept a string. It should be PIPE, None, an existing file object (something with a valid .fileno()), or an integer file descriptor.

from subprocess import Popen, PIPE

cat = Popen(["hadoop", "fs", "-cat", "/user/test/myfile.txt"],
            stdout=PIPE, bufsize=-1)
# "-put -" tells hadoop fs to read the file contents from stdin.
put = Popen(["hadoop", "fs", "-put", "-", "/user/test/moddedfile.txt"],
            stdin=PIPE, bufsize=-1)
for line in cat.stdout:
    line += b"Blah"          # the pipe yields bytes, so append a bytes literal
    put.stdin.write(line)

cat.stdout.close()  # let cat receive SIGPIPE if put exits early
cat.wait()
put.stdin.close()   # signal EOF so -put can finish writing the HDFS file
put.wait()
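If the modified output is small enough to hold in memory, a variation on the same approach (not from the original answer) is to build the whole payload first and let communicate() write it to -put's stdin and close the pipe for you:

from subprocess import Popen, PIPE

cat = Popen(["hadoop", "fs", "-cat", "/user/test/myfile.txt"], stdout=PIPE)
# Collect the modified lines in memory (only sensible for small files).
payload = b"".join(line + b"Blah" for line in cat.stdout)
cat.wait()

put = Popen(["hadoop", "fs", "-put", "-", "/user/test/moddedfile.txt"], stdin=PIPE)
# communicate() writes the payload to stdin, closes it, and waits for -put to exit.
put.communicate(payload)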


Source: https://stackoverflow.com/questions/22349733/outputting-to-a-file-in-hdfs-using-a-subprocess
