Python write to hdfs file

Submitted by 时光毁灭记忆、已成空白 on 2021-02-18 22:00:58

Question


What is the best way to create/write/update a file in remote HDFS from local python script?

I am able to list files and directories but writing seems to be a problem.

I have looked at hdfs and snakebite, but neither gives a clean way to do this.


Answer 1:


Try the hdfs library, it's really good. You can use write(): https://hdfscli.readthedocs.io/en/latest/api.html#hdfs.client.Client.write

Example:

To create a connection:

from hdfs import InsecureClient
from json import dump, dumps

client = InsecureClient('http://host:port', user='ann')

records = [
  {'name': 'foo', 'weight': 1},
  {'name': 'bar', 'weight': 2},
]

# As a context manager:
with client.write('data/records.jsonl', encoding='utf-8') as writer:
  dump(records, writer)

# Or, passing the serialized data directly (overwrite=True is needed here
# because the file already exists after the previous call):
client.write('data/records.jsonl', data=dumps(records), encoding='utf-8',
             overwrite=True)
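Since the question also asks about updating files: the same write() method takes an append flag. A minimal sketch, assuming append support is enabled on the cluster:

# Append one more record to the existing file (requires HDFS append support).
client.write('data/records.jsonl', data=dumps({'name': 'baz', 'weight': 3}),
             encoding='utf-8', append=True)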

For CSV you can do:

import pandas as pd

# Read a local CSV and write it to HDFS through the same client as above.
df = pd.read_csv("file.csv")
with client.write('path/output.csv', encoding='utf-8') as writer:
  df.to_csv(writer)
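To read it back into a dataframe, a minimal sketch using the same library's read() context manager (the path is the one written above):

# Read the CSV back from HDFS into a pandas dataframe.
with client.read('path/output.csv', encoding='utf-8') as reader:
  df = pd.read_csv(reader)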



Answer 2:


What's wrong with other answers

They use WebHDFS, which is not enabled by default and is insecure without Kerberos or Apache Knox.

This is what the upload function of that hdfs library you linked to uses.

Native (more secure) ways to write to HDFS using Python

You can use pyspark.

Example - How to write pyspark dataframe to HDFS and then how to read it back into dataframe?
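For instance, a minimal sketch, assuming a Spark installation configured against your cluster (the namenode host, port, and paths below are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('hdfs-write-example').getOrCreate()

df = spark.createDataFrame(
    [('foo', 1), ('bar', 2)],
    ['name', 'weight'],
)

# Write the dataframe to HDFS as Parquet, then read it back.
df.write.mode('overwrite').parquet('hdfs://namenode:8020/data/records.parquet')
df_back = spark.read.parquet('hdfs://namenode:8020/data/records.parquet')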


snakebite has been mentioned, but it doesn't write files.


pyarrow has a FileSystem.open() function that should be able to write to HDFS as well, though I haven't tried it.
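A minimal sketch of what that might look like, assuming pyarrow's legacy hdfs API (deprecated in recent releases in favour of pyarrow.fs.HadoopFileSystem) and a working libhdfs installation; host, port, and path are placeholders:

import pyarrow as pa

# Connect via libhdfs (needs the Hadoop native libraries on the machine).
fs = pa.hdfs.connect('namenode', 8020)

# open() returns a file-like object that accepts bytes.
with fs.open('/data/hello.txt', 'wb') as f:
    f.write(b'hello from pyarrow\n')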




Answer 3:


Without using a complicated library built for HDFS, you can also simply use the requests package in Python to talk to WebHDFS:

import requests
from json import dumps

# WebHDFS paths live under the /webhdfs/v1 prefix, and params must be a dict
# (the original tuple was missing a trailing comma). The CREATE call first
# gets a 307 redirect to a datanode; requests follows it and re-sends the body.
params = {'op': 'CREATE'}
data = dumps(file)  # some file or object - also tested with the pickle library
response = requests.put('http://host:port/webhdfs/v1/path', params=params, data=data)

If the response status is 201 (Created), the write succeeded and your connection is working! This technique lets you use all the operations exposed by Hadoop's RESTful WebHDFS API: LISTSTATUS, MKDIRS, OPEN, CREATE, APPEND, etc.
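For example, a minimal sketch of a directory listing with the same approach (the path is a placeholder):

# List a directory via the LISTSTATUS operation; the response body is JSON
# with one FileStatus entry per child.
response = requests.get('http://host:port/webhdfs/v1/some/dir',
                        params={'op': 'LISTSTATUS'})
print(response.json())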

You can also convert curl commands to Python like this:

  1. Get the curl command for the HDFS operation: https://hadoop.apache.org/docs/r1.0.4/webhdfs.html
  2. Convert to python: https://curl.trillworks.com/

Hope this helps!



Source: https://stackoverflow.com/questions/47926758/python-write-to-hdfs-file
