How to get list of all columns from a parquet file using s3 select?

99封情书 提交于 2019-12-20 07:19:13

问题


I have a parquet file stored in S3 bucket. I want to get the list of all columns of the parquet file. I am using s3 select but it just give me list of all rows wihtout any column headers.

Is there anyway to get all column names from this parquet file without downloading it completely? Since parquet file can be very large, I would not want to download the entire parquet file which is why I am using s3 select to pick first few rows using

select * from S3Object LIMIT 10

I tried to fetch column names explicitly by doing

SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'S3Object'

but it would not work as AWS S3 doesn't support this yet.

Is there any other way to achieve the same?


回答1:


I have the same problem but unfortunately, my Google-Kung-Fu was not strong enough this time.

I found the following workaround which I don't really like but it works for me:

r = s3.select_object_content(Bucket='...your bucket...', 
                             Key='...your key...', 
                             ExpressionType='SQL', 
                             Expression="select s.* from S3Object s limit 1",
                             InputSerialization={'Parquet': {}}, 
                              OutputSerialization={'JSON': {}})
row = json.loads([rec['Records']['Payload'].decode('utf-8') for rec in r['Payload'] if 'Records' in rec][0])

print("Columns: ", list(row.keys()))

I.e. what the code does is to request the first line of the data, to extract the payload and to load the returned JSON object. The obtained JSON object has the following structure {"Column name": "value", ....}, such that one only has to extract the keys of the JSON object (last line).

One additional problem is that this does not return the types of columns. This is something that I could not solve yet.

UPDATE: I observed that in some situations, some column names were not correct reported. Instead of the real name something like _18, _19 were returned. No idea how to deal with it.



来源:https://stackoverflow.com/questions/57451552/how-to-get-list-of-all-columns-from-a-parquet-file-using-s3-select

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!