Question
I have a file stored in HDFS as part-m-00000.gz.parquet
I've tried running hdfs dfs -text dir/part-m-00000.gz.parquet, but since the file is compressed I then ran gunzip part-m-00000.gz.parquet; that doesn't uncompress the file, because gunzip doesn't recognise the .parquet extension.
How do I get the schema / column names for this file?
Answer 1:
You won't be able to "open" the file using hdfs dfs -text because it's not a text file. Parquet files are written to disk very differently compared to text files.
For this very purpose, the Parquet project provides parquet-tools to do tasks like the one you are trying to do: open and inspect the schema, data, metadata, etc.
Check out the parquet-tools project (which is, put simply, a jar file): parquet-tools
Cloudera, which supports and contributes heavily to Parquet, also has a nice page with examples of parquet-tools usage. An example from that page for your use case is:
parquet-tools schema part-m-00000.parquet
Check out the Cloudera page: Using the Parquet File Format with Impala, Hive, Pig, HBase, and MapReduce.
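If the file is still sitting in HDFS, a minimal sketch of the whole round trip might look like the following, assuming the parquet-tools command is on your PATH (the dir/ path is just the placeholder from the question):

# Copy the Parquet file from HDFS to the local filesystem
hdfs dfs -copyToLocal dir/part-m-00000.gz.parquet .

# Print the Parquet schema (column names and types)
parquet-tools schema part-m-00000.gz.parquet

# Optionally print the file-level metadata (row counts, encodings, compression codec)
parquet-tools meta part-m-00000.gz.parquet

The gzip compression indicated by the .gz in the file name is applied inside the Parquet file itself, so parquet-tools reads it transparently and no manual gunzip step is needed.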
Answer 2:
If your Parquet files are located in HDFS or S3, as in my case, you can try something like the following:
HDFS
parquet-tools schema hdfs://<YOUR_NAME_NODE_IP>:8020/<YOUR_FILE_PATH>/<YOUR_FILE>.parquet
S3
parquet-tools schema s3://<YOUR_BUCKET_PATH>/<YOUR_FILE>.parquet
Hope it helps.
Answer 3:
Since it is not a text file, you cannot run "-text" on it. Even if you do not have parquet-tools installed, you can read the file easily through Hive, provided you can load it into a Hive table.
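Note that plain Hive needs the column list up front when you create the table, whereas Impala (covered by the same Cloudera page referenced in Answer 1) can derive the column definitions straight from an existing Parquet file. A minimal sketch of that route, assuming a hypothetical HDFS path /user/me/dir/ and a throwaway table name:

# Create a table whose columns are inferred from the Parquet file itself
impala-shell -q "CREATE EXTERNAL TABLE parquet_probe LIKE PARQUET '/user/me/dir/part-m-00000.gz.parquet' STORED AS PARQUET LOCATION '/user/me/dir/'"

# Show the inferred column names and types
impala-shell -q "DESCRIBE parquet_probe"

Once a table (Hive or Impala) points at the file, DESCRIBE gives you the schema and a simple SELECT lets you inspect the data.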
Source: https://stackoverflow.com/questions/33883640/how-do-i-get-schema-column-names-from-parquet-file