Question
My question is, in essence, an application of this referenced question:
Convert JSON to Parquet
I find myself in the rather unusual position of having to semi-manually curate an Avro schema for the superset of fields contained in JSON files (composed of arbitrary combinations of known resources) in an HDFS directory.
This is part of an ETL pipeline I am trying to develop to convert these files to Parquet for much more efficient/easier processing in Spark. I have never written a MapReduce program before, so I am starting from scratch. If anyone has encountered this type of problem before, I would appreciate any insights. Thanks!
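For reference, here is a minimal sketch of the kind of job I have in mind, assuming one JSON record per line (the default TextInputFormat) and the org.json parser; the class names and paths are just placeholders:

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.json.JSONObject;

public class UniqueJsonFields {

    public static class FieldMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Parse one JSON record and emit every field path it contains.
            JSONObject obj = new JSONObject(value.toString());
            emitFields("", obj, context);
        }

        private void emitFields(String prefix, JSONObject obj, Context context)
                throws IOException, InterruptedException {
            Iterator<String> keys = obj.keys();
            while (keys.hasNext()) {
                String name = keys.next();
                String path = prefix.isEmpty() ? name : prefix + "." + name;
                Object child = obj.get(name);
                if (child instanceof JSONObject) {
                    // Recurse so nested field paths (a.b.c) are captured too.
                    emitFields(path, (JSONObject) child, context);
                } else {
                    context.write(new Text(path), NullWritable.get());
                }
            }
        }
    }

    public static class DedupReducer
            extends Reducer<Text, NullWritable, Text, NullWritable> {

        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            // Each distinct field path comes to one reduce call; write it once.
            context.write(key, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "unique-json-fields");
        job.setJarByClass(UniqueJsonFields.class);
        job.setMapperClass(FieldMapper.class);
        job.setCombinerClass(DedupReducer.class); // dedupe early on the map side
        job.setReducerClass(DedupReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The output would be the distinct set of (possibly nested) field names across all files, which I could then use as the starting point for hand-curating the Avro schema. I am not sure whether this is the idiomatic approach, or whether the arrays/multi-record layout of my files breaks the one-record-per-line assumption.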
Source: https://stackoverflow.com/questions/35495041/mapreduce-job-to-collect-all-unique-fields-in-hdfs-directory-of-json