Question
My question is, in essence, an application of this referenced question:
Convert JSON to Parquet
I find myself in the rather unusual position of having to semi-manually curate an Avro schema for the superset of fields contained in JSON files (composed of arbitrary combinations of known resources) in an HDFS directory.
This is part of an ETL pipeline I am trying to develop to convert these files to Parquet for much more efficient/easier processing in Spark. I have never written a MapReduce program before, so I am starting from scratch. If anyone has encountered this type of problem before, I would appreciate any insights. Thanks!
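For reference, here is a minimal sketch of the kind of job I have in mind, assuming one JSON record per line (the default TextInputFormat) and the org.json parser; the class names and paths are just placeholders:

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.json.JSONObject;

public class UniqueJsonFields {

    public static class FieldMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Parse one JSON record and emit every field path it contains.
            JSONObject obj = new JSONObject(value.toString());
            emitFields("", obj, context);
        }

        private void emitFields(String prefix, JSONObject obj, Context context)
                throws IOException, InterruptedException {
            Iterator<String> keys = obj.keys();
            while (keys.hasNext()) {
                String name = keys.next();
                String path = prefix.isEmpty() ? name : prefix + "." + name;
                Object child = obj.get(name);
                if (child instanceof JSONObject) {
                    // Recurse so nested field paths (a.b.c) are captured too.
                    emitFields(path, (JSONObject) child, context);
                } else {
                    context.write(new Text(path), NullWritable.get());
                }
            }
        }
    }

    public static class DedupReducer
            extends Reducer<Text, NullWritable, Text, NullWritable> {

        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            // Each distinct field path comes to one reduce call; write it once.
            context.write(key, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "unique-json-fields");
        job.setJarByClass(UniqueJsonFields.class);
        job.setMapperClass(FieldMapper.class);
        job.setCombinerClass(DedupReducer.class); // dedupe early on the map side
        job.setReducerClass(DedupReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The output would be the distinct set of (possibly nested) field names across all files, which I could then use as the starting point for hand-curating the Avro schema. I am not sure whether this is the idiomatic approach, or whether the arrays/multi-record layout of my files breaks the one-record-per-line assumption.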
Source: https://stackoverflow.com/questions/35495041/mapreduce-job-to-collect-all-unique-fields-in-hdfs-directory-of-json