How do I inspect the content of a Parquet file from the command line?
The only option I see now is
$ hadoop fs -get my-path local-file
$ parquet-tool
On Windows 10 x64 I ended up building parquet-reader just now from source:
Installed WSL with Ubuntu LTS 18.04. Upgraded gcc to v9.2.1 and CMake to latest. Bonus: install Windows Terminal.
git clone https://github.com/apache/arrow
cd arrow
cd cpp
mkdir buildgcc
cd buildgcc
cmake .. -DPARQUET_BUILD_EXECUTABLES=ON -DARROW_PARQUET=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_BROTLI=ON -DPARQUET_BUILD_EXAMPLES=ON -DARROW_CSV=ON
make -j 20
cd release
./parquet-reader
Usage: parquet-reader [--only-metadata] [--no-memory-map] [--json] [--dump] [--print-key-value-metadata] [--columns=...]
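As a rough sketch of how the flags in that usage line might be used day to day, the two shell functions below wrap the binary. The names pq_meta, pq_json and the PARQUET_READER variable are hypothetical, and passing the Parquet file as the last argument is an assumption, since the usage line above does not show where the file name goes.

```shell
# Path to the parquet-reader binary built above.
# PARQUET_READER is a hypothetical variable; it defaults to the current directory.
PARQUET_READER="${PARQUET_READER:-./parquet-reader}"

# Print only the file metadata, without dumping row values.
# Assumes the file path is accepted as the last argument.
pq_meta() {
    "$PARQUET_READER" --only-metadata "$1"
}

# Print the file in JSON form, using the --json flag from the usage line.
pq_json() {
    "$PARQUET_READER" --json "$1"
}
```

For example, pq_meta local-file would run ./parquet-reader --only-metadata local-file from the release directory.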
If the build fails, you may have to use vcpkg to install the missing libraries.
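Before digging into missing libraries, it may be worth checking that the basic tools used in the steps above are on the PATH. This is only a minimal sketch: missing_tools is a hypothetical helper name, and g++ as the compiler command is an assumption.

```shell
# Print the name of every given command that is not available on the PATH,
# one per line. Prints nothing if all of them are found.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Running missing_tools git cmake make g++ inside WSL should print nothing when the toolchain is complete.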
Also see another solution that offers less, but in a simpler way: https://github.com/chhantyal/parquet-cli
Linked from: How can I write streaming/row-oriented data using parquet-cpp without buffering?
I initially tried brew install parquet-tools, but it did not appear to work under my WSL install.
Same as above. Use CMake to generate the Visual Studio 2019 project, then build.
git clone https://github.com/apache/arrow
cd arrow
cd cpp
mkdir buildmsvc
cd buildmsvc
cmake .. -DPARQUET_BUILD_EXECUTABLES=ON -DARROW_PARQUET=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_BROTLI=ON -DPARQUET_BUILD_EXAMPLES=ON -DARROW_CSV=ON
# Then open the generated .sln file in MSVC and build. Everything should build perfectly.
Troubleshooting:
In case any libraries were missing, I pointed CMake at my install of vcpkg. I ran vcpkg integrate install, then copied the toolchain option below to the end of the CMake line:
-DCMAKE_TOOLCHAIN_FILE=[...path...]/vcpkg/scripts/buildsystems
If it had complained about any missing libraries, I would have installed them (e.g. boost) using commands like vcpkg install boost:x64-windows.