I have been looking into how Spark stores statistics (min/max) in Parquet as well as how it uses the info for query optimization. I have got a few questions. First setup: Sp
PARQUET-686 made changes to intentionally ignore statistics on binary field when it seems to be appropriate. You can override this behavior by setting parquet.strings.signed-min-max.enabled
to true
.
After setting that config, you can read min/max in binary field with parquet-tools.
More details in my another stackoverflow question