Question
I'm trying to read a local Parquet file, but the only APIs I can find are tightly coupled to Hadoop and require a Hadoop Path as input (even when pointing to a local file).
This has been asked several times, but quite a long time ago, and all the answers are coupled to Hadoop.
ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(file).build();
GenericRecord nextRecord = reader.read();
is the most popular answer in "how to read a parquet file, in a standalone java code?", but it requires a Hadoop Path, and that builder has now been deprecated in favor of a mysterious InputFile instead. The only implementation of InputFile I can find is HadoopInputFile, so again no help.
In Avro this is simply:
DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
this.dataFileReader = new DataFileReader<>(file, datumReader);
(where file is a java.io.File). What's the Parquet equivalent?
I am asking for no Hadoop Path dependency in the answers, because Hadoop drags in bloat and jar hell, and it seems silly to require it for reading local files.
To further explain the backstory, I maintain a small IntelliJ plugin that allows users to drag-and-drop Avro files into a pane for viewing in a table. This plugin is currently 5MB. If I include Parquet and Hadoop dependencies, it bloats to over 50MB, and doesn't even work.
POST-ANSWER ADDENDUM
Now that I have it working (thanks to the accepted answer), here is my working solution that avoids all the annoying errors that can be dragged in by depending heavily on the Hadoop Path API:
- ParquetReader.java
- LocalInputFile.java
Answer 1:
Unfortunately, the Java Parquet implementation is not independent of some Hadoop libraries. There is an existing issue in their bug tracker to make it easy to read and write Parquet files in Java without depending on Hadoop, but there does not seem to be much progress on it. The InputFile interface was added to provide a bit of decoupling, but many of the classes that implement the metadata part of Parquet, and also all compression codecs, live inside the Hadoop dependency.
I found another implementation of InputFile in the smile library. It might be more efficient than going through the Hadoop filesystem abstraction, but it does not solve the dependency problem.
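To illustrate how little of the local file access actually needs Hadoop, here is a minimal sketch that uses only the JDK to check a file's Parquet framing: the 4-byte "PAR1" magic at the start and end of the file, and the 4-byte little-endian footer length stored just before the trailing magic. The class name ParquetFooterProbe and the method footerLength are made up for this example and are not part of any Parquet library. It only locates the footer; decoding the metadata and the column data still requires the Parquet classes discussed above.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of parquet-mr: inspects the file framing only.
public class ParquetFooterProbe {
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    /** Returns the footer metadata length in bytes, or throws if the framing is wrong. */
    static int footerLength(File file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            long size = raf.length();
            // Smallest possible layout: leading magic + 4-byte length + trailing magic.
            if (size < MAGIC.length * 2 + 4) {
                throw new IOException("Too small to be a Parquet file: " + size + " bytes");
            }
            // Read the leading magic bytes.
            byte[] head = new byte[MAGIC.length];
            raf.readFully(head);
            // Read the last 8 bytes: footer length (little-endian int) + trailing magic.
            byte[] tail = new byte[4 + MAGIC.length];
            raf.seek(size - tail.length);
            raf.readFully(tail);
            byte[] tailMagic = Arrays.copyOfRange(tail, 4, tail.length);
            if (!Arrays.equals(head, MAGIC) || !Arrays.equals(tailMagic, MAGIC)) {
                throw new IOException("Missing PAR1 magic bytes");
            }
            return ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: ParquetFooterProbe <file>");
            return;
        }
        System.out.println("Footer length: " + footerLength(new File(args[0])));
    }
}
```

The same seek-and-read pattern on a RandomAccessFile or FileChannel is roughly what a local InputFile implementation has to provide to the reader, which is why such an implementation can stay small.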
As other answers already mention, you can create a Hadoop Path for a local file and use that without problems.
java.io.File file = ...
new org.apache.hadoop.fs.Path(file.toURI())
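The File-to-URI half of that conversion is plain JDK and can be checked on its own. The sketch below (the class name LocalFileUri and the method toFileUri are invented for illustration) shows that File.toURI() yields an absolute URI with the "file" scheme, which is the form a URI-based Path constructor expects.

```java
import java.io.File;
import java.net.URI;

// Hypothetical helper for illustration only.
public class LocalFileUri {
    /** Converts a local file to a file: URI; relative paths are resolved against the working directory. */
    static URI toFileUri(File file) {
        return file.toURI();
    }

    public static void main(String[] args) {
        URI uri = toFileUri(new File("myfile.snappy.parquet"));
        // The printed value depends on the working directory and the platform.
        System.out.println(uri);
    }
}
```

Building the URI this way avoids hand-writing "file:" strings, where backslashes and drive letters are easy to get wrong.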
The dependency tree that is pulled in by hadoop can be reduced a lot by defining some exclusions. I'm using the following to reduce the bloat (using gradle syntax):
compile("org.apache.hadoop:hadoop-common:3.1.0") {
    exclude(group: 'org.slf4j')
    exclude(group: 'org.mortbay.jetty')
    exclude(group: 'javax.servlet.jsp')
    exclude(group: 'com.sun.jersey')
    exclude(group: 'log4j')
    exclude(group: 'org.apache.curator')
    exclude(group: 'org.apache.zookeeper')
    exclude(group: 'org.apache.kerby')
    exclude(group: 'com.google.protobuf')
}
Answer 2:
If avoiding Hadoop is really non-negotiable, you can try Spark and run it in local mode. A quick start guide can be found here: https://spark.apache.org/docs/latest/index.html. To download it, go to https://archive.apache.org/dist/spark/ (pick a version you like; there is always a build without Hadoop. Unfortunately, the size of the compressed version is still around 10-15M). You will also be able to find some Java examples in examples/src/main.
After that, you can read the file in as a Spark DataFrame like this:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
// Run Spark in-process; local[*] uses all available cores
SparkSession spark = SparkSession.builder().appName("Reducing dependency by adding more dependencies").master("local[*]").getOrCreate();
// In the Java API a DataFrame is a Dataset<Row>
Dataset<Row> parquet = spark.read().parquet("C:/files/myfile.csv.parquet");
parquet.show(20);
This solution does satisfy the original conditions in the question. However, there is no getting around the fact that it is like beating around the bush (but hell yeah, it's funny). Still, it might help open up a new way to tackle this.
Answer 3:
You can use the ParquetFileReader class for that:
dependencies {
    compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.2.0'
    compile group: 'org.apache.parquet', name: 'parquet-hadoop', version: '1.10.1'
}
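You can specify your local file path here:
Configuration conf = new Configuration(); // org.apache.hadoop.conf.Configuration
Path path = new Path("file:///C:/myfile.snappy.parquet"); // org.apache.hadoop.fs.Path
ParquetMetadata footer = ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER);
ParquetFileReader r = new ParquetFileReader(conf, path, footer);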
Source: https://stackoverflow.com/questions/59939309/read-local-parquet-file-without-hadoop-path-api