Read specific column from Parquet without using Spark

Submitted by Anonymous (unverified) on 2019-12-03 01:36:02

Question:

I am trying to read Parquet files without using Apache Spark, and I am able to do it, but I am finding it hard to read specific columns. I am not able to find any good resource on Google, as almost all the posts are about reading Parquet files using Spark. Below is my code:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.avro.generic.GenericRecord
import org.apache.parquet.hadoop.ParquetReader
import org.apache.parquet.avro.AvroParquetReader

object parquetToJson {
  def main(args: Array[String]): Unit = {
    // case class Customer(key: Int, name: String, sellAmount: Double, profit: Double, state: String)
    val parquetFilePath = new Path("data/parquet/Customer/")
    val reader = AvroParquetReader.builder[GenericRecord](parquetFilePath).build()
    val iter = Iterator.continually(reader.read).takeWhile(_ != null)
    val list = iter.toList
    list.foreach(record => println(record))
  }
}

The commented-out case class represents the schema of my file, and right now the above code reads all the columns from the file. I want to read only specific columns.

Answer 1:

If you just want to read specific columns, then you need to set a read schema on the configuration that the ParquetReader builder accepts. (This is also known as a projection.)

In your case you should be able to call .withConf(conf) on the AvroParquetReader builder class, and on the conf you pass in, invoke conf.set(ReadSupport.PARQUET_READ_SCHEMA, schema), where schema is an Avro schema in String form.
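A minimal sketch of what that could look like, building on the question's code and following this answer's suggestion. The projection schema is an assumption based on the commented-out Customer case class (here it keeps only the name and profit columns); adjust the field names and types to match your actual file:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.avro.generic.GenericRecord
import org.apache.parquet.avro.AvroParquetReader
import org.apache.parquet.hadoop.api.ReadSupport

object parquetColumnsToJson {
  def main(args: Array[String]): Unit = {
    // Projection: an Avro schema (as a String) listing only the columns to read.
    // Field names and types here are assumed from the Customer case class above.
    val projection =
      """{
        |  "type": "record",
        |  "name": "Customer",
        |  "fields": [
        |    {"name": "name", "type": "string"},
        |    {"name": "profit", "type": "double"}
        |  ]
        |}""".stripMargin

    val conf = new Configuration()
    conf.set(ReadSupport.PARQUET_READ_SCHEMA, projection)

    val reader = AvroParquetReader
      .builder[GenericRecord](new Path("data/parquet/Customer/"))
      .withConf(conf) // the read schema reaches the reader via this conf
      .build()

    // Each record now exposes only the projected columns.
    Iterator.continually(reader.read).takeWhile(_ != null).foreach(println)
  }
}

Because Parquet stores data column by column, a projection like this means the non-requested columns are skipped when reading, rather than read and then discarded.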


