parquet

Efficient reading nested parquet column in Spark

六眼飞鱼酱① submitted on 2019-11-30 17:46:09
Question: I have the following (simplified) schema:

root
 |-- event: struct (nullable = true)
 |    |-- spent: struct (nullable = true)
 |    |    |-- amount: decimal(34,3) (nullable = true)
 |    |    |-- currency: string (nullable = true)
 |    |
 |    | ... ~ 20 other struct fields on "event" level

I'm trying to sum over the nested field:

spark.sql("select sum(event.spent.amount) from event")

According to the Spark metrics I'm reading 18 GB from disk and it takes 2.5 min. However, when I select the top-level field: spark.sql("select sum
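
A minimal sketch (not part of the original question), assuming Spark 2.4 or later, where nested-column pruning for Parquet can be switched on so that only event.spent.amount is read instead of the whole event struct; the path and view name are placeholders. On older versions a common workaround is to pass an explicit read schema that contains only the nested field you need.

// Hypothetical path; nestedSchemaPruning is an opt-in flag in Spark 2.4 (on by default in 3.x).
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

val events = spark.read.parquet("/data/events.parquet")
events.createOrReplaceTempView("event")

spark.sql("select sum(event.spent.amount) from event").show()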

How to read and write Map<String, Object> from/to parquet file in Java or Scala?

▼魔方 西西 submitted on 2019-11-30 17:37:13
Looking for a concise example of how to read and write a Map<String, Object> from/to a parquet file in Java or Scala. Here is the expected structure, using com.fasterxml.jackson.databind.ObjectMapper as the serializer in Java (i.e. looking for the equivalent using parquet):

public static Map<String, Object> read(InputStream inputStream) throws IOException {
    ObjectMapper objectMapper = new ObjectMapper();
    return objectMapper.readValue(inputStream, new TypeReference<Map<String, Object>>() {
    });
}

public static void write(OutputStream outputStream, Map<String, Object> map) throws IOException {
    ObjectMapper
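
A minimal sketch (not from the original question), assuming Spark is acceptable as the Parquet reader/writer. Parquet has no generic "Object" type, so heterogeneous values are flattened to strings here (JSON-encoding them would work the same way); the file path, sample map, and the column name "payload" are all placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("map-to-parquet").getOrCreate()
import spark.implicits._

// write: wrap the map in a single-column row and save it as Parquet
val data: Map[String, Any] = Map("name" -> "alice", "age" -> 30)
val asStrings = data.map { case (k, v) => k -> v.toString }
Seq(Tuple1(asStrings)).toDF("payload").write.mode("overwrite").parquet("/tmp/map.parquet")

// read: pull the map column back out of the first row
val row = spark.read.parquet("/tmp/map.parquet").head()
val restored: Map[String, String] = row.getMap[String, String](0).toMap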

How to suppress parquet log messages in Spark?

流过昼夜 submitted on 2019-11-30 17:16:31
How do I stop messages like these from appearing on my spark-shell console?

5 May, 2015 5:14:30 PM INFO: parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
5 May, 2015 5:14:30 PM INFO: parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 89213 records.
5 May, 2015 5:14:30 PM INFO: parquet.hadoop.InternalParquetRecordReader: block read in memory in 2 ms. row count = 120141
5 May, 2015 5:14:30 PM INFO: parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
5 May, 2015 5:14:30 PM INFO: parquet.hadoop.InternalParquetRecordReader:
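
A minimal sketch (not from the original thread), assuming these messages come from the old parquet.* loggers, which write through java.util.logging rather than log4j (the "INFO: parquet.hadoop..." formatting suggests as much). Raising the parent logger's level from the spark-shell before reading any Parquet data is one way to silence them.

import java.util.logging.{Level, Logger}

// Raise the parent "parquet" logger (and any console handlers attached to it) to SEVERE
// so that INFO records from parquet.hadoop.* are no longer published.
val parquetLogger = Logger.getLogger("parquet")
parquetLogger.setLevel(Level.SEVERE)
parquetLogger.getHandlers.foreach(_.setLevel(Level.SEVERE))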

How to read a parquet file in standalone Java code? [closed]

半腔热情 submitted on 2019-11-30 11:28:21
The parquet docs from Cloudera show examples of integration with Pig/Hive/Impala, but in many cases I want to read the parquet file itself for debugging purposes. Is there a straightforward Java reader API to read a parquet file? Thanks, Yang

You can use AvroParquetReader from the parquet-avro library to read a parquet file as a set of Avro GenericRecord objects.

rishiehari: Old method (deprecated):

AvroParquetReader<GenericRecord> reader = new AvroParquetReader<GenericRecord>(file);
GenericRecord nextRecord = reader.read();

New method:

ParquetReader<GenericRecord> reader = AvroParquetReader.
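
A minimal sketch (not from the original answers) of the builder-based API called from Scala, assuming a recent parquet-avro where the classes live under org.apache.parquet; the file path is a placeholder.

import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetReader
import org.apache.parquet.hadoop.ParquetReader

val reader: ParquetReader[GenericRecord] =
  AvroParquetReader.builder[GenericRecord](new Path("/tmp/data.parquet")).build()

// Iterate until read() returns null, printing each record for debugging.
Iterator.continually(reader.read()).takeWhile(_ != null).foreach(println)
reader.close()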

Is querying against a Spark DataFrame based on CSV faster than one based on Parquet?

霸气de小男生 submitted on 2019-11-30 10:19:33
I have to load a CSV file from HDFS into a DataFrame using Spark. I was wondering if there is a "performance" improvement (query speed) with a DataFrame backed by a CSV file versus one backed by a Parquet file? Typically, I load a CSV file like the following into a data frame:

val df1 = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs://box/path/to/file.csv")

On the other hand, loading a parquet file (assuming I've parsed the CSV file, created a schema, and saved it to HDFS) looks like the following:

val df2 = sqlContext
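
A minimal sketch (not from the original post) of the comparison being asked about: write the CSV-backed DataFrame out as Parquet once, then query the Parquet copy. The output path and the column name are placeholders; because Parquet stores its schema and is columnar, repeated queries generally avoid the per-read CSV parsing and schema inference.

// one-time conversion of the CSV-backed DataFrame
df1.write.mode("overwrite").parquet("hdfs://box/path/to/file.parquet")

// subsequent reads hit the Parquet copy instead of re-parsing the CSV
val df2 = sqlContext.read.parquet("hdfs://box/path/to/file.parquet")
df2.groupBy("someColumn").count().show()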

Reading a parquet file with spark-shell

梦想的初衷 submitted on 2019-11-30 05:40:59
1. Start a spark-shell session.
2. val sqlContext = new org.apache.spark.sql.SQLContext(sc)
3. val parquetFile = sqlContext.parquetFile("hdfs://cdp/user/az-user/sparkStreamingKafka2HdfsData/part-00000-ff60a7d3-bf91-4717-bd0b-6731a66b9904-c000.snappy.parquet")
   Here hdfs://cdp is the defaultFS, so it can also be omitted, as follows:
   val parquetFile2 = sqlContext.parquetFile("/user/az-user/sparkStreamingKafka2HdfsData/part-00000-ff60a7d3-bf91-4717-bd0b-6731a66b9904-c000.snappy.parquet")
4. parquetFile.take(30).foreach(println)
Reference: https://www.jianshu.com/p/57b20d9d7b4a?utm_campaign=maleskine&utm_content=note&utm_medium=seo_notes&utm_source
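
A minimal sketch (not from the original post), assuming Spark 2.x or later, where spark-shell already provides a SparkSession named spark; sqlContext.parquetFile is deprecated there in favor of the DataFrameReader API.

// the same read through the newer API; the path is the one from step 3 above
val parquetDF = spark.read.parquet("/user/az-user/sparkStreamingKafka2HdfsData/part-00000-ff60a7d3-bf91-4717-bd0b-6731a66b9904-c000.snappy.parquet")
parquetDF.show(30, truncate = false)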

How to view Apache Parquet file in Windows?

限于喜欢 submitted on 2019-11-30 04:59:36
I couldn't find any plain-English explanations of Apache Parquet files, such as: What are they? Do I need Hadoop or HDFS to view/create/store them? How can I create parquet files? How can I view parquet files? Any help with these questions is appreciated.

What is Apache Parquet? Apache Parquet is a binary file format that stores data in a columnar fashion. Data inside a Parquet file is similar to an RDBMS-style table in which you have columns and rows, but instead of accessing the data one row at a time, you typically access it one column at a time. Apache Parquet is one of the modern
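
A minimal sketch (not from the original answers) of one way to create and inspect a Parquet file on a local machine without an HDFS cluster, using Spark in local mode (on Windows this typically also requires winutils.exe); the paths and sample data are placeholders. Standalone viewers such as parquet-tools are another option.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("inspect-parquet").getOrCreate()
import spark.implicits._

// create a small Parquet file locally
Seq((1, "a"), (2, "b")).toDF("id", "value").write.mode("overwrite").parquet("C:/tmp/demo.parquet")

// view it: print the schema plus a few rows
val df = spark.read.parquet("C:/tmp/demo.parquet")
df.printSchema()
df.show()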

Spark DataFrames with Parquet and Partitioning

橙三吉。 submitted on 2019-11-30 04:19:34
Question: I have not been able to find much information on this topic, but let's say we use a DataFrame to read in a parquet file that spans 10 blocks; Spark will naturally create 10 partitions. But when the DataFrame reads in the file to process it, won't it be processing a large data-to-partition ratio? If it were processing the file uncompressed, the block size would have been much larger, meaning the partitions would be larger as well. So let me clarify: parquet compressed (these numbers are not
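
A minimal sketch (not from the original question) of the usual knobs when the partitions produced by a Parquet read turn out too large once the data is decompressed in memory: repartition after the read, or lower spark.sql.files.maxPartitionBytes on Spark 2.x+. The path and the partition count of 200 are placeholders.

val df = spark.read.parquet("/data/events.parquet")
println(s"partitions after read: ${df.rdd.getNumPartitions}")

// spread the same data over more, smaller partitions before heavy processing
val widened = df.repartition(200)
println(s"partitions after repartition: ${widened.rdd.getNumPartitions}")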

How to convert spark SchemaRDD into RDD of my case class?

倾然丶 夕夏残阳落幕 submitted on 2019-11-30 03:56:05
In the Spark docs it's clear how to create parquet files from an RDD of your own case classes (from the docs):

val people: RDD[Person] = ??? // An RDD of case class objects, from the previous example.
// The RDD is implicitly converted to a SchemaRDD by createSchemaRDD, allowing it to be stored using Parquet.
people.saveAsParquetFile("people.parquet")

But it's not clear how to convert back; really we want a method readParquetFile where we can do:

val people: RDD[Person] = sc.readParquetFile[Person](path)

where the values of the case class are those which are read by the method. The best
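
A minimal sketch (not from the original answers) of the usual pattern in the SchemaRDD era: read the file back as rows and map each Row into the case class, assuming the column order matches the constructor; the path is taken from the snippet above. On Spark 1.6+/2.x Datasets this round trip collapses to spark.read.parquet(path).as[Person].rdd.

case class Person(name: String, age: Int)

// read back as rows, then rebuild the case class field by field
val rows = sqlContext.parquetFile("people.parquet")
val people = rows.map(row => Person(row.getString(0), row.getInt(1)))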

Fast Parquet row count in Spark

狂风中的少年 submitted on 2019-11-30 03:04:27
Question: Parquet files contain a per-block row count field, and Spark seems to read it at some point (SpecificParquetRecordReaderBase.java#L151). I tried this in spark-shell:

sqlContext.read.load("x.parquet").count

Spark ran two stages, showing various aggregation steps in the DAG. I figure this means it reads through the file normally instead of using the row counts. (I could be wrong.) The question is: is Spark already using the row count fields when I run count? Is there another API to use
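
A minimal sketch (not from the original answers): the per-block row counts live in the Parquet footer metadata, which parquet-hadoop can read and sum without scanning any data pages. The path is a placeholder, and readFooter is deprecated in newer parquet releases, though it still works.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import scala.collection.JavaConverters._

// read only the footer, then add up the row count recorded for each row group
val footer = ParquetFileReader.readFooter(new Configuration(), new Path("x.parquet"))
val rowCount = footer.getBlocks.asScala.map(_.getRowCount).sum
println(s"rows: $rowCount")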