parquet

Multiple Spark jobs appending Parquet data to the same base path with partitioning

Submitted by ぃ、小莉子 on 2019-11-30 01:50:56
I have multiple jobs that I want to execute in parallel, each appending daily data into the same path using partitioning, e.g. dataFrame.write().partitionBy("eventDate", "category").mode(Append).parquet("s3://bucket/save/path"); Job 1 - category = "billing_events", Job 2 - category = "click_events". Both of these jobs will truncate any existing partitions in the s3 bucket prior to execution and then save the resulting parquet files to their respective partitions, i.e. job 1 -> s3://bucket/save/path/eventDate=20160101/channel=billing_events, job 2 -> s3://bucket/save/path/eventDate
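A minimal Scala sketch of the write pattern described above; the base path, partition columns, and category come from the question, while the session setup and the JSON source path are assumptions added to make the snippet self-contained.

import org.apache.spark.sql.{SaveMode, SparkSession}

// Sketch only: each daily job appends its own category partition under a shared base path.
val spark = SparkSession.builder().appName("billing_events-daily").getOrCreate()

// Assumed source; the question does not show how the DataFrame is built.
val dataFrame = spark.read.json("s3://bucket/raw/billing_events/20160101/")

dataFrame.write
  .partitionBy("eventDate", "category")   // same partition columns as in the question
  .mode(SaveMode.Append)                  // append so parallel jobs do not clobber the base path
  .parquet("s3://bucket/save/path")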

How to suppress parquet log messages in Spark?

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-11-30 00:55:30
Question: How can I stop messages like the following from appearing on my spark-shell console?

5 May, 2015 5:14:30 PM INFO: parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
5 May, 2015 5:14:30 PM INFO: parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 89213 records.
5 May, 2015 5:14:30 PM INFO: parquet.hadoop.InternalParquetRecordReader: block read in memory in 2 ms. row count = 120141
5 May, 2015 5:14:30 PM INFO: parquet.hadoop.InternalParquetRecordReader:
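A hedged Scala sketch (pasteable into spark-shell) of one way to quiet these loggers. Whether the parquet classes log through java.util.logging or log4j depends on the Spark/Parquet versions involved, so both are raised here, and the logger names are the usual ones rather than confirmed for this setup.

import java.util.logging.{Level => JulLevel, Logger => JulLogger}
import org.apache.log4j.{Level, Logger}

// Old parquet.* classes log via java.util.logging; keep a reference so the setting is not GC'd.
val julParquetLogger = JulLogger.getLogger("parquet")
julParquetLogger.setLevel(JulLevel.SEVERE)

// Newer builds route through log4j under org.apache.parquet (and sometimes plain "parquet").
Logger.getLogger("org.apache.parquet").setLevel(Level.ERROR)
Logger.getLogger("parquet").setLevel(Level.ERROR)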

Data Analysis Practice Based on Spark

Submitted by 你说的曾经没有我的故事 on 2019-11-29 23:23:32
Reposting this article requires citing the source: the WeChat official account EAWorld; violations will be pursued. Introduction: Spark evolved from and builds on MapReduce, inheriting its strengths in distributed parallel computation while fixing MapReduce's obvious shortcomings. Spark mainly comprises the Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX components. This article analyzes Spark RDDs and the shortcomings of developing directly against RDDs, introduces how SparkSQL operates on common existing data systems, and focuses on the SparkSQL Flow development framework that 普元 has distilled from many data development projects.

Contents: 1. Spark RDD; 2. Shortcomings of data development based on Spark RDD; 3. SparkSQL; 4. SparkSQL Flow.

1. Spark RDD: An RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark. It represents an immutable, partitionable collection whose elements can be computed in parallel. RDDs have the characteristics of a dataflow model: automatic fault tolerance, locality-aware scheduling, and scalability.

// Scala: create from an in-memory list
val lines = List("A", "B", "C", "D" …)
val rdd: RDD[String] = sc.parallelize(lines)
// Create from a text file
val rdd: RDD[String] = sc.textFile(
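Since the excerpt's code is cut off, here is a minimal, self-contained sketch of the two RDD-creation patterns it starts to show, assuming a spark-shell-style SparkContext named sc; the file path is a placeholder, not from the article.

import org.apache.spark.rdd.RDD

// Create an RDD from an in-memory list (spark-shell already provides `sc`).
val lines = List("A", "B", "C", "D")
val rddFromList: RDD[String] = sc.parallelize(lines)

// Create an RDD from a text file; the path below is a placeholder.
val rddFromFile: RDD[String] = sc.textFile("hdfs:///data/input.txt")

println(rddFromList.count())   // 4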

Write parquet from AWS Kinesis firehose to AWS S3

Submitted by 陌路散爱 on 2019-11-29 23:04:03
I would like to ingest data into S3 from Kinesis Firehose formatted as parquet. So far I have only found a solution that involves creating an EMR cluster, but I am looking for something cheaper and faster, such as storing the received JSON as parquet directly from Firehose or using a Lambda function. Thank you very much, Javi.

Good news, this feature was released today! Amazon Kinesis Data Firehose can convert the format of your input data from JSON to Apache Parquet or Apache ORC before storing the data in Amazon S3. Parquet and ORC are columnar data formats that save space and enable faster queries. To enable,

Spark DataFrame: How to efficiently split a DataFrame for each group based on the same column values

Submitted by 纵然是瞬间 on 2019-11-29 19:33:15
Question: I have a DataFrame generated as follows:

df.groupBy($"Hour", $"Category")
  .agg(sum($"value").alias("TotalValue"))
  .sort($"Hour".asc, $"TotalValue".desc)

The results look like:

+----+--------+----------+
|Hour|Category|TotalValue|
+----+--------+----------+
|   0|   cat26|      30.9|
|   0|   cat13|      22.1|
|   0|   cat95|      19.6|
|   0|  cat105|       1.3|
|   1|   cat67|      28.5|
|   1|    cat4|      26.8|
|   1|   cat13|      12.6|
|   1|   cat23|       5.3|
|   2|   cat56|      39.6|
|   2|   cat40|      29.7|
|   2|  cat187|      27.9|
|   2|   cat68|       9.8|
|   3|    cat8|      35.6|
| ...| ..
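A hedged Scala sketch of one way to split the frame per group, assuming the aggregated DataFrame above is named df and Hour is an integer column: collect the distinct keys and filter once per key. A partitioned write is noted as an alternative.

import org.apache.spark.sql.DataFrame

// Sketch only: materialize a separate DataFrame per Hour value. Collecting the distinct
// keys to the driver is fine for a handful of groups but not for high-cardinality keys.
val hours: Array[Int] = df.select("Hour").distinct().collect().map(_.getInt(0))

val perHour: Map[Int, DataFrame] =
  hours.map(h => h -> df.filter(df("Hour") === h)).toMap

perHour(0).show()   // only the Hour = 0 rows

// Alternatively, a partitioned write splits the data into one directory per group:
// df.write.partitionBy("Hour").parquet("/tmp/by_hour")   // placeholder path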

EntityTooLarge error when uploading a 5G file to Amazon S3

Submitted by 旧时模样 on 2019-11-29 18:20:28
Question: The Amazon S3 file size limit is supposed to be 5 TB according to this announcement, but I am getting the following error when uploading a 5 GB file '/mahler%2Fparquet%2Fpageview%2Fall-2014-2000%2F_temporary%2F_attempt_201410112050_0009_r_000221_2222%2Fpart-r-222.parquet'

XML Error Message:
<?xml version="1.0" encoding="UTF-8"?>
<Error>
  <Code>EntityTooLarge</Code>
  <Message>Your proposed upload exceeds the maximum allowed size</Message>
  <ProposedSize>5374138340</ProposedSize>
  ...
  <MaxSizeAllowed
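For context, S3's 5 TB limit applies to multipart uploads, while a single PUT tops out at 5 GB, which matches the error above. Below is a hedged Scala sketch of steering Spark output through the s3a connector's multipart settings; it assumes hadoop-aws with s3a is on the cluster (not a given for the era of this question), and the property values are illustrative, not recommendations.

import org.apache.spark.sql.SparkSession

// Sketch only: have the s3a connector upload large objects in parts so no single PUT
// exceeds the 5 GB per-request limit. Verify key names against the Hadoop version in use.
val spark = SparkSession.builder()
  .appName("large-parquet-write")
  .config("spark.hadoop.fs.s3a.multipart.size", "134217728")        // 128 MB parts (assumed value)
  .config("spark.hadoop.fs.s3a.multipart.threshold", "2147483648")  // switch to multipart above ~2 GB
  .getOrCreate()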

Create Parquet files in Java

Submitted by 不羁岁月 on 2019-11-29 18:17:14
Question: Is there a way to create parquet files from Java? I have data in memory (Java classes) and I want to write it into a parquet file, to later read it from apache-drill. Is there a simple way to do this, like inserting data into a SQL table?

GOT IT. Thanks for the help. Combining the answers and this link, I was able to create a parquet file and read it back with drill.

Answer 1: ParquetWriter's constructors are deprecated (1.8.1) but not ParquetWriter itself; you can still create a ParquetWriter by
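A sketch of the builder-based route the answer points at, written here in Scala against the parquet-avro API (roughly 1.8+); the record schema, field names, and output path are invented for illustration, so check the builder methods against the parquet-mr version actually in use.

import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.ParquetWriter
import org.apache.parquet.hadoop.metadata.CompressionCodecName

// Hypothetical Avro schema and output path, purely for illustration.
val schema: Schema = SchemaBuilder.record("Person").fields()
  .requiredInt("id")
  .requiredString("name")
  .endRecord()

val writer: ParquetWriter[GenericRecord] =
  AvroParquetWriter.builder[GenericRecord](new Path("/tmp/people.parquet"))
    .withSchema(schema)
    .withCompressionCodec(CompressionCodecName.SNAPPY)
    .build()

val rec = new GenericData.Record(schema)
rec.put("id", 1)
rec.put("name", "alice")
writer.write(rec)
writer.close()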

Cast int96 timestamp from parquet to golang

Submitted by 心已入冬 on 2019-11-29 16:49:48
I have this 12-byte array (int96) that I need to turn into a timestamp: [128 76 69 116 64 7 0 0 48 131 37 0]. How do I cast it to a timestamp? I understand the first 8 bytes should be cast to an int64 of milliseconds representing an epoch datetime.

The first 8 bytes are time in nanoseconds, not milliseconds. They are not measured from the epoch either, but from midnight. The date part is stored separately in the last 4 bytes as a Julian day number. Here is the result of an experiment I did earlier that may help. I stored '2000-01-01 12:34:56' as an int96 and dumped it with parquet-tools:

$ parquet-tools dump hdfs://path/to/parquet/file
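A hedged Scala sketch of the decoding the answer describes: nanoseconds since midnight in the first 8 little-endian bytes, Julian day number in the last 4, with 2440588 as the Julian day of the Unix epoch. The sample bytes are the ones from the question.

import java.nio.{ByteBuffer, ByteOrder}
import java.time.Instant

// Decode a Parquet INT96 timestamp: bytes 0-7 = nanos since midnight (little-endian),
// bytes 8-11 = Julian day number (little-endian). 2440588 is the Julian day of 1970-01-01.
def int96ToInstant(bytes: Array[Byte]): Instant = {
  val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
  val nanosOfDay = buf.getLong            // first 8 bytes
  val julianDay  = buf.getInt             // last 4 bytes
  val epochDay   = julianDay - 2440588L   // days since the Unix epoch
  Instant.ofEpochSecond(epochDay * 86400L, nanosOfDay)
}

val sample = Array[Byte](128.toByte, 76, 69, 116, 64, 7, 0, 0, 48, 131.toByte, 37, 0)
println(int96ToInstant(sample))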

Convert a CSV file to a Parquet file using Python

Submitted by …衆ロ難τιáo~ on 2019-11-29 16:45:44
Question: I am trying to convert a .csv file to a .parquet file. The CSV file (Temp.csv) has the following format: 1,Jon,Doe,Denver. I am using the following Python code to convert it to Parquet:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
import os

if __name__ == "__main__":
    sc = SparkContext(appName="CSV2Parquet")
    sqlContext = SQLContext(sc)
    schema = StructType([
        StructField("col1", IntegerType(), True),
        StructField("col2", StringType(), True),

Why can't Impala read parquet files after Spark SQL's write?

Submitted by 你。 on 2019-11-29 12:40:00
I am having some issues with the way that Spark is interpreting columns for parquet. I have an Oracle source with a confirmed schema (df.schema() method):

root
 |-- LM_PERSON_ID: decimal(15,0) (nullable = true)
 |-- LM_BIRTHDATE: timestamp (nullable = true)
 |-- LM_COMM_METHOD: string (nullable = true)
 |-- LM_SOURCE_IND: string (nullable = true)
 |-- DATASET_ID: decimal(38,0) (nullable = true)
 |-- RECORD_ID: decimal(38,0) (nullable = true)

This is then saved as Parquet (df.write().parquet() method) with the corresponding message type (determined by Spark):

message spark_schema {
  optional int64 LM_PERSON
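One mitigation often suggested for Hive/Impala readers of Spark-written Parquet is the spark.sql.parquet.writeLegacyFormat option (available in the Spark 2.x line); whether it fixes this particular decimal/int64 mismatch is an assumption to verify, and the tiny DataFrame and output path below are stand-ins for the Oracle source. A hedged Scala sketch:

import org.apache.spark.sql.SparkSession

// Sketch only: ask Spark to write decimal columns in the legacy fixed_len_byte_array
// layout rather than int32/int64-backed decimals, which older readers handle better.
val spark = SparkSession.builder()
  .appName("impala-compatible-parquet")
  .config("spark.sql.parquet.writeLegacyFormat", "true")
  .getOrCreate()

import spark.implicits._

// Tiny stand-in for the Oracle-sourced DataFrame described in the question.
val df = Seq((BigDecimal(123456789012345L), "EMAIL")).toDF("LM_PERSON_ID", "LM_COMM_METHOD")
df.write.mode("overwrite").parquet("/tmp/lm_person")   // placeholder output path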