parquet

AWS Glue Crawler adding tables for every partition?

Submitted by 我是研究僧i on 2020-01-31 08:28:16
Question: I have several thousand files in an S3 bucket in this form:

    ├── bucket
    │   ├── somedata
    │   │   ├── year=2016
    │   │   ├── year=2017
    │   │   │   ├── month=11
    │   │   │   │   ├── sometype-2017-11-01.parquet
    │   │   │   │   ├── sometype-2017-11-02.parquet
    │   │   │   │   ├── ...
    │   │   │   ├── month=12
    │   │   │   │   ├── sometype-2017-12-01.parquet
    │   │   │   │   ├── sometype-2017-12-02.parquet
    │   │   │   │   ├── ...
    │   │   ├── year=2018
    │   │   │   ├── month=01
    │   │   │   │   ├── sometype-2018-01-01.parquet
    │   │   │   │   ├── sometype-2018-01-02.parquet
    │   │   │   │   ├── ...
    │   ├──
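How the crawler groups partitions is a Glue configuration matter, but as a hedged alternative sketch for a layout like this (bucket, table, and column names below are invented): declare one table rooted at the somedata/ prefix and let the year=/month= directories be registered as partitions of that single table, rather than crawling each leaf folder into its own table.

    // Sketch only: table name, column, and bucket path are hypothetical.
    import org.apache.spark.sql.SparkSession

    object RegisterPartitionedTable {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("RegisterPartitionedTable")
          .enableHiveSupport()
          .getOrCreate()

        // One table rooted at the dataset prefix; year/month come from the directory names.
        spark.sql(
          """CREATE EXTERNAL TABLE IF NOT EXISTS somedata (
            |  payload string
            |)
            |PARTITIONED BY (year int, month int)
            |STORED AS PARQUET
            |LOCATION 's3://bucket/somedata/'""".stripMargin)

        // Discover the existing year=/month= directories as partitions of the single table.
        spark.sql("MSCK REPAIR TABLE somedata")

        spark.stop()
      }
    }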

Creating Hive table on top of multiple parquet files in s3

Submitted by 荒凉一梦 on 2020-01-24 20:51:07
Question: We have our dataset in S3 (Parquet files) in the format below, with the data split across multiple Parquet files by row number:

    data1_1000000.parquet
    data1000001_2000000.parquet
    data2000001_3000000.parquet
    ...

We have more than 2,000 such files, and each file holds a million records. All of the files share the same columns and structure, and one of the columns carries a timestamp we could use to partition the dataset in Hive. How can we point at the dataset and create a single Hive external
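A minimal sketch of the usual approach, with made-up bucket, table, and column names: create one external table whose LOCATION is the common S3 prefix, so every Parquet file under that prefix is read as part of the same table.

    // Hypothetical names throughout; the real column list would come from the files' schema.
    import org.apache.spark.sql.SparkSession

    object SingleExternalTable {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("SingleExternalTable")
          .enableHiveSupport()
          .getOrCreate()

        spark.sql(
          """CREATE EXTERNAL TABLE IF NOT EXISTS events (
            |  id bigint,
            |  event_ts timestamp,
            |  payload string
            |)
            |STORED AS PARQUET
            |LOCATION 's3://my-bucket/data/'""".stripMargin)

        // All data*.parquet files under the prefix are now queryable as one table.
        spark.sql("SELECT COUNT(*) FROM events").show()

        spark.stop()
      }
    }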

Can't read data in Presto - can in Hive

Submitted by ≯℡__Kan透↙ on 2020-01-24 04:51:07
Question: I have a Hive DB, and I created a table compatible with the Parquet file format:

    CREATE EXTERNAL TABLE `default.table`(
      `date` date,
      `udid` string,
      `message_token` string)
    PARTITIONED BY (
      `dt` date)
    ROW FORMAT SERDE
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS INPUTFORMAT
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
    OUTPUTFORMAT
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
      's3://Bucket/Folder'

I added partitions to this table,
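As a hedged illustration of how partitions for a table like this are typically registered (the partition value and S3 path below are placeholders, not taken from the question), adding each partition with an explicit location in the shared metastore is what makes the same data visible to both Hive and Presto:

    // Illustrative only: dt value and S3 path are invented.
    import org.apache.spark.sql.SparkSession

    object AddPartitionSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("AddPartitionSketch")
          .enableHiveSupport()
          .getOrCreate()

        spark.sql(
          """ALTER TABLE `default`.`table`
            |ADD IF NOT EXISTS PARTITION (dt = '2020-01-01')
            |LOCATION 's3://Bucket/Folder/dt=2020-01-01'""".stripMargin)

        // Registered this way, the partition should be visible to any engine reading
        // the same metastore, e.g. from Presto:
        //   SELECT * FROM "default"."table" WHERE dt = DATE '2020-01-01';

        spark.stop()
      }
    }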

Why is Spark DataFrame creating the wrong number of partitions?

Submitted by 只愿长相守 on 2020-01-23 02:52:06
Question: I have a Spark DataFrame with two columns, col1 and col2:

    scala> val df = List((1, "a")).toDF("col1", "col2")
    df: org.apache.spark.sql.DataFrame = [col1: int, col2: string]

When I write df to disk in Parquet format, I want all the data written to a number of files equal to the number of unique values in col1, so I repartition by col1 like this:

    scala> df.repartition(col("col1")).write.partitionBy("col1").parquet("file")

The code above produces only one file in the filesystem, but the number of shuffle
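A small sketch of the effect being asked about (the output path is a placeholder): repartition(col(...)) hash-partitions the data into spark.sql.shuffle.partitions buckets (200 by default), so the DataFrame carries that many partitions even though every row lands in one of them, and only the non-empty partition produces an output file.

    // Sketch; output path is a placeholder.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object PartitionCountDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("PartitionCountDemo")
          .master("local[2]")
          .getOrCreate()
        import spark.implicits._

        val df = List((1, "a")).toDF("col1", "col2")

        val repartitioned = df.repartition(col("col1"))
        // Prints 200 with default settings: the hash shuffle always creates
        // spark.sql.shuffle.partitions partitions, most of them empty here.
        println(repartitioned.rdd.getNumPartitions)

        // Only the single non-empty partition writes a Parquet file.
        repartitioned.write.partitionBy("col1").parquet("/tmp/partition-count-demo")

        spark.stop()
      }
    }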

Dynamically create Hive external table with Avro schema on Parquet Data

Submitted by 孤人 on 2020-01-19 15:39:10
Question: I'm trying to dynamically create a Hive external table on Parquet data files, that is, without listing column names and types in the Hive DDL. I have the Avro schema of the underlying Parquet file. My attempt uses the DDL below:

    CREATE EXTERNAL TABLE parquet_test
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    STORED AS PARQUET
    LOCATION 'hdfs://myParquetFilesPath'
    TBLPROPERTIES ('avro.schema.url'='http://myHost/myAvroSchema.avsc');

My Hive table is successfully created with the right schema, but
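One possible workaround sketch, distinct from the AvroSerDe route the question tries: read the .avsc yourself and generate the column list for a plain Parquet table. The file path, table name, and the deliberately incomplete type mapping below are assumptions for illustration.

    // Sketch: generates Hive DDL from an Avro schema; only primitive types are mapped here.
    import org.apache.avro.Schema
    import scala.collection.JavaConverters._

    object AvroSchemaToHiveDdl {
      def hiveType(s: Schema): String = s.getType match {
        case Schema.Type.INT     => "int"
        case Schema.Type.LONG    => "bigint"
        case Schema.Type.FLOAT   => "float"
        case Schema.Type.DOUBLE  => "double"
        case Schema.Type.BOOLEAN => "boolean"
        case Schema.Type.STRING  => "string"
        // nullable fields are usually unions of [null, T]; take the non-null branch
        case Schema.Type.UNION   => hiveType(s.getTypes.asScala.find(_.getType != Schema.Type.NULL).get)
        case other               => sys.error(s"type mapping not handled in this sketch: $other")
      }

      def main(args: Array[String]): Unit = {
        val avroSchema = new Schema.Parser().parse(new java.io.File("/tmp/myAvroSchema.avsc"))

        val columns = avroSchema.getFields.asScala
          .map(f => s"`${f.name}` ${hiveType(f.schema)}")
          .mkString(",\n  ")

        println(
          s"""CREATE EXTERNAL TABLE parquet_test (
             |  $columns
             |)
             |STORED AS PARQUET
             |LOCATION 'hdfs://myParquetFilesPath'""".stripMargin)
      }
    }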

Row-Oriented and Column-Oriented Formats in Hadoop, Explained

Submitted by 你说的曾经没有我的故事 on 2020-01-17 13:58:42
Preface: When it comes to formats for storing data on HDFS, the usual candidates are the row-oriented formats Avro and SequenceFile (the latter now less commonly used) and the column-oriented formats Parquet, ORC, and so on. So how do you choose one when storing data?

Row-oriented formats (Avro and SequenceFile as examples)

Avro basics: Avro is a programming-language-independent data serialization system. It was introduced to solve the lack of language portability in Hadoop's Writable types. Avro data files are designed primarily for cross-language use, so we can write a file in Python and read it back in C. That makes Avro datasets easier to share publicly and gives the data more staying power: the data can outlive the language originally used to read and write it.

Avro's data format, and the formats of Avro and SequenceFile (the biggest difference between the two is that Avro data files are designed mainly for cross-language use): a SequenceFile consists of a file header followed by one or more records (shown in a figure omitted here). The first three bytes of a SequenceFile are SEQ (the sequence-file magic code), and the byte immediately after is the SequenceFile version number. The header also contains other fields, such as the names of the key and value classes, data compression details, user-defined metadata, and the sync marker (the format details of these fields are covered in the SequenceFile documentation at http://bit.ly/sequence_file_docs and in the source code). As mentioned earlier
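As a small, hedged illustration of the header layout just described (the HDFS path is made up), the first four bytes of any SequenceFile can be checked directly: three magic bytes spelling SEQ followed by the version byte.

    // Reads only the 4-byte header prefix described above; path is a placeholder.
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object SequenceFileHeaderCheck {
      def main(args: Array[String]): Unit = {
        val path = new Path("hdfs:///tmp/example.seq")
        val fs = FileSystem.get(new Configuration())
        val in = fs.open(path)
        try {
          val magic = new Array[Byte](3)
          in.readFully(magic)                 // expected to spell "SEQ"
          val version = in.readByte()         // SequenceFile version number
          println(s"magic=${new String(magic, "US-ASCII")} version=$version")
        } finally {
          in.close()
        }
      }
    }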

[Spark SQL] 5. Basic Usage of DataFrame & DataSet

Submitted by 孤人 on 2020-01-17 02:51:54
Interoperating between DataFrame and RDD

    /**
     * Interoperating between DataFrame and RDD
     */
    object DataFrameRDDApp {

      def main(args: Array[String]) {
        val spark = SparkSession.builder()
          .appName("DataFrameRDDApp")
          .master("local[2]")
          .getOrCreate()

        //inferReflection(spark)
        program(spark)

        spark.stop()
      }

      def program(spark: SparkSession): Unit = {
        // RDD ==> DataFrame
        val rdd = spark.sparkContext.textFile("file:///Users/data/infos.txt")
        val infoRDD = rdd
          .map(_.split(","))
          .map(line => Row(line(0).toInt, line(1), line(2).toInt))
        val structType = StructType(Array(
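The excerpt is cut off at the schema definition; the sketch below shows the same programmatic-schema pattern end to end. The field names (id, name, age) and the sample rows are assumptions, since the original infos.txt layout is not shown, but the types match the Row built above (int, string, int).

    // Complete sketch of RDD ==> DataFrame with an explicit StructType; names are assumed.
    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    object DataFrameRDDSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("DataFrameRDDSketch")
          .master("local[2]")
          .getOrCreate()

        // Stand-in for the textFile + split step in the snippet above.
        val rowRDD = spark.sparkContext
          .parallelize(Seq("1,zhangsan,30", "2,lisi,25"))
          .map(_.split(","))
          .map(line => Row(line(0).toInt, line(1), line(2).toInt))

        // Schema matching the Row layout: int, string, int.
        val structType = StructType(Array(
          StructField("id", IntegerType, nullable = true),
          StructField("name", StringType, nullable = true),
          StructField("age", IntegerType, nullable = true)))

        val infoDF = spark.createDataFrame(rowRDD, structType)
        infoDF.show()

        spark.stop()
      }
    }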

How can I write NULL value to parquet using org.apache.parquet.hadoop.ParquetWriter?

Submitted by 谁说胖子不能爱 on 2020-01-16 16:50:30
Question: I have a tool that uses org.apache.parquet.hadoop.ParquetWriter to convert CSV data files to Parquet data files. I can write the basic primitive types just fine (INT32, DOUBLE, BINARY string). I need to write NULL values, but I do not know how. I've tried simply writing null with ParquetWriter, and it throws an exception. How can I write NULL using org.apache.parquet.hadoop.ParquetWriter? Is there a nullable type? The code, I believe, is self-explanatory:

    ArrayList<Type> fields = new ArrayList<>(
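A hedged sketch of the usual pattern with parquet-mr's example Group API (the schema, output path, and field names below are invented): declare the nullable column as OPTIONAL and simply don't add a value for rows where it is null.

    // Sketch only; ExampleParquetWriter is the example writer shipped with parquet-mr.
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.example.data.simple.SimpleGroupFactory
    import org.apache.parquet.hadoop.example.ExampleParquetWriter
    import org.apache.parquet.schema.MessageTypeParser

    object OptionalFieldWriter {
      def main(args: Array[String]): Unit = {
        // "optional" (rather than "required") is what makes NULL representable.
        val schema = MessageTypeParser.parseMessageType(
          """message record {
            |  required int32 id;
            |  optional binary name (UTF8);
            |}""".stripMargin)

        val factory = new SimpleGroupFactory(schema)
        val writer = ExampleParquetWriter.builder(new Path("/tmp/nulls.parquet"))
          .withType(schema)
          .build()

        val withValue = factory.newGroup()
        withValue.add("id", 1)
        withValue.add("name", "alice")
        writer.write(withValue)

        val withNull = factory.newGroup()
        withNull.add("id", 2)
        // no add() for "name": the optional field is written as null for this row
        writer.write(withNull)

        writer.close()
      }
    }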