parquet

How to set Parquet file encoding in Spark

The Parquet documentation describes a few different encodings here. Does the encoding change somehow inside the file during read/write, or can I set it? There is nothing about it in the Spark documentation; I only found slides from a talk by Ryan Blue of the Netflix team. He sets Parquet configurations on the sqlContext: sqlContext.setConf("parquet.filter.dictionary.enabled", "true"). But that looks like it is not about plain dictionary encoding in Parquet files. I then found an answer to my question on the Twitter engineering blog: Parquet enables dictionary encoding automatically when the number of unique values in a column is < 10^5. Here is a post announcing
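As a minimal PySpark sketch of setting Parquet writer options (the property name parquet.enable.dictionary is the standard parquet-mr dictionary switch, and the output path and the internal hadoopConfiguration handle are assumptions here, not the poster's code):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Parquet writer properties can be passed through the Hadoop configuration;
# dictionary encoding is already on by default, this just makes it explicit.
spark.sparkContext._jsc.hadoopConfiguration().set("parquet.enable.dictionary", "true")

# Columns with a small number of distinct values are then dictionary-encoded
# automatically when the file is written.
spark.range(1000).selectExpr("id % 10 AS category", "id AS value") \
    .write.mode("overwrite").parquet("/tmp/dict_encoded.parquet")
```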

Spark's int96 time type

When you create a timestamp column in Spark and save it to Parquet, you get a 12-byte integer column type (int96); I gather the data is split into 6 bytes for the Julian day and 6 bytes for nanoseconds within the day. This does not conform to any Parquet logical type. The schema in the Parquet file does not, then, give any indication of the column being anything but an integer. My question is: how does Spark know to load such a column as a timestamp rather than a big integer? The semantics are determined from the metadata. We'll need some imports: import org.apache.parquet.hadoop.ParquetFileReader
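The answer inspects the file footer with org.apache.parquet.hadoop.ParquetFileReader in Scala; as a rough Python-side sketch of the same idea (file name assumed), pyarrow can show the key-value metadata where Spark stores its logical schema as JSON:

```python
import pyarrow.parquet as pq

# Read only the footer; no row data is loaded.
meta = pq.read_metadata("spark_output.parquet")

# Spark writes its own schema into the footer's key-value metadata, so the
# timestamp type is recorded there even though the physical column is int96.
kv = meta.metadata or {}
print(kv.get(b"org.apache.spark.sql.parquet.row.metadata"))
```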

Convert Parquet to CSV

How can I convert Parquet to CSV on a local file system (e.g. with Python and some library) but WITHOUT Spark? (I am trying to find as simple and minimal a solution as possible, because everything needs to be automated and resources are limited.) I tried parquet-tools on my Mac, but the data output did not look correct. The output needs to be such that when data is missing in some columns, the CSV has a corresponding NULL (an empty field between two commas). Thanks. You can do this with the Python packages pandas and pyarrow (pyarrow is an optional dependency of pandas that you need for this feature).
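A minimal sketch of that pandas/pyarrow route (file names are made up); missing values come out as empty fields between commas:

```python
import pandas as pd

# read_parquet needs pyarrow (or fastparquet) installed as the engine.
df = pd.read_parquet("input.parquet")

# NaN/None values are written as empty fields, i.e. nothing between two commas.
df.to_csv("output.csv", index=False)
```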

A brief introduction to Parquet (repost)

Original article: the Parquet columnar storage format. Parquet is a columnar storage format for analytical workloads, developed jointly by Twitter and Cloudera; in May 2015 it graduated from the Apache Incubator and became an Apache top-level project. Columnar storage: what advantages does columnar storage have over row-based storage? It can skip data that does not match the query conditions and read only the data that is needed, reducing IO volume. Compression encodings can reduce disk space: because all values in a column share the same data type, more efficient encodings (for example Run Length Encoding and Delta Encoding) can be used to save further space. Reading only the required columns also enables vectorized operations and better scan performance. At the time, Twitter's daily data growth reached 100TB+ after compression, stored on HDFS, and engineers used multiple compute frameworks (for example MapReduce, Hive, Pig) to analyze and mine this data. The log structure was a complex nested data type; a typical log schema had 87 columns nested 7 levels deep. So a columnar storage format was needed that could support relational data (simple data types) as well as complex nested data types, while working with multiple data processing frameworks. For relational data, columnar storage can simply lay the values of each column out in sequence, without introducing any other concepts and without losing data. Columnar storage of relational data is easy to understand; columnar storage of nested data types, however, runs into some difficulties. As shown in Figure 1, a row of nested data is called a record
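To make the nested case concrete, here is a small, hypothetical pyarrow sketch (names and file path made up) that writes records with a nested struct field; each leaf field becomes its own Parquet column:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Two records with a nested struct column, a tiny analogue of the nested log
# schemas described above.
table = pa.table({
    "user": [
        {"name": "a", "address": {"city": "x"}},
        {"name": "b", "address": {"city": "y"}},
    ],
    "count": [1, 2],
})
pq.write_table(table, "nested.parquet")

# The Parquet schema flattens the struct into leaf columns
# (user.name, user.address.city, count), each stored separately.
print(pq.read_schema("nested.parquet"))
```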

An introduction to Parquet and basic usage (repost)

==> What is Parquet? Parquet is a columnar storage file format. ==> Description from the official site: Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. ==> Origin: Parquet was inspired by the Dremel paper published by Google in 2010, which describes a storage format that supports nested structures and uses columnar storage to improve query performance. The Dremel paper also describes how Google used this format to run parallel queries; if you are interested, see the paper and the open-source implementation Apache Drill. ==> Features: ---> It can skip data that does not match the query conditions and read only the data that is needed, reducing IO volume. ---> Compression encodings can reduce disk space (because all values in a column share the same data type, more efficient encodings such as Run Length Encoding and Delta Encoding can be used to save further space). --->

Apache Spark Parquet: Cannot build an empty group

I use Apache Spark 2.1.1 (used 2.1.0 and it was the same, switched today). I have a dataset:
root
 |-- muons: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- reco::Candidate: struct (nullable = true)
 |    |    |-- qx3_: integer (nullable = true)
 |    |    |-- pt_: float (nullable = true)
 |    |    |-- eta_: float (nullable = true)
 |    |    |-- phi_: float (nullable = true)
 |    |    |-- mass_: float (nullable = true)
 |    |    |-- vertex_: struct (nullable = true)
 |    |    |    |-- fCoordinates: struct (nullable = true)
 |    |    |    |    |-- fX: float (nullable = true)
 |    |    |    |    |-- fY: float (nullable = true)
 |    |    |    |    |-

Differences between Parquet and SEQUENCEFILE in Hive

TEXTFILE and SEQUENCEFILE are both row-oriented storage formats, and SEQUENCEFILE is stored as a binary file; ORC and PARQUET are column-oriented. ORC is columnar storage, RC is row-oriented storage. Contents: Overview; Hive file storage formats fall into the following categories: 1. TEXTFILE; 2. SEQUENCEFILE; 3. The RCFile format (overview and history, using RCFile, pros and cons of row-based storage, pros and cons of column-based storage, source code analysis: 1. Writer, 2. append, RCFile's index mechanism, the logic of flushRecords, RCFile's sync mechanism, the RCFile close process, data reading and lazy decompression, row group size); 4. The ORC file format (advantages of the ORC File format, design ideas, stripe structure, how to use ORCFile in Hive); 5. The Parquet file format (overview, the Parquet data model, the Parquet file structure, Definition Level, Repetition Level, Metadata). Overview: 1. Hive file storage formats include the following categories: TEXTFILE, SEQUENCEFILE, RCFILE, ORCFILE, Parquet. TEXTFILE is the default format; if no format is specified when a table is created, it is used, and when data is imported the data files are simply copied to HDFS without any processing. sequencefile, rcfile
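As a small illustration of choosing a storage format at table creation time (table and column names here are made up, and the SQL is run through PySpark only for convenience), the STORED AS clause selects between the formats listed above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# TEXTFILE is Hive's default; spelling it out keeps the example explicit.
spark.sql("CREATE TABLE IF NOT EXISTS logs_text (id INT, msg STRING) STORED AS TEXTFILE")

# Columnar formats are chosen explicitly.
spark.sql("CREATE TABLE IF NOT EXISTS logs_parquet (id INT, msg STRING) STORED AS PARQUET")
spark.sql("CREATE TABLE IF NOT EXISTS logs_orc (id INT, msg STRING) STORED AS ORC")
```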

Partitions not being pruned in simple SparkSQL queries

I'm trying to efficiently select individual partitions from a SparkSQL table (parquet in S3). However, I see evidence of Spark opening all parquet files in the table, not just those that pass the filter. This makes even small queries expensive for tables with large numbers of partitions. Here's an illustrative example. I created a simple partitioned table on S3 using SparkSQL and a Hive metastore:
# Make some data
df = pandas.DataFrame({'pk': ['a']*5+['b']*5+['c']*5,
                       'k': ['a', 'e', 'i', 'o', 'u']*3,
                       'v': range(15)})
# Convert to a SparkSQL DataFrame
sdf = hiveContext.createDataFrame(df)
# And
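A hypothetical sketch of how the check can be done with plain partitioned Parquet paths (the S3 path is made up, and a reasonably recent Spark version is assumed for the plan output): write the data partitioned by pk, filter on pk, and look at the physical plan to see which partitions will be scanned.

```python
# Write the example data partitioned by 'pk'; one directory per value of pk.
sdf.write.partitionBy("pk").parquet("s3://bucket/table")

# Read it back and filter on the partition column.
back = hiveContext.read.parquet("s3://bucket/table").filter("pk = 'b'")

# The physical plan should show that only the pk=b directory is scanned
# (recent Spark versions print a PartitionFilters entry for this).
back.explain(True)
```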

Error using Spark: 'save' does not support bucketing right now

Question: I have a DataFrame that I am trying to partition by a column, sort by that column, and save in Parquet format using the following command:
df.write().format("parquet")
  .partitionBy("dynamic_col")
  .sortBy("dynamic_col")
  .save("test.parquet");
I get the following error:
reason: User class threw exception: org.apache.spark.sql.AnalysisException: 'save' does not support bucketing right now;
Is save(...) not allowed? Is only saveAsTable(...) allowed, which saves the data to Hive? Any suggestions
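A hedged sketch of the two usual ways around this (PySpark syntax, column name taken from the question, bucket count arbitrary): either drop sortBy and use a plain save, or keep sortBy together with bucketBy and write a metastore table with saveAsTable, since bucketing information has to live in a table definition:

```python
# Option 1: drop sortBy; with partitionBy, each output directory holds a single
# value of dynamic_col anyway, so sorting by it would add nothing.
(df.write.format("parquet")
   .partitionBy("dynamic_col")
   .save("test.parquet"))

# Option 2: keep sortBy, which requires bucketBy and a metastore table,
# hence saveAsTable() instead of save().
(df.write.format("parquet")
   .bucketBy(4, "dynamic_col")
   .sortBy("dynamic_col")
   .saveAsTable("test_table"))
```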

Build failure - Apache Parquet-MR source (mvn install failure)

I am getting the following error while trying to run "mvn clean install" to build the parquet-mr source obtained from https://github.com/apache/parquet-mr
[INFO] Storing buildScmBranch: UNKNOWN
[INFO]
[INFO] --- maven-remote-resources-plugin:1.5:process (default) @ parquet-generator ---
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Apache Parquet MR ................................. SUCCESS [1.494s]
[INFO] Apache Parquet Generator .......................... FAILURE [0.064s]
[INFO] Apache Parquet Common ....................