parquet

Cloudera 5.6: Parquet does not support date. See HIVE-6384

最后都变了 - Submitted on 2019-12-05 03:43:35
I am currently using Cloudera 5.6 and trying to create a Parquet-format table in Hive based off another table, but I am running into an error.

create table sfdc_opportunities_sandbox_parquet like sfdc_opportunities_sandbox STORED AS PARQUET

Error message:

Parquet does not support date. See HIVE-6384

I read that Hive 1.2 has a fix for this issue, but Cloudera 5.6 and 5.7 do not come with Hive 1.2. Has anyone found a way around this issue?

Apart from using another data type like TIMESTAMP or another storage format like ORC, there might be no way around it if there is a dependency to the used
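
A possible workaround, sketched below under the assumption that the DATE column can be carried as TIMESTAMP (the column names are made up for illustration): declare the Parquet copy explicitly instead of using CREATE TABLE ... LIKE, and cast on insert. Shown through Spark SQL with Hive support; the same DDL also works in Hive directly.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Re-declare the table with TIMESTAMP where the source uses DATE
# (column names here are hypothetical).
spark.sql("""
    CREATE TABLE sfdc_opportunities_sandbox_parquet (
        opportunity_id STRING,
        close_date     TIMESTAMP
    )
    STORED AS PARQUET
""")

# Populate it, casting the DATE column to TIMESTAMP.
spark.sql("""
    INSERT INTO TABLE sfdc_opportunities_sandbox_parquet
    SELECT opportunity_id, CAST(close_date AS TIMESTAMP)
    FROM sfdc_opportunities_sandbox
""")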

Spark Learning (2): Spark SQL

谁说我不能喝 - Submitted on 2019-12-05 03:06:00
What is Spark SQL? Spark SQL is Spark's module for processing structured data. It provides a programming abstraction called DataFrame and acts as a distributed SQL query engine: Spark SQL is translated into RDD operations and then submitted to the cluster for execution, which runs very fast.

1) Easy to integrate
2) Uniform data access
3) Compatible with Hive
4) Standard data connectivity

Spark SQL can be seen as a translation layer: downward it plugs into all kinds of structured data sources, and upward it offers different ways to access the data.

RDD vs. DataFrame vs. Dataset

RDD: its weakness is performance. An RDD is a collection of objects resident in JVM memory, which brings GC pressure and growing Java serialization cost as the data grows; it cannot be manipulated with SQL, so you have to work out the processing yourself.

DataFrame: a DataFrame is closer to a two-dimensional table in a traditional database; besides the data it also records the structure of the data, i.e. the schema. Beyond providing richer operators than RDD, its more important traits are better execution efficiency, fewer data reads and optimized execution plans (e.g. filter push-down and column pruning), lazy execution, and customized memory management (DataFrame data is kept off-heap in binary form). Its drawback is the lack of compile-time type-safety checks.

Dataset: has both compile-time type-safety checks and the DataFrame's query optimizations.

DataFrame/Dataset to RDD:

val rdd1 = testDF.rdd
val rdd2 = testDS.rdd

RDD to
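
A minimal PySpark sketch of the conversions mentioned at the end (the Dataset API is Scala/Java only, so only the RDD <-> DataFrame round trip is shown; the names testDF and rdd1 mirror the snippet above):

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("rdd-df-conversions").getOrCreate()

# RDD -> DataFrame
rdd = spark.sparkContext.parallelize([Row(name="a", value=1), Row(name="b", value=2)])
testDF = spark.createDataFrame(rdd)

# DataFrame -> RDD (elements come back as pyspark.sql.Row objects)
rdd1 = testDF.rdd
print(rdd1.collect())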

Spark Exception : Task failed while writing rows

一世执手 - Submitted on 2019-12-04 23:57:56
Question: I am reading text files and converting them to Parquet files. I am doing it with Spark code, but when I try to run the code I get the following exception:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in stage 1.0 (TID 9, XXXX.XXX.XXX.local): org.apache.spark.SparkException: Task failed while writing rows.
    at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources
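
For context, a minimal sketch of the kind of text-to-Parquet job the question describes (paths and the comma-separated layout are assumptions, not the asker's actual code); the "Task failed while writing rows" exception usually wraps a more specific cause further down the stack trace:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("text-to-parquet").getOrCreate()

lines = spark.read.text("hdfs:///input/textfiles")   # hypothetical input path
parsed = lines.selectExpr(
    "split(value, ',')[0] AS id",
    "split(value, ',')[1] AS payload",
)
parsed.write.mode("overwrite").parquet("hdfs:///output/parquet")  # hypothetical output path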

load parquet file and keep same number hdfs partitions

泪湿孤枕 - Submitted on 2019-12-04 23:14:38
I have a parquet file /df saved in HDFS with 120 partitions. The size of each partition on HDFS is around 43.5 M.

hdfs dfs -du -s -h /df
5.1 G  15.3 G  /df

hdfs dfs -du -h /df
43.6 M  130.7 M  /df/pid=0
43.5 M  130.5 M  /df/pid=1
...
43.6 M  130.9 M  /df/pid=119

I want to load that file into Spark and keep the same number of partitions. However, Spark will automatically load the file into 60 partitions.

df = spark.read.parquet('df')
df.rdd.getNumPartitions()
60

HDFS settings: 'parquet.block.size' is not set; sc._jsc.hadoopConfiguration().get('parquet.block.size') returns nothing. 'dfs
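
Two common ways to get back to 120 partitions, sketched under the assumption that Spark coalesced the ~43.5 M files because spark.sql.files.maxPartitionBytes (default 128 MB) allows more than one file per split:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("keep-partitions").getOrCreate()

# Option 1: repartition after reading (here by the existing pid column).
df = spark.read.parquet("/df").repartition(120, "pid")

# Option 2: lower the split size so each file lands in its own partition
# (45 MB is a guess sized just above one file).
spark.conf.set("spark.sql.files.maxPartitionBytes", str(45 * 1024 * 1024))
spark.conf.set("spark.sql.files.openCostInBytes", str(45 * 1024 * 1024))
df2 = spark.read.parquet("/df")
print(df2.rdd.getNumPartitions())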

Comparison of Hive storage formats

无人久伴 - Submitted on 2019-12-04 21:01:13
Apache Hive supports several file formats familiar from Apache Hadoop, such as TextFile, RCFile, SequenceFile, AVRO, ORC and Parquet. Cloudera Impala also supports these formats. When creating a table, use STORED AS (TextFile|RCFile|SequenceFile|AVRO|ORC|Parquet) to specify the storage format.

TextFile: each line is one record, terminated by a newline (\n). The data is not compressed, so disk usage and parsing overhead are high. It can be combined with Gzip or Bzip2 (Hive detects these automatically and decompresses while running queries), but in that case Hive cannot split the data, so it cannot process it in parallel.

SequenceFile: a binary file format provided by the Hadoop API. It is easy to use, splittable and compressible, with three compression options: NONE, RECORD and BLOCK. RECORD compression has a low compression ratio, so BLOCK compression is generally recommended.

RCFile: a hybrid row/column storage format. First, the data is partitioned into row groups, which guarantees that a single record sits in one block and avoids reading multiple blocks to fetch one record. Second, within a block the data is stored column-wise, which helps compression and fast column access.

AVRO: an open-source project that provides data serialization and data exchange for Hadoop. You can exchange data between the Hadoop ecosystem and programs written in any programming language
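
A short sketch of the STORED AS clause described above, run through Spark SQL with Hive support (table and column names are made up; the same DDL works directly in Hive):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# One demo table per storage format, just to show the syntax.
for fmt in ["TEXTFILE", "SEQUENCEFILE", "RCFILE", "AVRO", "ORC", "PARQUET"]:
    spark.sql(f"""
        CREATE TABLE IF NOT EXISTS demo_{fmt.lower()} (
            id   BIGINT,
            name STRING
        )
        STORED AS {fmt}
    """)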

How to control the number of output part files created by Spark job upon writing?

限于喜欢 - Submitted on 2019-12-04 17:40:51
Hi, I have a couple of Spark jobs that process thousands of files every day. File sizes may vary from MBs to GBs. After finishing a job I usually save with the following code:

finalJavaRDD.saveAsParquetFile("/path/in/hdfs");

OR

dataFrame.write.format("orc").save("/path/in/hdfs") // storing as an ORC file as of Spark 1.4

The Spark job creates plenty of small part files in the final output directory. As far as I understand, Spark creates one part file per partition/task; please correct me if I am wrong. How do we control the number of part files Spark creates? Finally I would like to create a Hive table using
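
The number of part files equals the number of partitions of the dataset being written, so the usual fix is to reduce partitions just before the write. A hedged sketch (the target of 16 files and the paths are arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/path/of/input")   # hypothetical input

# coalesce() avoids a full shuffle; use repartition() instead if the
# remaining partitions should be evenly balanced.
df.coalesce(16).write.mode("overwrite").parquet("/path/in/hdfs")
# or: df.repartition(16).write.format("orc").save("/path/in/hdfs")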

How to execute a spark sql query from a map function (Python)?

孤街醉人 - Submitted on 2019-12-04 16:28:47
How does one execute Spark SQL queries from routines that are not the driver portion of the program?

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *

def doWork(rec):
    data = SQLContext.sql("select * from zip_data where STATEFP ='{sfp}' and COUNTYFP = '{cfp}' ".format(sfp=rec[0], cfp=rec[1]))
    for item in data.collect():
        print(item)
    # do something
    return (rec[0], rec[1])

if __name__ == "__main__":
    sc = SparkContext(appName="Some app")
    print("Starting some app")
    SQLContext = SQLContext(sc)
    parquetFile = SQLContext.read.parquet("/path/to/data/")
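
SparkContext and SQLContext only live on the driver, so they cannot be used inside a function shipped to executors (such as one passed to map). A hedged sketch of the usual alternative, expressing the per-record lookup as a join driven from the driver (the key pairs are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-instead-of-nested-sql").getOrCreate()

zip_data = spark.read.parquet("/path/to/data/")   # the data registered as zip_data

keys = spark.createDataFrame(
    [("06", "001"), ("06", "075")],               # hypothetical (STATEFP, COUNTYFP) pairs
    ["STATEFP", "COUNTYFP"],
)

result = keys.join(zip_data, on=["STATEFP", "COUNTYFP"], how="inner")
result.show()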

Parquet without Hadoop?

混江龙づ霸主 - Submitted on 2019-12-04 16:11:02
Question: I want to use Parquet in one of my projects as columnar storage, but I don't want to depend on the hadoop/hdfs libs. Is it possible to use Parquet outside of HDFS? Or what is the minimum dependency?

Answer 1: Investigating the same question, I found that apparently it's not possible for the moment. I found this git issue, which proposes decoupling parquet from the hadoop api; apparently it has not been done yet. In the Apache Jira I found an issue which asks for a way to read a parquet file outside hadoop
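
For what it is worth, on the Python side Parquet can already be used with no Hadoop installation at all, e.g. via pyarrow on the local filesystem. A minimal sketch (the answer above concerns the Java parquet-mr stack, so this is an alternative rather than the answer's approach):

import pyarrow as pa
import pyarrow.parquet as pq

# Write and read a Parquet file on the local filesystem, no Hadoop involved.
table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})
pq.write_table(table, "example.parquet")

roundtrip = pq.read_table("example.parquet")
print(roundtrip.to_pydict())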

Read Parquet files from HDFS using PyArrow

我的梦境 - Submitted on 2019-12-04 16:04:26
I know I can connect to an HDFS cluster via pyarrow using pyarrow.hdfs.connect(). I also know I can read a parquet file using pyarrow.parquet's read_table(). However, read_table() accepts a filepath, whereas hdfs.connect() gives me a HadoopFileSystem instance. Is it somehow possible to use just pyarrow (with libhdfs3 installed) to get hold of a parquet file/folder residing in an HDFS cluster? What I wish to get to is the to_pydict() function, then I can pass the data along.

Try

fs = pa.hdfs.connect(...)
fs.read_parquet('/path/to/hdfs-file', **other_options)

or

import pyarrow.parquet as pq
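
A sketch of where the truncated pyarrow.parquet route is presumably heading: read_table() accepts a filesystem argument, so the HadoopFileSystem from connect() can be passed alongside the path (connection arguments are hypothetical; newer pyarrow versions deprecate pa.hdfs.connect in favour of pyarrow.fs.HadoopFileSystem):

import pyarrow as pa
import pyarrow.parquet as pq

fs = pa.hdfs.connect(host="default", port=0)   # hypothetical connection settings
table = pq.read_table("/path/to/hdfs-file", filesystem=fs)
print(table.to_pydict())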

Spark Parquet file split

时间秒杀一切 - Submitted on 2019-12-04 14:02:58
When using spark + parquet in practice, we ran into two puzzling things: we had only one parquet file (smaller than the HDFS block size), yet Spark generated 4 tasks in a certain stage to process it; and of those 4 tasks only one actually processed any data, while the others processed nothing. Both questions come down to how Spark splits Parquet into partitions and which slice of data each partition handles.

The conclusion first: in Spark, Parquet is splittable; see ParquetFileFormat#isSplitable. Does that mean the data gets shredded? No, because Parquet is split with the row group as the smallest unit, which also means some partitions can end up with no data at all; in the extreme case where there is only one row group, no matter how many partitions there are, only one of them will hold data.

Now on to the source code tour:

Processing flow

1. Partitions are generated by splitting parquet files by size: in FileSourceScanExec#createNonBucketedReadRDD, if the file is splittable it is cut according to maxSplitBytes, and the resulting number is
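
A quick way to check the row-group layout that drives this behaviour is to inspect the file with pyarrow (a hedged sketch; the path is hypothetical). With a single row group, only one partition will ever carry data, however many splits Spark generates:

import pyarrow.parquet as pq

pf = pq.ParquetFile("/path/to/single-file.parquet")
print("row groups:", pf.num_row_groups)
for i in range(pf.num_row_groups):
    rg = pf.metadata.row_group(i)
    print(f"row group {i}: rows={rg.num_rows}, bytes={rg.total_byte_size}")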