parquet

Spark 2.0 deprecates 'DirectParquetOutputCommitter', how to live without it?

Submitted by 北城余情 on 2019-12-03 04:25:20
Question: Recently we migrated from "EMR on HDFS" to "EMR on S3" (EMRFS with consistent view enabled) and realized that Spark 'saveAsTable' (parquet format) writes to S3 were ~4x slower than on HDFS, but we found a workaround of using the DirectParquetOutputCommitter [1] with Spark 1.6. Reason for the S3 slowness: we had to pay the so-called Parquet tax [2], where the default output committer writes to a temporary location and renames it later, and the rename operation in S3 is very expensive. Also we
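For reference, a minimal sketch of how that workaround is typically wired up in Spark 1.6 (the committer's fully qualified class name moved between 1.x releases, so treat the one below as an assumption to verify against your build; the class was removed entirely in Spark 2.0, and the paths are hypothetical):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("direct-committer-example")
# Speculative tasks are unsafe with a direct committer: duplicate attempts
# write straight to the destination with no rename step to arbitrate.
conf.set("spark.speculation", "false")

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Write parquet files directly to their final S3 location instead of a
# _temporary directory followed by an expensive S3 rename.
sqlContext.setConf("spark.sql.parquet.output.committer.class",
                   "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")

df = sqlContext.read.parquet("s3://my-bucket/input/")           # hypothetical input
df.write.mode("overwrite").parquet("s3://my-bucket/output/")    # hypothetical output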

Impala Test Report

Submitted by 冷眼眸甩不掉的悲伤 on 2019-12-03 04:23:35
Machine environment: 4 slave nodes
10.200.187.86 cslave1, 4 cores, 3 GB RAM
10.200.187.87 cslave2, 2 cores, 4 GB RAM
10.200.187.88 cslave3, 2 cores, 4 GB RAM
10.200.187.89 cslave4, 2 cores, 6 GB RAM
Test results: (benchmark chart omitted)
Summary:
1. With sufficient memory and simple SQL, Impala is far more efficient than Hive: simple queries over millions of rows finish in a few seconds, sometimes under one second.
2. Impala performance depends heavily on the storage format: at the million-row scale, text format and HBase differ by more than 10x; at the ten-million-row scale, parquet and text differ by roughly 100x.
3. With the current cluster configuration, Impala joins at the million-row scale are somewhat faster than Hive (3-4x), but at the ten-million-row scale Impala's large-table joins fail with out-of-memory errors.
4. With parquet (columnar) storage, Impala queries that select a subset of columns with a WHERE condition are very efficient.
Issues: The vendor states that recent Impala releases are production-ready, but industry feedback reports many problems, chiefly out-of-memory failures. The official recommendation is 128 GB of memory per Impala node.
Recommended usage: Deploy in production for operations work and simple data-lookup queries, where it is very efficient. It does consume a fair amount of memory, so complex SQL such as large-table joins is not recommended. As the results above show, Impala's interactive queries are a large performance improvement over Hive.

How to read a nested collection in Spark

Submitted by 不羁岁月 on 2019-12-03 04:22:45
Question: I have a parquet table with one of the columns being array<struct<col1,col2,..colN>>. I can run queries against this table in Hive using the LATERAL VIEW syntax. How do I read this table into an RDD, and more importantly, how do I filter, map, etc. this nested collection in Spark? I could not find any references to this in the Spark documentation. Thanks in advance for any information! P.S. I thought it might be helpful to give some stats on the table: number of columns in the main table ~600, number of rows ~200m.
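A common way to get the LATERAL VIEW behaviour in Spark is to explode the nested array column; the following is a minimal PySpark sketch where the path and the column names (nested, col1, col2) are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, col

spark = SparkSession.builder.appName("nested-example").getOrCreate()

df = spark.read.parquet("/path/to/table")        # hypothetical path

# explode() emits one row per array element, much like LATERAL VIEW explode() in Hive
exploded = df.select(explode(col("nested")).alias("item"))

# struct fields are then addressable with dot notation; filter/map as usual
result = (exploded
          .select(col("item.col1").alias("col1"), col("item.col2").alias("col2"))
          .filter(col("col1").isNotNull()))

# the same data is reachable on the RDD side as Row objects
rows = df.rdd.flatMap(lambda r: r["nested"] or [])

result.show()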

Big Data - SparkSQL

Submitted by 狂风中的少年 on 2019-12-03 03:43:40
SparkSQL uses the Spark-on-Hive model: Hive is only responsible for data storage, while Spark parses and executes the SQL commands. SparkSQL is built on Dataset, a distributed data container that holds both the raw data and its metadata (schema). Under the hood, a Dataset wraps an RDD; an RDD of Row is a Dataset<Row>, i.e. a DataFrame. Dataset data sources include: json, JDBC, hive, parquet, hdfs, hbase, avro...
API
Built-in API: Dataset ships with its own set of API calls for manipulating data; the processing logic mirrors the equivalent SQL.
// ds is a Dataset<Row> with the columns: age, name
// select name from table
ds.select(ds.col("name")).show();
// select name, age+10 as addage from table
ds.select(ds.col("name"), ds.col("age").plus(10).alias("addage")).show();
// select name, age from table where age > 19
ds.select(ds.col("name"), ds.col("age")).where(ds.col("age").gt(19)).show();
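For a self-contained, runnable version, the same three queries can be expressed with PySpark; this is a minimal sketch where the source path and column names are assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("dataset-api-example")
         .enableHiveSupport()
         .getOrCreate())

ds = spark.read.parquet("/path/to/people")   # hypothetical source with name, age columns

# select name from table
ds.select(col("name")).show()

# select name, age + 10 as addage from table
ds.select(col("name"), (col("age") + 10).alias("addage")).show()

# select name, age from table where age > 19
ds.select(col("name"), col("age")).where(col("age") > 19).show()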

Creating hive table using parquet file metadata

Submitted anonymously (unverified) on 2019-12-03 02:50:02
Question: I wrote a DataFrame out as a parquet file, and I would like to read the file into Hive using the metadata from parquet. Output from writing the parquet:
_common_metadata
part-r-00000-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet
part-r-00002-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet
_SUCCESS
_metadata
part-r-00001-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet
part-r-00003-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet
Hive table:
CREATE TABLE testhive ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
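One commonly suggested approach is to let Spark read the parquet footer and generate the Hive DDL from the inferred schema; this is a minimal sketch, assuming a hypothetical /path/to/parquet directory and that the Spark type names map cleanly onto Hive types (true for the usual primitives and struct/array types):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

path = "/path/to/parquet"        # hypothetical location of the files above
df = spark.read.parquet(path)    # schema is taken from the parquet metadata

# Turn each field into "name type" using Spark's SQL-style type names
columns = ",\n  ".join("{} {}".format(f.name, f.dataType.simpleString())
                       for f in df.schema.fields)

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS testhive (
  {}
)
STORED AS PARQUET
LOCATION '{}'
""".format(columns, path)

spark.sql(ddl)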

How to read a list of parquet files from S3 as a pandas dataframe using pyarrow?

Submitted anonymously (unverified) on 2019-12-03 02:47:02
Question: I have a hacky way of achieving this using boto3 (1.4.4), pyarrow (0.4.1) and pandas (0.20.3). First, I can read a single parquet file locally like this:
import pyarrow.parquet as pq
path = 'parquet/part-r-00000-1e638be4-e31f-498a-a359-47d017a0059c.gz.parquet'
table = pq.read_table(path)
df = table.to_pandas()
I can also read a directory of parquet files locally like this:
import pyarrow.parquet as pq
dataset = pq.ParquetDataset('parquet/')
table = dataset.read()
df = table.to_pandas()
Both work like a charm. Now I want to achieve the same
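The approach that usually gets suggested for the S3 part is to hand pyarrow an s3fs filesystem; this is a minimal sketch, assuming the s3fs package is installed, a recent enough pyarrow to accept the filesystem argument (newer than the 0.4.1 mentioned above), and a hypothetical bucket/prefix:

import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()   # picks up AWS credentials the same way boto3 does

# Point the dataset at the S3 prefix instead of a local directory
dataset = pq.ParquetDataset("my-bucket/path/to/parquet/", filesystem=fs)
table = dataset.read()
df = table.to_pandas()

# A single object can be read through a file-like handle as well
with fs.open("my-bucket/path/to/parquet/part-00000.parquet") as f:   # hypothetical key
    single_df = pq.read_table(f).to_pandas()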

Parquet vs ORC vs ORC with Snappy

Submitted anonymously (unverified) on 2019-12-03 02:44:02
Question: I am running a few tests on the storage formats available with Hive, using Parquet and ORC as the major options. I included ORC once with default compression and once with Snappy. I have read many documents stating that Parquet is better than ORC in time/space complexity, but my tests are the opposite of the documents I went through. Here are some details of my data:
Table A - Text file format - 2.5 GB
Table B - ORC - 652 MB
Table C - ORC with Snappy - 802 MB
Table D - Parquet - 1.9 GB
Parquet was worst as far as compression for my table is
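When comparing formats like this, on-disk size depends as much on the compression codec as on the format itself, so it helps to pin the codec explicitly when writing. A minimal PySpark sketch of one way to produce the three columnar tables above (source table, paths and codec choices are assumptions; check which codecs your Spark/Hive build supports):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.table("table_a_text")     # hypothetical source table

# ORC with its default codec (zlib in most builds)
df.write.format("orc").mode("overwrite").save("/warehouse/table_b_orc")

# ORC with snappy
df.write.format("orc").option("compression", "snappy") \
    .mode("overwrite").save("/warehouse/table_c_orc_snappy")

# Parquet; the default codec differs across versions (gzip vs snappy),
# so set it explicitly to keep the comparison apples to apples
df.write.format("parquet").option("compression", "gzip") \
    .mode("overwrite").save("/warehouse/table_d_parquet")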

Spark DataFrames with Parquet and Partitioning

Submitted anonymously (unverified) on 2019-12-03 02:26:02
Question: I have not been able to find much information on this topic, but let's say we use a dataframe to read in a parquet file that is 10 blocks; Spark will naturally create 10 partitions. But when the dataframe reads in the file to process it, won't it be processing a large data-to-partition ratio? If it were processing the file uncompressed, the block size would have been much larger, meaning the partitions would be larger as well. So let me clarify: parquet compressed (these numbers are not fully accurate). 1 GB parquet = 5 blocks = 5 partitions
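One way to reason about this empirically is to ask Spark how many partitions it actually planned and then adjust; a minimal sketch (path and partition count are assumptions, and spark.sql.files.maxPartitionBytes applies to Spark 2.x):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/path/to/1gb_parquet")   # hypothetical compressed input

# Number of input partitions Spark derived from the file splits
print(df.rdd.getNumPartitions())

# The planner splits scans by this byte size (compressed bytes), which is
# the knob that controls the data-to-partition ratio at read time
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

# If decompressed rows end up too large per partition, repartition explicitly
df = df.repartition(40)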

Using Spark to write a parquet file to s3 over s3a is very slow

Submitted anonymously (unverified) on 2019-12-03 01:58:03
Question: I'm trying to write a parquet file out to Amazon S3 using Spark 1.6.1. The small parquet file that I'm generating is ~2 GB once written, so it's not that much data. I'm trying to prove out Spark as a platform that I can use. Basically I'm setting up a star schema with dataframes, and then I'm going to write those tables out to parquet. The data comes in from CSV files provided by a vendor and I'm using Spark as an ETL platform. I currently have a 3-node cluster in EC2 (r3.2xlarge), so 120 GB of memory on the executors and 16 cores total.
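The usual suspects with slow s3a writes are the rename-based commit and speculative task attempts, so a commonly suggested set of tunings looks like the sketch below. This is not a guaranteed fix and the exact effect depends on the Hadoop/Spark versions; the session-builder style is Spark 2.x (with 1.6 the same keys go on the SparkConf/SQLContext), and the paths are hypothetical:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("s3a-write-example")
         # v2 commit algorithm renames task output once instead of twice
         .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
         # speculative tasks multiply the S3 writes and renames
         .config("spark.speculation", "false")
         # skip the _SUCCESS marker and schema-merge work on the write path
         .config("spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
         .config("spark.sql.parquet.mergeSchema", "false")
         .getOrCreate())

df = spark.read.csv("/path/to/vendor_csvs", header=True)   # hypothetical input
df.write.mode("overwrite").parquet("s3a://my-bucket/star_schema/fact_table")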

How can I insert into a hive table with parquet fileformat and SNAPPY compression?

Submitted anonymously (unverified) on 2019-12-03 01:37:02
Question: Hive 2.1. I have the following table definition:
CREATE EXTERNAL TABLE table_snappy (
  a STRING,
  b INT)
PARTITIONED BY (c STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/'
TBLPROPERTIES ('parquet.compress'='SNAPPY');
Now, I would like to insert data into it:
INSERT INTO table_snappy PARTITION (c='something') VALUES ('xyz', 1);
However,
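Note that the parquet writer's property is usually spelled 'parquet.compression' rather than 'parquet.compress', which is worth double-checking in the DDL above. If the insert is driven from Spark rather than the Hive CLI, one way to get SNAPPY-compressed files into this table is to set the codec on the session and use insertInto; a minimal sketch (the sample row is hypothetical, and whether the codec setting is honoured depends on Spark using its native parquet writer for the Hive table):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Write snappy-compressed parquet from Spark's side
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

# Allow the partition value to come from the data itself
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# One row matching (a STRING, b INT) partitioned by (c STRING);
# the partition column must be the last column of the DataFrame
df = spark.createDataFrame([("xyz", 1, "something")], ["a", "b", "c"])
df.write.insertInto("table_snappy")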