parquet

Featured | A Brief Discussion of HIVE Optimization

蓝咒 submitted on 2020-01-14 15:09:59
Summary: HIVE is an excellent framework for data warehousing and interactive queries, but as data volumes grow, join complexity and performance problems take time and effort to resolve. Beyond optimizing HIVE itself, you can also bring in frameworks with better compute performance; the SparkSQL relational cache is transparent to users, so developers do not need to care about the underlying optimization logic and can put more energy into business design and development. Author: Deng Li, technical director at entobit, with eight years in big data, starting out on first-generation HADOOP, long focused on cloud computing applications, moving gradually from Amazon EMR and Alibaba Cloud EMR application development into big data architecture, with a deep understanding of the big data ecosystem and its frameworks.

Introduction: As business and operations colleagues run more and more HQL, overall HIVE execution efficiency drops. Starting from HIVE, this article analyzes the problems HQL faces and the parts that need optimization, and brings in other big data frameworks to solve practical problems. The content below does not offer optimization advice for business code.

Common HQL, SELECT queries: setting hive.fetch.task.conversion=none runs queries in cluster mode whether or not there is a LIMIT. For small data volumes, hive.fetch.task.conversion=more is recommended; a SELECT with LIMIT then executes on a single machine to fetch sample data, which is faster. Common SELECT with ORDER BY / GROUP BY and other basic operations are not covered here.
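
To make the fetch-task setting above concrete, here is a minimal sketch (not from the article) of toggling hive.fetch.task.conversion per session over Hive JDBC from Scala; the HiveServer2 URL, credentials, and table name are hypothetical, and the hive-jdbc driver is assumed to be on the classpath.

import java.sql.DriverManager

object FetchTaskConversionDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical HiveServer2 endpoint and table; adjust for your cluster.
    val conn = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default", "user", "")
    val stmt = conn.createStatement()
    // With "more", a simple SELECT ... LIMIT is served by a local fetch task,
    // so sampling a small amount of data avoids launching a cluster job.
    stmt.execute("SET hive.fetch.task.conversion=more")
    val rs = stmt.executeQuery("SELECT * FROM sample_table LIMIT 10")
    while (rs.next()) println(rs.getString(1))
    // With "none", the same query always runs in cluster mode, LIMIT or not.
    stmt.execute("SET hive.fetch.task.conversion=none")
    rs.close(); stmt.close(); conn.close()
  }
}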

How to control the number of output part files created by Spark job upon writing?

时间秒杀一切 submitted on 2020-01-13 05:46:09
Question: Hi, I have a couple of Spark jobs that process thousands of files every day. File sizes may vary from MBs to GBs. After a job finishes I usually save using the following code:

finalJavaRDD.saveAsParquetFile("/path/in/hdfs");
OR
dataFrame.write.format("orc").save("/path/in/hdfs") // storing as ORC file as of Spark 1.4

The Spark job creates plenty of small part files in the final output directory. As far as I understand, Spark creates a part file for each partition/task; please correct me if I am wrong.
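
One commonly suggested way to control the number of part files (a sketch, not necessarily the accepted answer to this question) is to reduce the number of partitions just before the write with coalesce or repartition; the paths and the target count of 16 are placeholders, and the Spark 2.x DataFrame API is used for brevity.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("control-part-files").getOrCreate()
val df = spark.read.parquet("/path/in/hdfs/input")

// coalesce merges partitions without a full shuffle; good for simply reducing file count.
df.coalesce(16).write.format("orc").save("/path/in/hdfs/output-orc")

// repartition shuffles but produces evenly sized partitions/files, useful for skewed input.
df.repartition(16).write.parquet("/path/in/hdfs/output-parquet")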

Spark Parquet Statistics(min/max) integration

蹲街弑〆低调 submitted on 2020-01-11 02:26:09
Question: I have been looking into how Spark stores statistics (min/max) in Parquet, as well as how it uses that info for query optimization. I have a few questions. First, the setup: Spark 2.1.0. The following sets up a DataFrame of 1000 rows, with a long-type and a string-type column; they are sorted by different columns, though.

scala> spark.sql("select id, cast(id as string) text from range(1000)").sort("id").write.parquet("/secret/spark21-sortById")
scala> spark.sql("select id, cast(id as string)
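
As a hedged side note (not part of the question itself), one quick way to see which Parquet predicates Spark attempts to push down is to inspect the physical plan; the path below matches the one written above.

// In spark-shell: read the sorted data back and look for "PushedFilters" in the plan.
val byId = spark.read.parquet("/secret/spark21-sortById")
byId.filter("id < 10").explain(true)
// The FileScan node lists pushed-down predicates, e.g. LessThan(id,10); whether row
// groups are actually skipped via min/max stats is decided inside the Parquet reader.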

Disable parquet metadata summary in Spark

假如想象 submitted on 2020-01-11 02:18:28
Question: I have a Spark job (on 1.4.1) receiving a stream of Kafka events. I would like to save them continuously as Parquet on Tachyon.

val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
lines.window(Seconds(1), Seconds(1)).foreachRDD { (rdd, time) =>
  if (rdd.count() > 0) {
    val mil = time.floor(Duration(86400000)).milliseconds
    hiveContext.read.json(rdd).toDF().write.mode(SaveMode.Append).parquet(s"tachyon://192.168.1.12:19998/persisted5$mil")
    hiveContext.sql(s"CREATE TABLE
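
For reference, a commonly cited way to stop Parquet from writing the _metadata/_common_metadata summary files on every append, in the Spark 1.x era this question targets, is the Hadoop-level parquet-mr property shown below (a sketch; verify the property name against your Parquet version).

// In spark-shell (Spark 1.x style): disable job-summary files before writing.
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
// Subsequent Parquet writes then produce only the part files, without the
// summary metadata that becomes expensive to rewrite on each append.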

Spark SQL Data Loading and Saving Explained with Examples

你。 submitted on 2020-01-10 14:28:32
I. Prerequisite knowledge. Spark SQL mainly operates on DataFrames, and DataFrame itself provides save and load operations. Load: creates a DataFrame. Save: writes the data in a DataFrame out to a file, with a concrete format specifying what type of file we want to read and what type of file we want to output.

II. Spark SQL read/write code in practice.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import java.util.ArrayList;
import java.util.List;

public class
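
The load/save pattern the Java example builds toward can be sketched in a few lines of Scala (a spark-shell session is assumed; the paths and formats here are placeholders, not from the original post).

// load creates a DataFrame from a source; save writes it back out in a chosen format.
val people = spark.read.format("json").load("/tmp/people.json")
people.write.format("parquet").mode("overwrite").save("/tmp/people.parquet")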

Multiple spark jobs appending parquet data to same base path with partitioning

 ̄綄美尐妖づ submitted on 2020-01-10 07:42:47
Question: I have multiple jobs that I want to execute in parallel, appending daily data into the same path using partitioning, e.g.:

dataFrame.write()
    .partitionBy("eventDate", "category")
    .mode(Append)
    .parquet("s3://bucket/save/path");

Job 1 - category = "billing_events"
Job 2 - category = "click_events"

Both of these jobs will truncate any existing partitions that exist in the s3 bucket prior to execution and then save the resulting parquet files to their respective partitions, i.e. job 1 -> s3:/
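
For clarity, a runnable Scala rendering of the write pattern described in the question (the dataFrame and bucket path are placeholders from the question, not working values):

import org.apache.spark.sql.SaveMode

// Each job appends its own category's daily data under the shared base path.
dataFrame.write
  .partitionBy("eventDate", "category")
  .mode(SaveMode.Append)
  .parquet("s3://bucket/save/path")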

Day 15 - Data Sources

僤鯓⒐⒋嵵緔 submitted on 2020-01-10 02:21:08
Preface: On day 13 we studied Spark SQL's DataFrame. Today we move on to Spark SQL data sources.

Spark data sources: In Spark SQL, a wide variety of data sources can be used to create a DataFrame or Dataset. Spark SQL is quite compatible with different data sources and provides a load method to load data and a save method to save it. By default, both load and save work with the parquet format.

The parquet data source: Spark also ships with some parquet demo data under examples/src/main/resources/ in the Spark installation directory; the official sample data is used below for a few demonstrations.

# Read parquet data
scala> var df2 = spark.read.load("/opt/module/spark-2.1.0-bin-hadoop2.7/examples/src/main/resources/users.parquet")
df2: org.apache.spark.sql.DataFrame = [name: string, favorite_color: string ... 1 more field]
# View the data
scala> df2.show
+------+--------------
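
Following on from the read above, a small hedged sketch: because save also defaults to Parquet, the same DataFrame can be written back without naming a format (the output path is a placeholder).

// Still in spark-shell: write df2 back out; Parquet is the default format.
df2.write.save("/tmp/users_copy.parquet")
// Equivalent to the explicit form: df2.write.format("parquet").save(...)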

Does Parquet predicate pushdown work on S3 when using Spark (non-EMR)?

核能气质少年 submitted on 2020-01-09 10:13:52
Question: Just wondering whether Parquet predicate pushdown also works on S3, not only HDFS, specifically if we use Spark (non-EMR). Further explanation might be helpful, since it might involve understanding of distributed file systems.

Answer 1: Yes. Filter pushdown does not depend on the underlying file system. It only depends on spark.sql.parquet.filterPushdown and the type of filter (not all filters can be pushed down). See https://github.com/apache/spark/blob/v2.2.0/sql/core/src/main/scala/org/apache
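
To illustrate the config mentioned in the answer (a sketch; the S3 path is a placeholder, and the flag is already true by default in recent Spark versions):

// In spark-shell: make sure Parquet filter pushdown is enabled, then check the plan.
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
spark.read.parquet("s3a://bucket/data").filter("id > 100").explain()
// The FileScan node's PushedFilters entry shows the predicates handed to the Parquet
// reader, regardless of whether the files live on HDFS or S3.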
