parquet

Spark Dataframe validating column names for parquet writes (scala)

Submitted by 纵饮孤独 on 2019-11-27 06:54:33
Question: I'm processing events using DataFrames converted from a stream of JSON events, which eventually get written out in Parquet format. However, some of the JSON events contain spaces in their keys, and I want to log and filter/drop such events from the DataFrame before converting it to Parquet, because ,;{}()\n\t= are considered special characters in the Parquet schema (CatalystSchemaConverter), as listed in [1] below, and thus should not be allowed in the column names. How can I do such validations?
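A minimal sketch of one way to do this, assuming an existing DataFrame; the object and helper names are made up and this is not the asker's code:

```scala
// Collect the column names containing characters Parquet's CatalystSchemaConverter rejects
// ( ,;{}()\n\t= and space), log them, and rename them before the write.
import org.apache.spark.sql.DataFrame

object ColumnNameValidator {
  private val invalidChars: Set[Char] = " ,;{}()\n\t=".toSet

  def invalidColumns(df: DataFrame): Seq[String] =
    df.columns.filter(name => name.exists(invalidChars.contains)).toSeq

  def sanitize(df: DataFrame): DataFrame = {
    val bad = invalidColumns(df)
    if (bad.isEmpty) df
    else {
      // Log the offending names; this version replaces invalid characters with underscores,
      // but dropping the columns (or the events) would follow the same pattern.
      println(s"Columns with characters not allowed in Parquet: ${bad.mkString(", ")}")
      bad.foldLeft(df) { (acc, name) =>
        acc.withColumnRenamed(name, name.map(c => if (invalidChars.contains(c)) '_' else c))
      }
    }
  }
}
```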

第4章 SparkSQL数据源

Submitted by 寵の児 on 2019-11-27 03:52:32
Chapter 4: Spark SQL Data Sources. 4.1 Generic Load/Save Methods. 4.1.1 Manually Specifying Options. The Spark SQL DataFrame interface supports operations on a variety of data sources. A DataFrame can be operated on in RDD fashion, and it can also be registered as a temporary table; once registered, SQL queries can be run directly against it. Spark SQL's default data source is the Parquet format. When the data source is a Parquet file, Spark SQL can conveniently perform all of its operations on it. The default can be changed via the configuration property spark.sql.sources.default. val df = spark.read.load("examples/src/main/resources/users.parquet") df.select("name", "favorite_color").write.save("namesAndFavColors.parquet") When the data source is not a Parquet file, the format must be specified manually. The data source has to be given by its full name (for example org.apache.spark.sql.parquet), but for built-in formats the short name (json, parquet, jdbc, orc, libsvm, csv, text) is enough.
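A short sketch of the manual-format case described above (the file paths are illustrative, and an existing SparkSession named spark is assumed):

```scala
// Read JSON by naming the format explicitly, then save the result as Parquet.
val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")
peopleDF.select("name", "age").write.format("parquet").save("people_names.parquet")

// The built-in short names also have convenience methods, e.g. for CSV:
val csvDF = spark.read.option("header", "true").csv("data/people.csv")
```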

Spark lists all leaf nodes even in partitioned data

Submitted by 若如初见. on 2019-11-27 02:39:47
Question: I have parquet data partitioned by date & hour, with the following folder structure:
events_v3
  -- event_date=2015-01-01
    -- event_hour=2015-01-1
      -- part10000.parquet.gz
  -- event_date=2015-01-02
    -- event_hour=5
      -- part10000.parquet.gz
I have created a table raw_events via Spark, but when I try to query it, it scans all the directories for footers, which slows down the initial query even when I am querying only a single day's worth of data. Query: select * from raw_events where event_date='2016-01-01' Similar problem: http
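One hedged workaround sketch, not necessarily the accepted fix for this thread (the bucket path is hypothetical and an existing SparkSession named spark is assumed): read only the partition directories you need and keep the partition columns via the basePath option, so Spark does not have to list every leaf directory up front.

```scala
// Read a single day's partition directly; basePath tells Spark where partition discovery starts,
// so event_date and event_hour still appear as columns.
val oneDay = spark.read
  .option("basePath", "s3://bucket/events_v3/")
  .parquet("s3://bucket/events_v3/event_date=2016-01-01/")

oneDay.createOrReplaceTempView("raw_events_one_day")
spark.sql("select count(*) from raw_events_one_day").show()
```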

文件存储格式

Submitted by 为君一笑 on 2019-11-27 01:32:51
The main storage formats supported by Hive are TEXTFILE, SEQUENCEFILE, ORC, and PARQUET.
1. Characteristics of row storage: when a query needs a whole row that matches a condition, a columnar store has to visit each clustered field to pick up the corresponding column values, while a row store only has to locate one value and the rest sit right next to it, so for this kind of query row storage is faster.
2. Characteristics of columnar storage: because the data of each field is stored together, a query that touches only a few fields reads far less data; and since all values of a field share the same data type, compression algorithms can be tailored to each column.
TEXTFILE and SEQUENCEFILE are row-based storage formats; ORC and PARQUET are column-based. In practice, Hive tables usually use ORC or Parquet as the storage format, with Snappy or LZO as the compression codec.
Source: https://blog.csdn.net/weixin_42310279/article/details/99232749
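As a rough illustration of the recommendation above, a Scala sketch (the table and column names are made up) that declares an ORC-backed Hive table with Snappy compression from Spark:

```scala
import org.apache.spark.sql.SparkSession

object OrcTableExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("orc-table-example")
      .enableHiveSupport()          // needed so Spark talks to the Hive metastore
      .getOrCreate()

    // ORC storage with Snappy compression, as suggested for typical Hive tables.
    spark.sql(
      """CREATE TABLE IF NOT EXISTS logs_orc (id BIGINT, msg STRING)
        |STORED AS ORC
        |TBLPROPERTIES ('orc.compress' = 'SNAPPY')""".stripMargin)

    spark.stop()
  }
}
```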

What are the differences between feather and parquet?

Submitted by 為{幸葍}努か on 2019-11-27 00:06:00
Question: Both are columnar (disk) storage formats for use in data analysis systems. Both are integrated within Apache Arrow (the pyarrow package for Python) and are designed to correspond with Arrow as a columnar in-memory analytics layer. How do the two formats differ? Should you always prefer Feather when working with pandas, whenever possible? What are the use cases where Feather is more suitable than Parquet, and the other way round? Appendix: I found some hints here https://github.com/wesm/feather/issues/188,

Avro vs. Parquet

Submitted by ⅰ亾dé卋堺 on 2019-11-26 23:57:44
Question: I'm planning to use one of the Hadoop file formats for my Hadoop-related project. I understand Parquet is efficient for column-based queries and Avro for full scans, or when we need all of the column data. Before I proceed and choose one of the file formats, I want to understand the disadvantages/drawbacks of one over the other. Can anyone explain it to me in simple terms?
Answer 1: If you haven't already decided, I'd go ahead and write Avro schemas for your data. Once that's done, choosing
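If it helps to compare the two on your own data, a hedged sketch (assumes Spark 2.4+ with the external spark-avro module on the classpath; the paths are hypothetical and an existing SparkSession named spark is assumed) that writes the same DataFrame in both formats:

```scala
// Write one dataset both ways and benchmark the queries you actually run.
val events = spark.read.json("data/events.json")

events.write.format("avro").save("out/events_avro")        // row-oriented; suits full scans
events.write.format("parquet").save("out/events_parquet")  // columnar; suits selective queries
```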

What are the pros and cons of parquet format compared to other formats?

Submitted by 时光总嘲笑我的痴心妄想 on 2019-11-26 23:46:15
Question: The characteristics of Apache Parquet are: self-describing, columnar format, and language-independent. In comparison to Avro, Sequence Files, RC File etc., I want an overview of the formats. I have already read How Impala Works with Hadoop File Formats; it gives some insight into the formats, but I would like to know how the access to data and the storage of data are done in each of these formats. How does Parquet have an advantage over the others?
Answer 1: I think the main difference I can describe relates to record
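A tiny illustration of the columnar-access point (the path and column names are hypothetical; an existing SparkSession named spark is assumed): with Parquet, Spark reads only the projected columns, which shows up in the physical plan's ReadSchema.

```scala
val events = spark.read.parquet("out/events_parquet")
// The scan's ReadSchema in the printed plan lists only the two selected columns.
events.select("user_id", "event_type").explain()
```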

How to split parquet files into many partitions in Spark?

Submitted by 怎甘沉沦 on 2019-11-26 23:14:08
Question: So I have just one Parquet file that I'm reading with Spark (using the SQL stuff) and I'd like it to be processed with 100 partitions. I've tried setting spark.default.parallelism to 100; we have also tried changing the compression of the Parquet file to none (from gzip). No matter what we do, the first stage of the Spark job has only a single partition (once a shuffle occurs it gets repartitioned into 100, and thereafter things are obviously much, much faster). Now according to a few sources (like below)
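Two common approaches, sketched with a hypothetical path and without claiming to be this thread's accepted answer: repartition after the read (which costs a shuffle), or on Spark 2.x+ shrink the input split size so a single Parquet file is cut into many input partitions.

```scala
// Assumes an existing SparkSession named `spark`.
// Option 2: smaller input splits, so one file yields many partitions at read time.
spark.conf.set("spark.sql.files.maxPartitionBytes", 16L * 1024 * 1024)  // 16 MB splits

val single = spark.read.parquet("data/one_big_file.parquet")
// Option 1: explicit repartition after the read (triggers a shuffle).
val wide = single.repartition(100)
println(wide.rdd.getNumPartitions)  // expected: 100
```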

Does Spark support Partition Pruning with Parquet Files

Submitted by 你离开我真会死。 on 2019-11-26 23:02:03
Question: I am working with a large dataset that is partitioned by two columns, plant_name and tag_id. The second partition column, tag_id, has 200000 unique values, and I mostly access the data by specific tag_id values. If I use the following Spark commands:
sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
val df = sqlContext.sql("select * from tag_data where plant_name='PLANT01' and tag_id='1000'")
I would expect a fast
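A hedged way to check whether pruning actually happens, reusing the sqlContext and table from the question: print the physical plan and look for PartitionFilters on plant_name and tag_id rather than a scan over all 200000 tag_id directories.

```scala
val df = sqlContext.sql(
  "select * from tag_data where plant_name = 'PLANT01' and tag_id = '1000'")
// Inspect PartitionFilters / PushedFilters in the printed physical plan.
df.explain(true)
```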

Create Hive table to read parquet files from parquet/avro schema

Submitted by 时光怂恿深爱的人放手 on 2019-11-26 21:09:03
Question: We are looking for a way to create an external Hive table that reads data from Parquet files according to a Parquet/Avro schema. In other words: how can a Hive table be generated from a Parquet/Avro schema? Thanks :)
Answer 1: Try the following using the Avro schema:
CREATE TABLE avro_test ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS AVRO TBLPROPERTIES ('avro.schema.url'='myHost/myAvroSchema.avsc');
CREATE EXTERNAL TABLE parquet_test LIKE avro_test STORED AS PARQUET LOCATION