parquet

select specific columns in Spark DataFrames from Array of Struct

倾然丶 夕夏残阳落幕 submitted on 2019-12-23 17:15:32
Question: I have a Spark DataFrame df with the following schema:
root
 |-- k: integer (nullable = false)
 |-- v: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a: integer (nullable = false)
 |    |    |-- b: double (nullable = false)
 |    |    |-- c: string (nullable = true)
Is it possible to just select a and c in v from df without doing a map? In particular, df is loaded from a Parquet file and I don't want the values for c to even be loaded/read. Answer 1: It depends on exactly what you
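A quick sketch of the non-map approach (an assumption based on standard Spark behaviour, not the answer excerpted above): selecting a field path through an array of structs returns an array of that field, so the other struct fields never need to be read. The column names come from the question; the input path is hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Parquet is columnar, so selecting only v.a and v.c lets Spark skip
# the other struct fields at read time.
df = spark.read.parquet("/path/to/table")    # hypothetical path
slim = df.select("k", "v.a", "v.c")          # v.a and v.c come back as arrays
slim.printSchema()
```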

Out of memory error when writing out spark dataframes to parquet format

我与影子孤独终老i submitted on 2019-12-23 09:32:19
Question: I'm trying to query data from a database, do some transformations on it, and save the new data in Parquet format on HDFS. Since the database query returns a large number of rows, I'm getting the data in batches and running the above process on every incoming batch. UPDATE 2: The batch processing logic is:
import scala.collection.JavaConverters._
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.
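The excerpt cuts off before the batch logic, so the following is only a sketch of the usual pattern for keeping memory flat (assumptions: a SparkSession named spark, a JDBC source, and hypothetical table/path names): write each batch out with append mode instead of unioning the batches into one DataFrame.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def process_batch(batch_df):
    # placeholder for the question's transformations
    return batch_df

for lower, upper in [(0, 100000), (100000, 200000)]:          # hypothetical batch bounds
    batch = (spark.read.format("jdbc")
             .option("url", "jdbc:postgresql://host/db")       # hypothetical connection
             .option("dbtable",
                     f"(SELECT * FROM src WHERE id >= {lower} AND id < {upper}) t")
             .load())
    # Appending each batch means no batch has to stay cached between iterations.
    process_batch(batch).write.mode("append").parquet("hdfs:///out/table")
```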

How to combine small parquet files to one large parquet file? [duplicate]

半城伤御伤魂 submitted on 2019-12-23 05:31:22
Question: This question already has answers here: Spark dataframe write method writing many small files (6 answers). Closed last year. I have some partitioned Hive tables which point to Parquet files. Now I have a lot of small Parquet files for each partition, each around 5 KB in size, and I want to merge those small files into one large file per partition. How can I achieve this to improve my Hive performance? I have tried reading all the parquet files in the partition into a pyspark dataframe and
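A minimal sketch of the compaction approach the question hints at (assumptions: PySpark, hypothetical paths, and that rewriting the partition is acceptable): read one partition, coalesce to a single output file, and write it to a temporary location before swapping directories.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

src = "hdfs:///warehouse/my_table/dt=2019-12-01"   # hypothetical partition path
tmp = src + "_compacted"

# coalesce(1) forces a single output file for this partition; writing to a
# temporary directory first avoids losing data if the job fails mid-write.
spark.read.parquet(src).coalesce(1).write.mode("overwrite").parquet(tmp)
```

After the compacted copy is verified, the original partition directory can be replaced with the temporary one and the Hive metastore refreshed.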

Exporting nested fields with invalid characters from Spark 2 to Parquet [duplicate]

我的梦境 submitted on 2019-12-23 01:50:15
Question: This question already has answers here: Spark Dataframe validating column names for parquet writes (scala) (4 answers). Closed last year. I am trying to use Spark 2.0.2 to convert a JSON file into Parquet. The JSON file comes from an external source and therefore the schema can't be changed before it arrives. The file contains a map of attributes. The attribute names aren't known before I receive the file. The attribute names contain characters that can't be used in Parquet. { "id" : 1, "name"
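A sketch of the usual workaround (an assumption, not the excerpted answer): rename columns by replacing the characters Parquet rejects (" ,;{}()\n\t=") before writing. This version handles top-level names only; nested struct fields would need the same substitution applied recursively through the schema.

```python
import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("/path/to/input.json")        # hypothetical input

INVALID = re.compile(r"[ ,;{}()\n\t=]")

# Replace every character Parquet rejects in a column name with an underscore.
sanitized = df.toDF(*(INVALID.sub("_", c) for c in df.columns))
sanitized.write.parquet("/path/to/output")         # hypothetical output
```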

How to specify logical types when writing Parquet files from PyArrow?

那年仲夏 submitted on 2019-12-22 10:53:06
Question: I'm using PyArrow to write Parquet files from some Pandas dataframes in Python. Is there a way to specify the logical types that are written to the Parquet file? For example, writing an np.uint32 column in PyArrow results in an INT64 column in the Parquet file, whereas writing the same column using the fastparquet module results in an INT32 column with a logical type of UINT_32 (this is the behaviour I'm after from PyArrow). E.g.:
import pandas as pd
import pyarrow as pa
import pyarrow
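For reference, a sketch of one way this is commonly resolved (an assumption about PyArrow's behaviour, not the excerpted answer): the default Parquet format version has no unsigned 32-bit type, so uint32 gets widened to INT64; asking pyarrow.parquet.write_table for a newer format version keeps the UINT_32 logical annotation.

```python
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"x": np.arange(3, dtype=np.uint32)})
table = pa.Table.from_pandas(df)

# With format version "1.0" uint32 is written as INT64; a newer format
# version ("2.4"/"2.6" in recent PyArrow releases) preserves the unsigned type.
pq.write_table(table, "uint32_example.parquet", version="2.6")
print(pq.read_schema("uint32_example.parquet"))
```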

How to set Parquet file encoding in Spark

那年仲夏 submitted on 2019-12-22 06:45:22
Question: The Parquet documentation describes a few different encodings here. Does the encoding change somehow inside the file during read/write, or can I set it? There is nothing about it in the Spark documentation. I only found slides from a talk by Ryan Blue of the Netflix team. He sets Parquet configurations on the sqlContext:
sqlContext.setConf("parquet.filter.dictionary.enabled", "true")
It looks like this isn't about plain dictionary encoding in Parquet files. Answer 1: So I found an answer to my question on the Twitter engineering blog. Parquet has an
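As a point of comparison, a minimal sketch of the one encoding knob that is straightforward to toggle from Spark (an assumption based on standard Parquet writer properties, not the excerpted answer): dictionary encoding, controlled by the Parquet property parquet.enable.dictionary on the Hadoop configuration. The remaining encodings are chosen by the Parquet writer per column.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000).withColumnRenamed("id", "value")   # toy data

# parquet.enable.dictionary is a Parquet writer property (true by default);
# Spark passes Hadoop configuration entries through to the Parquet writer.
spark.sparkContext._jsc.hadoopConfiguration().set("parquet.enable.dictionary", "true")
df.write.mode("overwrite").parquet("/tmp/parquet_dictionary_demo")
```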

Reading/writing with Avro schemas AND Parquet format in SparkSQL

杀马特。学长 韩版系。学妹 submitted on 2019-12-22 06:44:30
Question: I'm trying to write and read Parquet files from SparkSQL. For reasons of schema evolution, I would like to use Avro schemas with my writes and reads. My understanding is that this is possible outside of Spark (or manually within Spark) using e.g. AvroParquetWriter and Avro's Generic API. However, I would like to use SparkSQL's write() and read() methods (which work with DataFrameWriter and DataFrameReader) and which integrate well with SparkSQL (I will be writing and reading Datasets). I
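The question asks specifically about Avro schemas; purely for contrast, here is a sketch of the schema-evolution support SparkSQL itself ships for Parquet (a different mechanism from AvroParquetWriter, shown only as a related point of reference): the mergeSchema read option reconciles Parquet files written with different but compatible schemas.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two writes with evolving schemas into the same directory (hypothetical path).
spark.createDataFrame([(1, "a")], ["id", "name"]) \
    .write.mode("append").parquet("/tmp/evolving")
spark.createDataFrame([(2, "b", 3.0)], ["id", "name", "score"]) \
    .write.mode("append").parquet("/tmp/evolving")

# mergeSchema unions the column sets; rows written before "score" existed read back as null.
merged = spark.read.option("mergeSchema", "true").parquet("/tmp/evolving")
merged.printSchema()
```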

Partitions not being pruned in simple SparkSQL queries

爱⌒轻易说出口 submitted on 2019-12-22 05:33:28
Question: I'm trying to efficiently select individual partitions from a SparkSQL table (Parquet in S3). However, I see evidence of Spark opening all the Parquet files in the table, not just those that pass the filter. This makes even small queries expensive for tables with a large number of partitions. Here's an illustrative example. I created a simple partitioned table on S3 using SparkSQL and a Hive metastore:
# Make some data
df = pandas.DataFrame({'pk': ['a']*5+['b']*5+['c']*5, 'k': ['a', 'e', 'i', 'o',
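A short sketch of how partition pruning is usually verified (assumptions: PySpark and a hypothetical local path; this is not the S3/metastore setup from the truncated example): write a table partitioned by pk, filter on pk, and check the physical plan for a PartitionFilters entry.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [("a", 1), ("a", 2), ("b", 3), ("c", 4)]
df = spark.createDataFrame(data, ["pk", "k"])
df.write.partitionBy("pk").mode("overwrite").parquet("/tmp/pruning_demo")  # hypothetical path

# If pruning works, the plan's PartitionFilters entry lists pk = 'b' and only
# that directory is scanned; wrapping the filter in a UDF or a cast can defeat it.
spark.read.parquet("/tmp/pruning_demo").filter("pk = 'b'").explain(True)
```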

Cloudera 5.6: Parquet does not support date. See HIVE-6384

徘徊边缘 submitted on 2019-12-22 04:21:23
Question: I am currently using Cloudera 5.6 and trying to create a Parquet-format Hive table based on another table, but I am running into an error.
create table sfdc_opportunities_sandbox_parquet like sfdc_opportunities_sandbox STORED AS PARQUET
Error Message: Parquet does not support date. See HIVE-6384
I read that Hive 1.2 has a fix for this issue, but Cloudera 5.6 and 5.7 do not come with Hive 1.2. Has anyone found a way around this issue? Answer 1: Apart from using another data type like TIMESTAMP
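The excerpted answer stops at the workaround's name, so the following is only an illustration of that idea (the column names are hypothetical, and the statement is issued from PySpark purely to keep these examples in one language; the same HiveQL can be run in Hive directly): build the Parquet table with CREATE TABLE ... AS SELECT, casting DATE columns to TIMESTAMP (or STRING).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# CTAS instead of CREATE TABLE ... LIKE, so the DATE column can be cast to a
# type that pre-1.2 Hive can store in Parquet.
spark.sql("""
    CREATE TABLE sfdc_opportunities_sandbox_parquet
    STORED AS PARQUET
    AS
    SELECT opportunity_id,                              -- hypothetical columns
           CAST(close_date AS TIMESTAMP) AS close_date  -- hypothetical DATE column
    FROM sfdc_opportunities_sandbox
""")
```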
