parquet

Spark 2.2 cannot write df to parquet

佐手、 submitted on 2019-12-10 17:32:54
Question: I'm building a clustering algorithm and I need to store the model for future loading. I have a dataframe with this schema: val schema = new StructType() .add(StructField("uniqueId", LongType)) .add(StructField("timestamp", LongType)) .add(StructField("pt", ArrayType(DoubleType))) .add(StructField("norm", DoubleType)) .add(StructField("kNN", ArrayType(LongType))) .add(StructField("kDist", DoubleType)) .add(StructField("lrd", DoubleType)) .add(StructField("lof", DoubleType)) .add(StructField(
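
The question's schema is Scala and the excerpt cuts off before the actual error, so purely as a minimal sketch of the intended write/read round trip, a PySpark equivalent might look like the following (the output path and the trimmed-down column set are assumptions for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField, LongType,
                                   DoubleType, ArrayType)

    spark = SparkSession.builder.appName("lof-model-store").getOrCreate()

    # A trimmed-down version of the schema from the question.
    schema = StructType([
        StructField("uniqueId", LongType()),
        StructField("timestamp", LongType()),
        StructField("pt", ArrayType(DoubleType())),
        StructField("lof", DoubleType()),
    ])

    rows = [(1, 1000, [0.1, 0.2], 1.3), (2, 1001, [0.4, 0.5], 0.9)]
    df = spark.createDataFrame(rows, schema)

    # Persist the model state, then load it back later.
    df.write.mode("overwrite").parquet("/tmp/lof_model")
    restored = spark.read.parquet("/tmp/lof_model")
    restored.show()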

How do you use s3a with Spark 2.1.0 on AWS us-east-2?

自作多情 submitted on 2019-12-10 16:48:54
Question: Background: I have been working on getting a flexible setup for myself to use Spark on AWS with Docker swarm mode. The Docker image I have been using is configured to use the latest Spark, which at the time is 2.1.0 with Hadoop 2.7.3, and is available at jupyter/pyspark-notebook. This is working, and I have just been going through to test out the various connectivity paths that I plan to use. The issue I came across is the uncertainty around the correct way to interact with S3. I have
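
One common stumbling block with this combination is that the us-east-2 region only accepts Signature Version 4 requests, which the AWS SDK bundled with Hadoop 2.7 does not enable by default. A minimal PySpark sketch of the s3a configuration, assuming the hadoop-aws and aws-java-sdk jars matching Hadoop 2.7.3 are already on the classpath (e.g. via spark.jars.packages or baked into the image); bucket, prefix, and credentials are placeholders:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("s3a-us-east-2")
        # Executors need the SigV4 flag; the driver needs the same flag at JVM
        # launch (e.g. through --driver-java-options with spark-submit).
        .config("spark.executor.extraJavaOptions",
                "-Dcom.amazonaws.services.s3.enableV4=true")
        .getOrCreate()
    )

    hconf = spark.sparkContext._jsc.hadoopConfiguration()
    hconf.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")  # region-specific endpoint
    hconf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")           # placeholder credentials
    hconf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

    df = spark.read.parquet("s3a://your-bucket/some/prefix/")   # hypothetical path
    df.show()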

Parquet Output From Kafka Connect to S3

非 Y 不嫁゛ submitted on 2019-12-10 15:12:49
Question: I see Kafka Connect can write to S3 in Avro or JSON formats, but there is no Parquet support. How hard would this be to add? Answer 1: The Qubole connector supports writing out Parquet - https://github.com/qubole/streamx Answer 2: Try secor: https://github.com/pinterest/secor It can work with AWS S3, Google Cloud, Azure's Blob Storage, etc. Note that the solution you choose must have key features like: guaranteeing each message is written exactly once, load distribution, fault tolerance, monitoring,

Why does the query performance differ with nested columns in Spark SQL?

ⅰ亾dé卋堺 submitted on 2019-12-10 14:55:07
Question: I write some data in the Parquet format using Spark SQL, where the resulting schema looks like the following: root |-- stateLevel: struct (nullable = true) | |-- count1: integer (nullable = false) | |-- count2: integer (nullable = false) | |-- count3: integer (nullable = false) | |-- count4: integer (nullable = false) | |-- count5: integer (nullable = false) |-- countryLevel: struct (nullable = true) | |-- count1: integer (nullable = false) | |-- count2: integer (nullable = false) | |-- count3
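
Older Spark releases read the whole struct from Parquet even when a query touches only a single nested field (nested-schema pruning arrived later), which is one common explanation for the performance gap. A frequently used workaround is to flatten the struct into top-level columns before writing; a minimal PySpark sketch, where the paths and the chosen columns are assumptions:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("flatten-nested").getOrCreate()

    df = spark.read.parquet("/tmp/nested_counts")   # hypothetical input path

    # Promote the nested fields to top-level columns so Parquet column pruning
    # reads only the columns a query actually touches.
    flat = df.select(
        col("stateLevel.count1").alias("state_count1"),
        col("stateLevel.count2").alias("state_count2"),
        col("countryLevel.count1").alias("country_count1"),
    )
    flat.write.mode("overwrite").parquet("/tmp/flat_counts")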

Hive - Varchar vs String: is there any advantage if the storage format is the Parquet file format?

寵の児 submitted on 2019-12-10 14:48:44
Question: I have a Hive table which will hold billions of records; it's time-series data, so the partition is per minute. Per minute we will have around 1 million records. I have a few fields in my table: VIN number (17 chars), Status (2 chars) ... etc. So my question is: during table creation, if I choose to use Varchar(X) vs String, is there any storage or performance problem? A few limitations of varchar are https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-string

How do I configure the file format of AWS Athena results?

时光怂恿深爱的人放手 submitted on 2019-12-10 13:46:06
Question: Currently, the Athena query results are in TSV format in S3. Is there any way to configure Athena queries to return results in Parquet format? Answer 1: At the moment it isn't possible to do this directly with Athena. When it comes to configuring the result of an Athena query, you can only set up the query result location and the encryption configuration. Workaround: 1) Since October, Athena supports CTAS queries; you can try to use this feature. https://docs.aws.amazon.com/athena/latest/ug/ctas.html https:/
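
The CTAS workaround from Answer 1 rewrites the queried data itself as Parquet rather than changing the format of the result files Athena drops into the query-results location. A minimal sketch using boto3, where the region, database, table, and bucket names are all hypothetical:

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")  # region is an assumption

    ctas = """
        CREATE TABLE my_db.events_parquet
        WITH (
            format = 'PARQUET',
            external_location = 's3://my-bucket/athena-ctas/events_parquet/'
        ) AS
        SELECT *
        FROM my_db.events_tsv
    """

    athena.start_query_execution(
        QueryString=ctas,
        QueryExecutionContext={"Database": "my_db"},
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-query-results/"},
    )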

Using predicates to filter rows from pyarrow.parquet.ParquetDataset

让人想犯罪 __ submitted on 2019-12-10 11:54:44
Question: I have a Parquet dataset stored on S3, and I would like to query specific rows from the dataset. I was able to do that using petastorm, but now I want to do it using only pyarrow. Here's my attempt: import pyarrow.parquet as pq import s3fs fs = s3fs.S3FileSystem() dataset = pq.ParquetDataset( 'analytics.xxx', filesystem=fs, validate_schema=False, filters=[('event_name', '=', 'SomeEvent')] ) df = dataset.read_pandas().to_pandas() But that returns a pandas DataFrame as if the filter didn't
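
In older pyarrow releases, ParquetDataset applied filters only to hive-style partition keys, so a predicate on an ordinary column such as event_name was silently ignored. If upgrading is an option, the newer pyarrow.dataset API evaluates filters against row-group statistics for any column; a minimal sketch, assuming a reasonably recent pyarrow:

    import pyarrow.dataset as ds
    import s3fs

    fs = s3fs.S3FileSystem()

    # The dataset API pushes the predicate down to row groups, so it is honoured
    # even when event_name is not a partition key.
    dataset = ds.dataset("analytics.xxx", filesystem=fs, format="parquet")
    table = dataset.to_table(filter=ds.field("event_name") == "SomeEvent")
    df = table.to_pandas()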

How can I insert into a hive table with parquet fileformat and SNAPPY compression?

泄露秘密 submitted on 2019-12-10 11:47:14
Question: Hive 2.1. I have the following table definition: CREATE EXTERNAL TABLE table_snappy ( a STRING, b INT) PARTITIONED BY (c STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION '/' TBLPROPERTIES ('parquet.compress'='SNAPPY'); Now, I would like to insert data into it: INSERT INTO table_snappy
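
One detail worth checking first: the Parquet SerDe reads the table property parquet.compression rather than parquet.compress, and the codec can also be set per session before the INSERT. A minimal sketch submitting the statements through PyHive against HiveServer2 (the host, port, partition value, and inserted row are placeholders, and PyHive is just one convenient way to run the same statements you could issue from beeline):

    from pyhive import hive

    conn = hive.connect(host="hiveserver2.example.com", port=10000)  # placeholder endpoint
    cur = conn.cursor()

    # Session-level compression setting for Parquet writes.
    cur.execute("SET parquet.compression=SNAPPY")

    # Static-partition insert into the table from the question.
    cur.execute("INSERT INTO table_snappy PARTITION (c='c1') VALUES ('some_value', 1)")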

Spark Streaming appends to S3 as Parquet format, too many small partitions

喜你入骨 submitted on 2019-12-10 11:25:15
Question: I am building an app that uses Spark Streaming to receive data from Kinesis streams on AWS EMR. One of the goals is to persist the data into S3 (EMRFS), and for this I am using a 2-minute non-overlapping window. My approach: Kinesis Stream -> Spark Streaming with a batch duration of about 60 seconds, using a non-overlapping window of 120s, saving the streamed data into S3 as: val rdd1 = kinesisStream.map( rdd => /* decode the data */) rdd1.window(Seconds(120), Seconds(120)).foreachRDD { rdd => val
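
The usual cause of the many small files is that each window arrives spread over many tiny partitions, and every partition becomes its own Parquet file. A common mitigation is to coalesce before writing; a minimal PySpark sketch of the function passed to foreachRDD, where the bucket, the coalesce target of 4, and the record format are assumptions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kinesis-to-s3").getOrCreate()

    def write_window(rdd):
        # Called from foreachRDD on the 120-second window; records are assumed to
        # already be decoded into tuples or Rows.
        if rdd.isEmpty():
            return
        df = spark.createDataFrame(rdd)
        # Collapse the window's many tiny partitions so S3 receives a few larger
        # Parquet files instead of hundreds of small ones.
        (df.coalesce(4)
           .write.mode("append")
           .parquet("s3a://my-bucket/stream-output/"))  # hypothetical destination

    # windowed_stream.foreachRDD(write_window)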

Avoid losing data type for the partitioned data when writing from Spark

丶灬走出姿态 submitted on 2019-12-10 10:59:27
Question: I have a dataframe like below. itemName, itemCategory Name1, C0 Name2, C1 Name3, C0 I would like to save this dataframe as a partitioned parquet file: df.write.mode("overwrite").partitionBy("itemCategory").parquet(path) For this dataframe, when I read the data back, it will have String as the data type for itemCategory. However, at times I have dataframes from other tenants as below. itemName, itemCategory Name1, 0 Name2, 1 Name3, 0 In this case, after being written as a partition, when read back,
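
One way to keep itemCategory consistent across tenants is to stop Spark from inferring partition-column types when the data is read back; a minimal sketch, where the read path is a placeholder for whatever path was used with partitionBy:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-column-types").getOrCreate()

    # By default Spark infers partition-column types from the directory names, so
    # itemCategory=0/1 comes back as an integer while C0/C1 comes back as a string.
    # Disabling the inference keeps partition columns as strings for every tenant.
    spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

    df = spark.read.parquet("/tmp/items_by_category")  # hypothetical path
    df.printSchema()                                   # itemCategory: string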