parquet

Does Parquet predicate pushdown work on S3 using Spark (non-EMR)?

一曲冷凌霜 submitted on 2020-01-09 10:12:48
Question: Just wondering whether Parquet predicate pushdown also works on S3, not only on HDFS, specifically when using Spark (non-EMR). Further explanation would be helpful, since it might involve some understanding of distributed file systems.

Answer 1: Yes. Filter pushdown does not depend on the underlying file system. It depends only on spark.sql.parquet.filterPushdown and on the type of filter (not all filters can be pushed down). See https://github.com/apache/spark/blob/v2.2.0/sql/core/src/main/scala/org/apache
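A minimal Scala sketch of the answer, assuming a SparkSession named spark, the s3a:// connector on the classpath, and a hypothetical bucket and column name; the same filter would be pushed down whether the path points at S3 or HDFS:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("pushdown-check")
  .config("spark.sql.parquet.filterPushdown", "true") // enabled by default; shown for clarity
  .getOrCreate()

import spark.implicits._

// Hypothetical bucket and column, used only for illustration.
val df = spark.read.parquet("s3a://my-bucket/events")
  .filter($"event_date" === "2020-01-09") // simple comparison filters can be pushed down

// Inspect the physical plan; a PushedFilters entry shows the predicate
// reached the Parquet reader.
df.explain(true)
```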

Pass a List[String] to a function that takes f(args: String*) in Scala

依然范特西╮ submitted on 2020-01-07 03:02:50
Question: I need to read specific Parquet files with Spark. I know this can be done like so: sqlContext.read.parquet("s3://bucket/key", "s3://bucket/key"). Right now I have a List[String] containing all of these S3 paths, but I don't know how to pass it programmatically to the parquet function in Scala. There are far too many files to do it manually; any ideas on how to get the files into the parquet function programmatically?

Answer 1: I've answered a similar question earlier concerning repeated
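A minimal sketch of the usual fix, assuming a SparkSession named spark (sqlContext.read.parquet works the same way): Scala expands a Seq into a varargs argument with the `: _*` type ascription, which matches the parquet(paths: String*) signature.

```scala
// Hypothetical list of S3 paths, used only for illustration.
val paths: List[String] = List(
  "s3://bucket/key1",
  "s3://bucket/key2",
  "s3://bucket/key3"
)

// `: _*` tells the compiler to pass the list as repeated (vararg) arguments.
val df = spark.read.parquet(paths: _*)
```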

Spark save(write) parquet only one file

不想你离开。 submitted on 2020-01-03 08:08:06
Question: If I write dataFrame.write.format("parquet").mode("append").save("temp.parquet"), the temp.parquet folder ends up with as many files as there are rows. I don't think I fully understand Parquet yet, but is that natural?

Answer 1: Use coalesce before the write operation: dataFrame.coalesce(1).write.format("parquet").mode("append").save("temp.parquet"). EDIT-1: Upon a closer look, the docs do warn about coalesce: However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your
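The docs warning is about upstream parallelism collapsing when you coalesce drastically. A minimal sketch of the commonly suggested alternative, assuming the same dataFrame: repartition(1) also yields a single output file, but it adds a full shuffle so the preceding stages keep their original parallelism.

```scala
dataFrame
  .repartition(1)          // full shuffle; upstream stages stay parallel
  .write
  .format("parquet")
  .mode("append")
  .save("temp.parquet")
```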

Hive LLAP doesn't work with Parquet format

萝らか妹 submitted on 2020-01-02 22:40:41
Question: After finding out about Hive LLAP, I really want to use it. I started an Azure HDInsight cluster with LLAP enabled. However, it doesn't seem to work any better than normal Hive. My data is stored in Parquet files, and I only see ORC files mentioned in LLAP-related docs and talks. Does LLAP also support the Parquet format?

Answer 1: Answering my own question: we reached out to Azure support, and Hive LLAP only works with the ORC file format (as of 05.2017). So with Parquet we either have to use Apache Impala for fast
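Since the answer is cut off, here is a hedged sketch of one common workaround (an assumption, not the poster's confirmed solution): rewrite the Parquet data as ORC with Spark so LLAP can work with it. The paths are hypothetical, and a SparkSession named spark is assumed.

```scala
// Read the existing Parquet data and write it back out as ORC.
val events = spark.read.parquet("/data/events_parquet")

events.write
  .mode("overwrite")
  .format("orc")
  .save("/data/events_orc")
```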

Spark SQL: Nested classes to parquet error

假装没事ソ submitted on 2020-01-02 07:18:08
Question: I can't seem to write a JavaRDD<T> to Parquet, where T is, say, a Person class. I've defined it as public class Person implements Serializable { private static final long serialVersionUID = 1L; private String name; private String age; private Address address; ... with Address: public class Address implements Serializable { private static final long serialVersionUID = 1L; private String City; private String Block; ...<getters and setters>. I then create a JavaRDD like so: JavaRDD<Person> people
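The question's code is cut off before the failing write, so as an illustration only, here is a minimal Scala sketch (an equivalent, not the poster's code) showing that nested types map onto Parquet's nested groups once they go through a Dataset; the field values and output path are hypothetical.

```scala
// Nested case classes corresponding to the Person/Address beans above.
case class Address(city: String, block: String)
case class Person(name: String, age: String, address: Address)

// Assumes a SparkSession named spark.
import spark.implicits._

val people = Seq(
  Person("Alice", "30", Address("Metropolis", "B1")),
  Person("Bob", "41", Address("Gotham", "C7"))
).toDS()

// The address column is written as a nested Parquet group.
people.write.mode("overwrite").parquet("/tmp/people_parquet")
```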

Disabling _spark_metadata in Structured Streaming in Spark 2.3.0

纵然是瞬间 submitted on 2020-01-02 05:46:12
Question: My Structured Streaming application writes to Parquet, and I want to get rid of the _spark_metadata folder it creates. I used the property below and it seemed fine: --conf "spark.hadoop.parquet.enable.summary-metadata=false". When the application starts, no _spark_metadata folder is generated. But once it moves to RUNNING status and starts processing messages, it fails with an error saying the _spark_metadata folder doesn't exist. It seems Structured Streaming relies on this folder without
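For context (the answer itself is cut off): the _spark_metadata directory is maintained by Spark's streaming file sink for exactly-once bookkeeping, which is separate from the Parquet summary-metadata setting, so disabling parquet.enable.summary-metadata does not remove it. A minimal sketch of the sink in question, assuming a streaming DataFrame named streamDf and hypothetical paths:

```scala
val query = streamDf.writeStream
  .format("parquet")
  .option("path", "/data/out")               // the sink writes /data/out/_spark_metadata
  .option("checkpointLocation", "/data/chk") // required for streaming file sinks
  .start()
```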

How to copy and convert parquet files to csv

两盒软妹~` submitted on 2020-01-02 04:45:27
Question: I have access to an HDFS file system and can see Parquet files with hadoop fs -ls /user/foo. How can I copy those Parquet files to my local system and convert them to CSV so I can use them? The files should be simple text files with a number of fields per row.

Answer 1: Try var df = spark.read.parquet("/path/to/infile.parquet") followed by df.write.csv("/path/to/outfile.csv"). Relevant API documentation: pyspark.sql.DataFrameReader.parquet, pyspark.sql.DataFrameWriter.csv. Both /path/to/infile.parquet and /path/to
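A slightly fuller sketch of the same answer, assuming a SparkSession named spark and hypothetical HDFS paths: write a single CSV file with a header, then copy it to the local file system with the hadoop CLI.

```scala
val df = spark.read.parquet("hdfs:///user/foo/data.parquet")

df.coalesce(1)                    // one output file instead of one per partition
  .write
  .option("header", "true")
  .csv("hdfs:///user/foo/data_csv")

// Afterwards, from a shell (outside Spark):
//   hadoop fs -copyToLocal /user/foo/data_csv ./data_csv
```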

Load a parquet file and keep the same number of HDFS partitions

只谈情不闲聊 submitted on 2020-01-02 00:32:09
Question: I have a Parquet file /df saved in HDFS with 120 partitions; the size of each partition on HDFS is around 43.5 M. Total size:

hdfs dfs -du -s -h /df
5.1 G  15.3 G  /df

hdfs dfs -du -h /df
43.6 M  130.7 M  /df/pid=0
43.5 M  130.5 M  /df/pid=1
...
43.6 M  130.9 M  /df/pid=119

I want to load that file into Spark and keep the same number of partitions. However, Spark automatically loads the file into 60 partitions:

df = spark.read.parquet('df')
df.rdd.getNumPartitions()
60

HDFS settings: 'parquet
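The entry is cut off before the answer, so the following is a hedged sketch rather than the accepted solution: Spark packs small Parquet splits together up to spark.sql.files.maxPartitionBytes (128 MB by default), which is why 120 files of roughly 43 MB collapse into about 60 read partitions. Lowering that cap to about one file per partition, or simply repartitioning after the read, restores 120 partitions. Assumes a SparkSession named spark.

```scala
// Cap each read partition at ~45 MB so one HDFS file maps to one partition.
spark.conf.set("spark.sql.files.maxPartitionBytes", 45L * 1024 * 1024)

val df = spark.read.parquet("/df")
println(df.rdd.getNumPartitions) // expected to be close to 120

// Alternative: keep the default read behaviour and reshuffle explicitly.
val df120 = spark.read.parquet("/df").repartition(120)
```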

Problems saving partitioned parquet HIVE table from Spark

匆匆过客 submitted on 2020-01-01 18:57:08
Question: Spark 1.6.0, Hive 1.1.0-cdh5.8.0. I have some problems saving my DataFrame into a Parquet-backed, partitioned Hive table from Spark. Here is my code:

val df = sqlContext.createDataFrame(rowRDD, schema)
df.write
  .mode(SaveMode.Append)
  .format("parquet")
  .partitionBy("year")
  .saveAsTable(output)

Nothing special, actually, but I can't read any data from the table once it is generated. The key point is the partitioning: without it, everything works fine. Here are my steps to fix the problem: At first, on
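The poster's fix is cut off mid-sentence, so the following is only a hedged sketch of a commonly suggested approach (table name and columns are hypothetical; sqlContext is assumed to be a HiveContext): let the Hive metastore own the partitioned table definition, append with insertInto, and register any partitions written outside of Hive.

```scala
// Allow dynamic partition inserts (needed when the partition value comes from the data).
sqlContext.setConf("hive.exec.dynamic.partition", "true")
sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

// Create the Hive-managed, Parquet-backed, partitioned table up front.
sqlContext.sql(
  """CREATE TABLE IF NOT EXISTS output_table (id INT, name STRING)
    |PARTITIONED BY (year INT)
    |STORED AS PARQUET""".stripMargin)

// insertInto matches columns by position; the partition column must come last in df.
df.write.insertInto("output_table")

// Register partitions that were written directly to the file system, if any.
sqlContext.sql("MSCK REPAIR TABLE output_table")
```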