parquet

Why does Apache Spark read unnecessary Parquet columns within nested structures?

不羁的心 submitted on 2019-12-20 17:18:04
Question: My team is building an ETL process to load raw delimited text files into a Parquet-based "data lake" using Spark. One of the promises of the Parquet column store is that a query will only read the necessary "column stripes". But we're seeing unexpected columns being read for nested schema structures. To demonstrate, here is a POC using Scala and the Spark 2.0.1 shell:

// Preliminary setup
sc.setLogLevel("INFO")
import org.apache.spark.sql.types._
import org.apache.spark.sql._
// Create a
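The POC above is cut off, but the behaviour it describes can be reproduced with a short script. The following is a minimal PySpark sketch (not the poster's original Scala code; the path and schema are invented) that writes a nested struct to Parquet, selects a single nested field, and prints the physical plan, whose ReadSchema entry shows which columns the Parquet scan is actually asked for.

```python
from pyspark.sql import SparkSession, Row

spark = (SparkSession.builder
         .master("local[*]")
         .appName("nested-pruning-poc")
         .getOrCreate())

# A tiny DataFrame with a nested struct column, written out as Parquet.
df = spark.createDataFrame([
    Row(id=1, person=Row(name="alice", age=30)),
    Row(id=2, person=Row(name="bob", age=25)),
])
df.write.mode("overwrite").parquet("/tmp/nested_poc")

# Select only one nested field and inspect the physical plan; ReadSchema
# lists the columns the Parquet reader is asked to materialize.
spark.read.parquet("/tmp/nested_poc").select("person.name").explain(True)
```

In later Spark releases, pruning of nested fields is reportedly governed by the spark.sql.optimizer.nestedSchemaPruning.enabled option, so results may differ from the Spark 2.0.1 behaviour the question describes.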

Generate metadata for parquet files

这一生的挚爱 submitted on 2019-12-20 11:59:29
Question: I have a Hive table that is built on top of a load of external Parquet files. The Parquet files should be generated by the Spark job, but because the metadata flag was set to false they were not generated. I'm wondering whether it is possible to restore it in some painless way. The structure of the files is as follows:

/apps/hive/warehouse/test_db.db/test_table/_SUCCESS
/apps/hive/warehouse/test_db.db/test_table/_common_metadata
/apps/hive/warehouse/test_db.db/test_table/_metadata
/apps/hive/warehouse/test
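The question is truncated, but if the goal is to (re)create the Parquet summary files (_metadata and _common_metadata) without rewriting the data, one possible approach is a small pyarrow script. This is only a sketch under the assumption that the part files are readable from wherever it runs; the file names below are placeholders, not the real paths.

```python
import pyarrow.parquet as pq

# Placeholder file names; in practice, list the existing part files.
paths = ["part-00000.parquet", "part-00001.parquet"]
schema = pq.read_schema(paths[0])

# Collect the footer metadata of every data file, tagging each with its path.
metadata_collector = []
for p in paths:
    md = pq.read_metadata(p)
    md.set_file_path(p)
    metadata_collector.append(md)

# _common_metadata carries only the schema; _metadata also carries the
# row-group information of all part files.
pq.write_metadata(schema, "_common_metadata")
pq.write_metadata(schema, "_metadata", metadata_collector=metadata_collector)
```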

Methods for writing Parquet files using Python?

元气小坏坏 submitted on 2019-12-20 11:24:07
Question: I'm having trouble finding a library that allows Parquet files to be written using Python. Bonus points if I can use Snappy or a similar compression mechanism in conjunction with it. Thus far the only method I have found is using Spark with the pyspark.sql.DataFrame Parquet support. I have some scripts that need to write Parquet files that are not Spark jobs. Is there any approach to writing Parquet files in Python that doesn't involve pyspark.sql?
Answer 1: Update (March 2017): There are
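The answer is cut off above. Purely as an illustration of one non-Spark option that also supports Snappy, here is a minimal pyarrow sketch (the file name and data are invented); fastparquet is another library commonly mentioned for the same task.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small in-memory table and write it out with Snappy compression.
table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})
pq.write_table(table, "example.parquet", compression="snappy")
```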

Creating hive table using parquet file metadata

二次信任 submitted on 2019-12-20 09:57:09
Question: I wrote a DataFrame as a Parquet file, and I would like to read the file into Hive using the metadata from Parquet.

Output from writing parquet:
_common_metadata
part-r-00000-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet
part-r-00002-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet
_SUCCESS
_metadata
part-r-00001-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet
part-r-00003-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet

Hive table:
CREATE TABLE testhive ROW FORMAT SERDE 'org.apache
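The DDL in the question is cut off. One hedged alternative (not necessarily what the answers propose) is to let Spark infer the schema from the Parquet footers and register a table over the existing directory; the path and table name below are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Create a table whose schema is inferred from the Parquet files at LOCATION.
spark.sql("""
    CREATE TABLE IF NOT EXISTS testhive
    USING parquet
    LOCATION '/path/to/parquet/output'
""")
spark.table("testhive").printSchema()
```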

How to match Dataframe column names to Scala case class attributes?

主宰稳场 submitted on 2019-12-20 09:47:16
Question: The column names in this example from spark-sql come from the case class Person.

case class Person(name: String, age: Int)
val people: RDD[Person] = ... // An RDD of case class objects, from the previous example.
// The RDD is implicitly converted to a SchemaRDD by createSchemaRDD, allowing it to be stored using Parquet.
people.saveAsParquetFile("people.parquet")

https://spark.apache.org/docs/1.1.0/sql-programming-guide.html

However, in many cases the parameter names may be changed. This

Does Spark support true column scans over parquet files in S3?

不羁的心 submitted on 2019-12-20 09:28:21
Question: One of the great benefits of the Parquet data storage format is that it's columnar. If I've got a 'wide' dataset with hundreds of columns, but my query only touches a few of those, then it's possible to read only the data that stores those few columns and skip the rest. Presumably this feature works by reading a bit of metadata at the head of a parquet file that indicates the locations on the filesystem for each column. The reader can then seek on disk to read in only the necessary columns.
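As a quick way to check what Spark asks the Parquet reader for, here is a hedged PySpark sketch (the bucket and column names are placeholders): the ReadSchema field of the physical plan lists the columns pushed down to the scan. Whether only those byte ranges are actually fetched from S3 additionally depends on the S3 filesystem connector in use.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a wide Parquet dataset from S3 but project only two columns;
# explain(True) prints the physical plan, whose ReadSchema shows the
# columns the Parquet scan is asked to materialize.
df = spark.read.parquet("s3a://my-bucket/wide_dataset/")
df.select("col_a", "col_b").explain(True)
```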

How to get list of all columns from a parquet file using s3 select?

99封情书 submitted on 2019-12-20 07:19:13
Question: I have a Parquet file stored in an S3 bucket. I want to get the list of all columns of the Parquet file. I am using S3 Select, but it just gives me a list of all rows without any column headers. Is there any way to get all column names from this Parquet file without downloading it completely? Since the Parquet file can be very large, I would not want to download it in full, which is why I am using S3 Select to pick the first few rows using select * from S3Object LIMIT 10. I tried to fetch column
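Not an S3 Select answer, but one possible alternative: the Parquet footer alone contains the column names, and reading it through a filesystem layer that issues range requests avoids downloading the whole object. The sketch below assumes pyarrow and s3fs are available; the bucket and key are placeholders.

```python
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()

# Constructing ParquetFile on a seekable file object should only read the
# footer, which is where the schema (and hence the column names) lives.
with fs.open("my-bucket/path/to/file.parquet", "rb") as f:
    print(pq.ParquetFile(f).schema_arrow.names)
```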

Spark Exception Complex types not supported while loading parquet

霸气de小男生 submitted on 2019-12-20 04:41:16
Question: I am trying to load a Parquet file in Spark as a DataFrame:

val df = spark.read.parquet(path)

I am getting:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 12, 10.250.2.32): java.lang.UnsupportedOperationException: Complex types not supported.

While going through the code, I realized there is a check in Spark's VectorizedParquetRecordReader.java (initializeInternal):

Type t = requestedSchema
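The question is truncated here, but a commonly suggested workaround for this particular exception (not necessarily the accepted answer to this post) is to disable the vectorized Parquet reader, since the failing check lives in VectorizedParquetRecordReader. A hedged PySpark sketch with a placeholder path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Fall back to the non-vectorized Parquet reader, which does not hit the
# "Complex types not supported" check (at some cost in scan performance).
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
df = spark.read.parquet("/path/to/parquet")
df.printSchema()
```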
