parquet

How to avoid reading old files from S3 when appending new data?

心不动则不痛 submitted on 2019-12-01 14:04:53
Every 2 hours, a Spark job runs to convert some tgz files to Parquet. The job appends the new data to an existing Parquet dataset in S3:

df.write.mode("append").partitionBy("id","day").parquet("s3://myBucket/foo.parquet")

In the spark-submit output I can see that significant time is spent reading the old Parquet files, for example:

16/11/27 14:06:15 INFO S3NativeFileSystem: Opening 's3://myBucket/foo.parquet/id=123/day=2016-11-26/part-r-00003-b20752e9-5d70-43f5-b8b4-50b5b4d0c7da.snappy.parquet' for reading
16/11/27 14:06:15 INFO S3NativeFileSystem: Stream for key 'foo.parquet/id=123/day=2016-11-26
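A hedged sketch of one possible workaround (my own suggestion, not from the question, shown in PySpark): write each batch straight into its own partition directory so the job never has to list or open the existing dataset root. The input path and partition values below are illustrative assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("append-new-partition-only").getOrCreate()

# hypothetical new batch that belongs to a single, known partition
new_data = spark.read.json("s3://myBucket/incoming/")
id_val, day_val = 123, "2016-11-26"

(new_data.drop("id", "day")                      # partition values live in the path instead
    .write.mode("append")
    .parquet("s3://myBucket/foo.parquet/id={}/day={}/".format(id_val, day_val)))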

Query A Nested Array in Parquet Records

依然范特西╮ submitted on 2019-12-01 13:21:47
I am trying different ways to query a record within an array of records and display the complete row as output. I don't know which nested object has the string "pg", but I want to query on a particular object: whether the object has "pg" or not, and if "pg" exists, I want to display that complete row. How do I write a Spark SQL query on nested objects without specifying the object index? I don't want to use the index in children.name. My Avro record:

{ "name": "Parent", "type":"record", "fields":[ {"name": "firstname", "type": "string"}, { "name":"children", "type":{ "type": "array", "items":{ "name":"child",
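A hedged PySpark sketch (not the poster's code; the data path and field names are assumed to match the Avro schema above): filter rows where any element of the children array has name equal to "pg", without indexing into the array.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("nested-array-query").getOrCreate()
df = spark.read.parquet("parent.parquet")   # hypothetical Parquet written from the Avro records

# Spark 2.4+: a higher-order function over the array of structs
df.where(F.expr("exists(children, c -> c.name = 'pg')")).show(truncate=False)

# Older Spark versions: explode, filter, and join back on a parent key
matches = (df.select("firstname", F.explode("children").alias("child"))
             .where(F.col("child.name") == "pg")
             .select("firstname").distinct())
df.join(matches, "firstname").show(truncate=False)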

Read multiple parquet files in a folder and write to single csv file using python

怎甘沉沦 submitted on 2019-12-01 12:56:59
I am new to Python and I have a scenario where there are multiple Parquet files with file names in order, e.g. par_file1, par_file2, par_file3 and so on, up to 100 files in a folder. I need to read these Parquet files starting from file1, in order, and write them to a single CSV file. After writing the contents of file1, the contents of file2 should be appended to the same CSV without the header. Note that all files have the same column names and only the data is split across multiple files. I learnt to convert a single Parquet file to CSV using pyarrow with the following code:

import pandas as pd
df = pd.read_parquet('par_file
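A minimal sketch under assumed names (the folder, file pattern, and output path are illustrative): read the Parquet files in name order and append each one to a single CSV, writing the header only once.

import glob
import pandas as pd

# note: sorted() is lexicographic, so par_file10 sorts before par_file2;
# zero-pad the numbers or sort on the numeric suffix if that matters
files = sorted(glob.glob("parquet_folder/par_file*"))

for i, path in enumerate(files):
    df = pd.read_parquet(path)
    df.to_csv("combined.csv",
              mode="w" if i == 0 else "a",
              header=(i == 0),
              index=False)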

Read specific column from Parquet without using Spark

若如初见. submitted on 2019-12-01 12:45:27
I am trying to read Parquet files without using Apache Spark and I am able to do it, but I am finding it hard to read specific columns. I am not able to find any good resource on Google, as almost all the posts are about reading the Parquet file using Spark. Below is my code:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.avro.generic.GenericRecord
import org.apache.parquet.hadoop.ParquetReader
import org.apache.parquet.avro.AvroParquetReader

object parquetToJson{
  def main (args : Array[String]):Unit= {
    //case class Customer(key: Int, name: String, sellAmount: Double, profit: Double,
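A hedged alternative sketch, not the parquet-avro API used in the question: pyarrow can also read only selected columns without Spark. The file name and column names are assumptions based on the commented-out Customer case class.

import pyarrow.parquet as pq

# read just two columns instead of the whole file
table = pq.read_table("customer.parquet", columns=["key", "name"])
print(table.to_pandas())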

Achieve concurrency when saving to a partitioned parquet file

て烟熏妆下的殇ゞ submitted on 2019-12-01 10:49:20
When writing a dataframe to Parquet using partitionBy:

df.write.partitionBy("col1","col2","col3").parquet(path)

My expectation is that each partition being written would be handled independently, by a separate task, in parallel up to the number of workers assigned to the current Spark job. However, there is actually only one worker/task running at a time when writing to Parquet. That one worker cycles through each of the partitions and writes out the .parquet files serially. Why would this be the case, and is there a way to compel concurrency in this spark.write
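A hedged PySpark sketch (the column names come from the question; the input and output paths are assumptions): repartitioning on the partition columns before the write spreads the partitions across tasks, so several workers can write concurrently.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallel-partitioned-write").getOrCreate()
df = spark.read.parquet("s3://some-bucket/input/")      # hypothetical input

(df.repartition("col1", "col2", "col3")                 # separate tasks per partition combination
   .write
   .partitionBy("col1", "col2", "col3")
   .parquet("s3://some-bucket/output/"))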

Parquet-backed Hive table: array column not queryable in Impala

僤鯓⒐⒋嵵緔 submitted on 2019-12-01 09:17:32
Although Impala is much faster than Hive, we used Hive because it supports complex (nested) data types such as arrays and maps. I notice that Impala, as of CDH5.5, now supports complex data types. Since it's also possible to run Hive UDFs in Impala, we can probably do everything we want in Impala, but much, much faster. That's great news! As I scan through the documentation, I see that Impala expects data to be stored in Parquet format. My data, in its raw form, happens to be a two-column CSV where the first column is an ID and the second column is a pipe-delimited array of strings, e.g.:
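A hedged PySpark sketch under assumed file names and schema: turn the two-column CSV (an ID plus a pipe-delimited string) into a real array<string> column and write it out as Parquet, which is the shape Impala expects for complex types.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("csv-to-parquet-array").getOrCreate()

raw = spark.read.csv("raw_data.csv").toDF("id", "tags_raw")       # assumed two-column layout
with_array = (raw.withColumn("tags", F.split("tags_raw", r"\|"))  # split the pipe-delimited string
                 .drop("tags_raw"))
with_array.write.parquet("warehouse/my_table/")                   # assumed output location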

How to change the location of _spark_metadata directory?

风流意气都作罢 submitted on 2019-12-01 08:49:12
I am using a Spark Structured Streaming query to write Parquet files to S3 using the following code:

ds.writeStream().format("parquet").outputMode(OutputMode.Append())
  .option("queryName", "myStreamingQuery")
  .option("checkpointLocation", "s3a://my-kafka-offset-bucket-name/")
  .option("path", "s3a://my-data-output-bucket-name/")
  .partitionBy("createdat")
  .start();

I get the desired output in the S3 bucket my-data-output-bucket-name, but along with the output I get the _spark_metadata folder in it. How do I get rid of it? If I can't get rid of it, how do I change its location to a
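For reference, a hedged PySpark rendering of the query above (the Kafka source settings and the createdat derivation are assumptions; the bucket names are copied from the question). The Parquet file sink keeps its commit log in _spark_metadata directly under the configured output path, alongside the data it writes.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-to-parquet").getOrCreate()

# assumed source; requires the spark-sql-kafka package
ds = (spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "my-topic")
        .load()
        .selectExpr("CAST(value AS STRING) AS value",
                    "CAST(timestamp AS DATE) AS createdat"))

(ds.writeStream.format("parquet")
   .outputMode("append")
   .queryName("myStreamingQuery")
   .option("checkpointLocation", "s3a://my-kafka-offset-bucket-name/")
   .option("path", "s3a://my-data-output-bucket-name/")
   .partitionBy("createdat")
   .start())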

Spark 2.3.0, Parquet 1.8.2 - statistics for a binary field doesn't exist in resulting file from Spark write?

拟墨画扇 submitted on 2019-12-01 07:24:39
On the Spark master branch, I tried to write a single column with values "a", "b", "c" to a Parquet file f1:

scala> List("a", "b", "c").toDF("field1").coalesce(1).write.parquet("f1")

But the saved file does not have statistics (min, max):

$ ls f1/*.parquet
f1/part-00000-445036f9-7a40-4333-8405-8451faa44319-c000.snappy.parquet
$ parquet-tools meta f1/*.parquet
file: file:/Users/stephen/p/spark/f1/part-00000-445036f9-7a40-4333-8405-8451faa44319-c000.snappy.parquet
creator: parquet-mr version 1.8.2 (build
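A hedged way to check the same thing from Python (pyarrow rather than the parquet-tools CLI; the file name is the one from the question): look at the column-chunk metadata and see whether min/max statistics were written.

import pyarrow.parquet as pq

meta = pq.ParquetFile(
    "f1/part-00000-445036f9-7a40-4333-8405-8451faa44319-c000.snappy.parquet"
).metadata

col = meta.row_group(0).column(0)            # first (and only) column chunk
print(col.path_in_schema, col.statistics)    # statistics is None if none were written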