parquet

Does Spark maintain parquet partitioning on read?

Submitted on 2019-12-18 15:28:41
Question: I am having a lot of trouble finding the answer to this question. Let's say I write a dataframe to parquet and I use repartition combined with partitionBy to get a nicely partitioned parquet file. See below:
df.repartition(col("DATE")).write.partitionBy("DATE").parquet("/path/to/parquet/file")
Later on I would like to read the parquet file back, so I do something like this:
val df = spark.read.parquet("/path/to/parquet/file")
Is the dataframe partitioned by "DATE"? In other words, if a parquet
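As a hedged illustration (in PySpark rather than the question's Scala): the write side mirrors the snippet above, and the read side shows that partition discovery brings DATE back as a column while the number of in-memory partitions is chosen by the reader rather than preserved from the write. The source path and column name are placeholders.

```python
# A PySpark sketch of the scenario in the question; paths and column names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("parquet-partitioning").getOrCreate()

df = spark.read.csv("/path/to/source.csv", header=True)  # hypothetical source data

# Repartition in memory by DATE, then write one directory per DATE value.
(df.repartition(col("DATE"))
   .write
   .partitionBy("DATE")
   .parquet("/path/to/parquet/file"))

# Reading it back: partition discovery adds DATE as a column, but the number of
# in-memory partitions is determined by the reader (file sizes, split settings),
# not by the repartitioning that was done before the write.
df2 = spark.read.parquet("/path/to/parquet/file")
print(df2.rdd.getNumPartitions())
```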

How do I read a Parquet in R and convert it to an R DataFrame?

Submitted on 2019-12-18 11:01:36
Question: I'd like to process Apache Parquet files (in my case, generated in Spark) in the R programming language. Is an R reader available? Or is work being done on one? If not, what would be the most expedient way to get there? Note: there are Java and C++ bindings: https://github.com/apache/parquet-mr
Answer 1: You can use the arrow package for this. It is the same thing as pyarrow in Python, but nowadays it also comes packaged for R without the need for Python. As it is not yet available on CRAN, you
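The answer is truncated above; as an illustration of the pyarrow parallel it draws, here is a minimal Python sketch. In R, the analogous call exposed by the arrow package is arrow::read_parquet("data.parquet"). The file name is a placeholder.

```python
# Minimal pyarrow read; "data.parquet" is a placeholder path.
import pyarrow.parquet as pq

table = pq.read_table("data.parquet")   # Arrow Table
df = table.to_pandas()                  # convert to a pandas data frame
print(df.head())
```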

dask dataframe read parquet schema difference

Submitted on 2019-12-18 07:09:28
Question: I do the following:
import dask.dataframe as dd
from dask.distributed import Client
client = Client()
raw_data_df = dd.read_csv('dataset/nyctaxi/nyctaxi/*.csv', assume_missing=True, parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])
The dataset is taken from a presentation Matthew Rocklin gave and was used as a dask dataframe demo. Then I try to write it to parquet using pyarrow:
raw_data_df.to_parquet(path='dataset/parquet/2015.parquet/') # only pyarrow is installed
Trying to
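For reference, a runnable sketch of the write step described above, assuming pyarrow is the only Parquet engine installed; the paths mirror the question and are placeholders.

```python
# Read the CSVs with dask and write them out as Parquet via pyarrow.
import dask.dataframe as dd

raw_data_df = dd.read_csv(
    "dataset/nyctaxi/nyctaxi/*.csv",
    assume_missing=True,
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
)

# Being explicit about the engine avoids ambiguity when several Parquet backends
# are available; the schema written is inferred from the dask dtypes.
raw_data_df.to_parquet("dataset/parquet/2015.parquet/", engine="pyarrow")
```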

How to Generate Parquet File Using Pure Java (Including Date & Decimal Types) And Upload to S3 [Windows] (No HDFS)

Submitted on 2019-12-18 04:23:24
Question: I recently had a requirement to generate Parquet files that could be read by Apache Spark using only Java (with no additional software installations such as Apache Drill, Hive, Spark, etc.). The files needed to be saved to S3, so I will be sharing details on how to do both. There were no simple-to-follow guides on how to do this. I'm also not a Java programmer, so the concepts of using Maven, Hadoop, etc. were all foreign to me. It took me nearly two weeks to get this

Writing RDD partitions to individual parquet files in its own directory

Submitted on 2019-12-18 03:36:09
Question: I am struggling with a step where I want to write each RDD partition to a separate parquet file in its own directory. An example would be:
<root>
  <entity=entity1>
    <year=2015>
      <week=45>
        data_file.parquet
The advantage of this layout is that I can use these values directly in Spark SQL as columns and will not have to repeat the data in the actual file. It would be a good way to get to a specific partition without storing separate partitioning metadata somewhere else. As a preceding step I have all the data loaded
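The rest of the question is cut off above; as a hedged sketch of one way to obtain that directory layout (using the DataFrame writer's partitionBy rather than raw RDD partitions), assuming the data can be converted to a DataFrame first. Column names and the output path are illustrative.

```python
# A PySpark sketch producing the entity=/year=/week= layout shown in the question.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rows = [("entity1", 2015, 45, "some-payload")]          # illustrative data
df = spark.createDataFrame(rows, ["entity", "year", "week", "value"])

# partitionBy writes one directory per distinct (entity, year, week) combination,
# and those columns are not repeated inside the data files.
(df.write
   .partitionBy("entity", "year", "week")
   .parquet("/path/to/root"))
# -> /path/to/root/entity=entity1/year=2015/week=45/part-....parquet
```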

Hive doesn't read partitioned parquet files generated by Spark

Submitted on 2019-12-18 01:15:14
Question: I'm having a problem reading partitioned parquet files generated by Spark in Hive. I'm able to create the external table in Hive, but when I try to select a few rows, Hive returns only an "OK" message with no rows. I'm able to read the partitioned parquet files correctly in Spark, so I'm assuming they were generated correctly. I'm also able to read these files when I create an external table in Hive without partitioning. Does anyone have a suggestion? My environment is: cluster EMR 4.1.0
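No answer from the thread is shown here, so the following is only a hedged guess at the usual culprit: Hive returns rows only for partitions registered in the metastore, and directories written by Spark's partitionBy are not registered automatically. A minimal sketch with placeholder database and table names follows; the same MSCK REPAIR TABLE statement can equally be run in the Hive CLI instead of through Spark.

```python
# Hedged sketch, not the thread's answer: register the Spark-written partitions
# with the Hive metastore so the external table stops returning zero rows.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("register-partitions")
         .enableHiveSupport()
         .getOrCreate())

# Spark wrote directories like .../my_table/date=2015-10-25/ ; the external table
# exists but shows no rows until those partitions are added to the metastore.
spark.sql("MSCK REPAIR TABLE mydb.my_table")       # placeholder names
spark.sql("SELECT COUNT(*) FROM mydb.my_table").show()
```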

Write parquet from AWS Kinesis firehose to AWS S3

Submitted on 2019-12-17 22:35:39
Question: I would like to ingest data into S3 from Kinesis Firehose formatted as Parquet. So far I have only found a solution that involves creating an EMR cluster, but I am looking for something cheaper and faster, like storing the received JSON as Parquet directly from Firehose or using a Lambda function. Thank you very much, Javi.
Answer 1: Good news, this feature was released today! Amazon Kinesis Data Firehose can convert the format of your input data from JSON to Apache Parquet or Apache ORC before storing the data
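The answer above is cut off; as a hedged illustration of the record format conversion feature it mentions, here is a boto3 sketch that creates a delivery stream with JSON-to-Parquet conversion enabled. All names, ARNs, and the Glue database/table are placeholders, and the conversion requires a matching schema in the AWS Glue Data Catalog.

```python
# Hedged boto3 sketch: Firehose delivery stream with JSON -> Parquet conversion.
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

firehose.create_delivery_stream(
    DeliveryStreamName="json-to-parquet-stream",          # placeholder
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",   # placeholder
        "BucketARN": "arn:aws:s3:::my-parquet-bucket",               # placeholder
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            "SchemaConfiguration": {
                "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
                "DatabaseName": "my_glue_db",    # placeholder Glue database
                "TableName": "my_table",         # placeholder Glue table
                "Region": "us-east-1",
            },
        },
    },
)
```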

Schema evolution in parquet format

Submitted on 2019-12-17 17:49:06
Question: Currently we are using the Avro data format in production. Among its several good points, we know that Avro is good at schema evolution. Now we are evaluating the Parquet format because of its efficiency when reading a subset of columns. Before moving forward, our concern is still schema evolution. Does anyone know whether schema evolution is possible in Parquet? If yes, how is it possible; if no, why not? Some resources claim that it is possible, but that columns can only be added at the end. What does this
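Not from the thread: a small PySpark sketch of the "add columns" style of evolution the question alludes to, reading Parquet files written with different column sets via schema merging. Paths are placeholders.

```python
# PySpark sketch: files written with an older schema and a widened schema,
# read back together with schema merging.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Older files: (id, name); newer files add an extra column (id, name, score).
spark.createDataFrame([(1, "a")], ["id", "name"]) \
     .write.mode("overwrite").parquet("/tmp/evolve/old")
spark.createDataFrame([(2, "b", 0.5)], ["id", "name", "score"]) \
     .write.mode("overwrite").parquet("/tmp/evolve/new")

# mergeSchema reconciles the footers; rows from the old files get score = null.
merged = (spark.read
          .option("mergeSchema", "true")
          .parquet("/tmp/evolve/old", "/tmp/evolve/new"))
merged.printSchema()
```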

How to read a Parquet file into Pandas DataFrame?

Submitted on 2019-12-17 17:38:46
Question: How do I read a modestly sized Parquet dataset into an in-memory pandas DataFrame without setting up cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in memory with a simple Python script on a laptop. The data does not reside on HDFS; it is either on the local file system or possibly in S3. I do not want to spin up and configure other services like Hadoop, Hive or Spark. I thought Blaze/Odo would have made this
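A minimal sketch of exactly what the question asks for, assuming a reasonably recent pandas with pyarrow (or fastparquet) installed; the file path is a placeholder.

```python
# Read a local Parquet file straight into pandas, no cluster involved.
import pandas as pd

df = pd.read_parquet("data.parquet", engine="pyarrow")
print(df.dtypes)
print(df.head())
```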

Reading DataFrame from partitioned parquet file

Submitted on 2019-12-17 15:39:11
Question: How do I read partitioned parquet with a condition into a dataframe? This works fine:
val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=25/*")
Partitions exist for day=1 to day=30. Is it possible to read something like (day = 5 to 6) or day=5,day=6?
val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=??/*")
If I put * it gives me all 30 days
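The thread's answers are not shown here; as a hedged sketch (in PySpark rather than the question's Scala), two common ways to restrict the read to day=5 and day=6. The base path mirrors the question and is a placeholder.

```python
# PySpark sketch: reading only two day= partitions out of a partitioned layout.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
base = "file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10"

# Option 1: enumerate the partition directories explicitly; basePath keeps the
# partition column `day` visible in the resulting dataframe.
df_days = (spark.read
           .option("basePath", base)
           .parquet(base + "/day=5", base + "/day=6"))

# Option 2: read the whole month and filter on the partition column; Spark prunes
# the untouched day= directories instead of scanning all 30.
df_range = (spark.read
            .parquet(base)
            .filter((col("day") >= 5) & (col("day") <= 6)))
```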