parquet

Spark partitionBy much slower than without it

Submitted by 房东的猫 on 2019-12-04 13:40:32
Question: I tested writing with: df.write.partitionBy("id", "name").mode(SaveMode.Append).parquet(filePath) However, if I leave out the partitioning: df.write.mode(SaveMode.Append).parquet(filePath) it executes 100x(!) faster. Is it normal for the same amount of data to take 100x longer to write when partitioning? There are 10 unique id values and 3000 unique name values. The DataFrame has 10 additional integer columns. Answer 1: The first code snippet will write a parquet file per partition to
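The slowdown typically comes from each task opening a writer for every partition value it encounters, producing thousands of small files. Below is a minimal sketch, not taken from the question, of one common mitigation: repartitioning by the partition columns before the write so each task only touches a few output directories. The paths and session setup are placeholders.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.col

object PartitionedWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("partitioned-write-sketch")
      .getOrCreate()

    // Hypothetical input path; replace with your own source.
    val df = spark.read.parquet("/tmp/input")

    // Cluster rows by the partition columns first, so each task writes to
    // only a handful of (id, name) directories instead of opening a writer
    // for every combination it happens to see.
    df.repartition(col("id"), col("name"))
      .write
      .partitionBy("id", "name")
      .mode(SaveMode.Append)
      .parquet("/tmp/output")

    spark.stop()
  }
}
```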

Is gzipped Parquet file splittable in HDFS for Spark?

Submitted by 假装没事ソ on 2019-12-04 11:32:57
Question: I get confusing messages when searching and reading answers on the internet on this subject. Can anyone share their experience? I know for a fact that gzipped CSV is not, but maybe the internal file structure of Parquet is such that it is a totally different case for Parquet vs. CSV? Answer 1: Parquet files with GZIP compression are actually splittable. This is because of the internal layout of Parquet files. These are always splittable, independent of the compression algorithm used. This fact is
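For illustration, here is a hedged sketch (paths and session setup are made up) of writing a GZIP-compressed Parquet dataset with Spark and reading it back; the point is that the compression codec is applied inside the column chunks, so it does not change how the files can be split on read.

```scala
import org.apache.spark.sql.SparkSession

object GzipParquetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("gzip-parquet-sketch")
      .getOrCreate()

    val df = spark.range(0, 10000000).toDF("id")

    // Compression in Parquet is applied per page inside each column chunk,
    // so a GZIP-compressed Parquet file can still be split at row-group
    // boundaries when it is read back.
    df.write
      .option("compression", "gzip")
      .parquet("/tmp/gzip_parquet")

    // Each row group can be assigned to a separate task on read.
    val readBack = spark.read.parquet("/tmp/gzip_parquet")
    println(readBack.rdd.getNumPartitions)

    spark.stop()
  }
}
```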

Unable to infer schema when loading Parquet file

Submitted by 筅森魡賤 on 2019-12-04 09:59:46
Question: response = "mi_or_chd_5" outcome = sqlc.sql("""select eid,{response} as response from outcomes where {response} IS NOT NULL""".format(response=response)) outcome.write.parquet(response, mode="overwrite") # Success print outcome.schema StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true))) But then: outcome2 = sqlc.read.parquet(response) # fail fails with: AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;' in /usr/local
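One workaround sometimes used for this kind of error is to bypass schema inference entirely by supplying the schema when reading. The sketch below is an illustrative Scala equivalent (the original question uses PySpark); the path and session setup are assumptions, and it does not address whatever left the output directory without readable footers in the first place.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, ShortType, StructField, StructType}

object ExplicitSchemaReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("explicit-schema-read")
      .getOrCreate()

    // Schema matching the one printed by outcome.schema in the question.
    val schema = StructType(Seq(
      StructField("eid", IntegerType, nullable = true),
      StructField("response", ShortType, nullable = true)
    ))

    // Supplying the schema explicitly sidesteps footer-based inference,
    // which is what fails when Spark finds no readable Parquet footers.
    val outcome2 = spark.read.schema(schema).parquet("mi_or_chd_5")
    outcome2.printSchema()

    spark.stop()
  }
}
```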

Convert JSON to Parquet

Submitted by 戏子无情 on 2019-12-04 09:46:28
Question: I have a few TB of log data in JSON format, and I want to convert it to Parquet format to get better performance in the analytics stage. I've managed to do this by writing a MapReduce Java job that uses parquet-mr and parquet-avro. The only thing I'm not satisfied with is that my JSON logs don't have a fixed schema; I don't know all the fields' names and types. Besides, even if I knew all the fields' names and types, my schema evolves as time goes on, for example, there will be new fields added
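As a point of comparison with the MapReduce approach in the question, here is a hedged Spark sketch (paths are hypothetical) that infers the JSON schema at read time and relies on Parquet's mergeSchema option to reconcile files written with evolving schemas.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object JsonToParquetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("json-to-parquet-sketch")
      .getOrCreate()

    // Spark infers a schema by scanning the JSON records, so no fixed
    // schema has to be declared up front.
    val logs = spark.read.json("/data/logs/2019-12-04/*.json")

    logs.write
      .mode(SaveMode.Append)
      .parquet("/data/logs_parquet")

    // Later batches may carry new fields; mergeSchema reconciles the
    // footers of all Parquet files into one combined schema on read.
    val merged = spark.read
      .option("mergeSchema", "true")
      .parquet("/data/logs_parquet")
    merged.printSchema()

    spark.stop()
  }
}
```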

How to write TIMESTAMP logical type (INT96) to parquet, using ParquetWriter?

Submitted by 随声附和 on 2019-12-04 09:23:18
I have a tool that uses an org.apache.parquet.hadoop.ParquetWriter to convert CSV data files to Parquet data files. Currently, it only handles int32, double, and string. I need to support the Parquet TIMESTAMP logical type (annotated as int96), and I am lost on how to do that because I can't find a precise specification online. It appears this timestamp encoding (int96) is rare and not well supported. I've found very little specification detail online. This GitHub README states that: Timestamps saved as an int96 are made up of the nanoseconds in the day (first 8 bytes) and the Julian day (last 4 bytes).
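Based on that description, a minimal sketch of the encoding might look like the following. The helper name is hypothetical, time zones are ignored, and it assumes the Impala-style layout of a little-endian nanoseconds-of-day followed by a little-endian Julian day number.

```scala
import java.nio.{ByteBuffer, ByteOrder}
import java.time.LocalDateTime
import java.time.temporal.JulianFields

import org.apache.parquet.io.api.Binary

object Int96TimestampSketch {
  /** Encode a timestamp in the 12-byte INT96 layout: nanoseconds within the
    * day (8 bytes, little-endian) followed by the Julian day number
    * (4 bytes, little-endian).
    */
  def toInt96(ts: LocalDateTime): Binary = {
    val nanosOfDay = ts.toLocalTime.toNanoOfDay
    val julianDay = ts.toLocalDate.getLong(JulianFields.JULIAN_DAY).toInt

    val buf = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN)
    buf.putLong(nanosOfDay)
    buf.putInt(julianDay)
    Binary.fromConstantByteArray(buf.array())
  }

  def main(args: Array[String]): Unit = {
    // The resulting Binary would be passed to RecordConsumer.addBinary for a
    // column declared as INT96 in the Parquet schema.
    println(toInt96(LocalDateTime.now()).length()) // 12
  }
}
```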

Unable to get parquet-tools working from the command-line

Submitted by 不羁岁月 on 2019-12-04 09:05:57
I'm attempting to get the newest version of parquet-tools running, but I'm having some issues. For some reason org.apache.hadoop.conf.Configuration isn't in the shaded jar. (I have the same issue with v1.6.0 as well). Is there something beyond mvn package or mvn install that I should be doing? (The actual mvn invocation I'm using is mvn install -DskipTests -pl \!parquet-thrift,\!parquet-cascading,\!parquet-pig-bundle,\!parquet-pig,\!parquet-scrooge,\!parquet-hive,\!parquet-protobuf ). This works just fine, and the tests pass if I choose to run them. The error I get is below (You can see I've

Is it possible to read and write Parquet using Java without a dependency on Hadoop and HDFS?

Submitted by 给你一囗甜甜゛ on 2019-12-04 08:48:11
I've been hunting around for a solution to this question. It appears to me that there is no way to embed reading and writing the Parquet format in a Java program without pulling in dependencies on HDFS and Hadoop. Is this correct? I want to read and write on a client machine, outside of a Hadoop cluster. I started to get excited about Apache Drill, but it appears that it must run as a separate process. What I need is an in-process ability to read and write a file using the Parquet format. Krishas: You can write the Parquet format outside a Hadoop cluster using the Java Parquet client API. Here is a sample
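A hedged sketch of what such a sample might look like using parquet-avro's AvroParquetWriter against a local file: the schema and path are made up, and hadoop-common is still needed as a library dependency for Path and Configuration even though no cluster or HDFS is involved.

```scala
import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.metadata.CompressionCodecName

object LocalParquetWriteSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical two-column Avro schema.
    val schema: Schema = SchemaBuilder.record("event")
      .fields()
      .requiredString("name")
      .requiredInt("count")
      .endRecord()

    // A plain local path: no HDFS and no running cluster.
    val writer = AvroParquetWriter
      .builder[GenericRecord](new Path("file:///tmp/events.parquet"))
      .withSchema(schema)
      .withCompressionCodec(CompressionCodecName.SNAPPY)
      .build()

    try {
      val record = new GenericData.Record(schema)
      record.put("name", "login")
      record.put("count", 1)
      writer.write(record)
    } finally {
      writer.close()
    }
  }
}
```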

Analyzing the TPC-H dataset in CSV format with Data Lake Analytics + OSS

Submitted by 不想你离开。 on 2019-12-04 08:38:36
0. Introduction to Data Lake Analytics (DLA) For more background on the data lake concept, see: https://en.wikipedia.org/wiki/Data_lake as well as the AWS and Azure takes on data lakes: https://amazonaws-china.com/big-data/datalakes-and-analytics/what-is-a-data-lake/ https://azure.microsoft.com/en-us/solutions/data-lake/ Alibaba Cloud now also has its own data lake analytics product: https://www.aliyun.com/product/datalakeanalytics You can click through to apply for access (the public beta is currently invitation-only; applications are reviewed as quickly as possible) and follow this tutorial's analysis of the TPC-H dataset in CSV format. Product documentation: https://help.aliyun.com/product/70174.html 1. Activate the Data Lake Analytics and OSS services If you have already activated them, you can skip this step. If not, see https://help.aliyun.com/document_detail/70386.html to apply for service activation. 2. Download the TPC-H test dataset You can download TPC-H from

While reading a specific Parquet column, all columns are read instead of the single column given in Parquet-SQL

Submitted by 我是研究僧i on 2019-12-04 08:36:29
I have read in the Parquet documentation that only the column I query is read and processed. But when I look at the Spark UI, I find that the complete file is read. Following is the code that writes the Parquet file and reads it with Spark SQL. object ParquetFileCreator_simple { def datagenerate(schema: Schema, ind: Long): GenericRecord = { var data: GenericRecord = new GenericData.Record(schema) data.put("first", "Pg20 " + ind) data.put("appType", "WAP" + ind) data } def main(args: Array[String]): Unit = { val conf = new SparkConf().setMaster("local[2]").set("spark.app.name", "merger")
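One way to check whether projection pushdown is actually applied is to select only the needed column and inspect the ReadSchema shown in the physical plan. The sketch below is illustrative, not the question's code; the path and session setup are assumptions, and the column name "first" is taken from the schema above.

```scala
import org.apache.spark.sql.SparkSession

object ColumnPruningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("column-pruning-sketch")
      .getOrCreate()

    // Selecting a single column lets Spark push the projection down to the
    // Parquet reader, so only that column's chunks need to be decoded.
    val firstOnly = spark.read.parquet("/tmp/parquet_data").select("first")

    // The ReadSchema in the physical plan shows which columns are actually
    // requested from the Parquet files.
    firstOnly.explain()
    firstOnly.show(5)

    spark.stop()
  }
}
```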

Append new data to partitioned parquet files

Submitted by 我怕爱的太早我们不能终老 on 2019-12-04 08:07:14
Question: I am writing an ETL process where I will need to read hourly log files, partition the data, and save it. I am using Spark (in Databricks). The log files are CSV, so I read them, apply a schema, and then perform my transformations. My problem is: how can I save each hour's data in Parquet format but append to the existing data set? When saving, I need to partition by 4 columns present in the dataframe. Here is my save line: data .filter(validPartnerIds($"partnerID")) .write .partitionBy(
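A hedged sketch of one hourly batch under assumed column names and paths (the question's actual four partition columns are truncated above): the CSV is read with an explicit schema and the partitioned Parquet output is written in Append mode, which adds new files under existing partition directories without touching earlier hours.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object HourlyAppendSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hourly-append-sketch")
      .getOrCreate()

    // Read one hour of CSV logs with an explicit schema given as a DDL string.
    val hourly = spark.read
      .option("header", "true")
      .schema("partnerID STRING, eventTime TIMESTAMP, year INT, month INT, day INT, hour INT, payload STRING")
      .csv("/mnt/logs/2019/12/04/08/*.csv")

    // Append mode adds new files under the existing partition directories
    // (or creates new ones) without rewriting data from earlier hours.
    hourly.write
      .partitionBy("year", "month", "day", "hour")
      .mode(SaveMode.Append)
      .parquet("/mnt/warehouse/logs_parquet")

    spark.stop()
  }
}
```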