parquet

Spark Exception Complex types not supported while loading parquet

Submitted by 假装没事ソ on 2019-12-02 03:46:23
I am trying to load a Parquet file in Spark as a DataFrame: val df = spark.read.parquet(path). I am getting: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 12, 10.250.2.32): java.lang.UnsupportedOperationException: Complex types not supported. While going through the code, I noticed there is a check in Spark's VectorizedParquetRecordReader.java (initializeInternal): Type t = requestedSchema.getFields().get(i); if (!t.isPrimitive() || t.isRepetition(Type.Repetition.REPEATED)) { throw new …
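A minimal PySpark sketch of the workaround that is often suggested for this check: disable the vectorized Parquet reader (the component that rejects non-primitive columns) so Spark falls back to the row-based reader. The path below is a placeholder, and whether this trade-off is acceptable depends on your Spark version and performance needs.

```python
from pyspark.sql import SparkSession

# Build a session with the vectorized Parquet reader turned off, so the
# primitive-type check in VectorizedParquetRecordReader is never reached.
spark = (
    SparkSession.builder
    .appName("read-complex-parquet")
    .config("spark.sql.parquet.enableVectorizedReader", "false")
    .getOrCreate()
)

df = spark.read.parquet("/path/to/parquet")  # placeholder path
df.printSchema()
```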

A First Look at Hive

Submitted by 拥有回忆 on 2019-12-02 03:02:48
Because my work requires operating on data stored in Hive, I went through the basics of the Hive database (partly so I don't end up dropping the database and having to run for it). Hive data types. Integer types: TINYINT (1 byte), SMALLINT (2 bytes), INT (4 bytes), BIGINT (8 bytes). Boolean: BOOLEAN (TRUE/FALSE). Floating point: FLOAT (single precision), DOUBLE (double precision). String: STRING (no fixed length). Complex types: ARRAY, an ordered collection of fields that must all have the same type; MAP, an unordered set of key/value pairs; STRUCT, a group of named fields. For example: CREATE TABLE complex( col1 ARRAY<INT>, col2 MAP<STRING,INT>, col3 STRUCT<a:STRING,b:INT,c:DOUBLE> ). Hive storage formats. Hive stores data in two main layouts, row-oriented and column-oriented. Commonly used formats: 1. TEXTFILE, the default, is a row-oriented format. 2. ORCFile, supported by both Hive and Spark, splits the data into row-based blocks and stores each block column by column, keeping an index for every block; its compression ratio is very high. 3. Parquet, a columnar format with good compression that can also greatly reduce table-scan and deserialization time. How to declare the storage format in SQL: CREATE …
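To keep the code examples in one language, here is a hedged PySpark sketch of declaring a Parquet-backed table with a STORED AS clause; the table and column names are made up for illustration, and the same DDL can be run directly in the Hive CLI.

```python
from pyspark.sql import SparkSession

# Hive support is needed so the DDL goes through the Hive metastore.
spark = (
    SparkSession.builder
    .appName("hive-storage-format-demo")
    .enableHiveSupport()
    .getOrCreate()
)

# Declare the on-disk layout with STORED AS; swap PARQUET for ORC or TEXTFILE as needed.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo_events (
        user_id  BIGINT,
        tags     ARRAY<STRING>,
        props    MAP<STRING, INT>
    )
    STORED AS PARQUET
""")
```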

Is it possible to load parquet table directly from file?

Submitted by 牧云@^-^@ on 2019-12-02 01:28:35
If I have a binary data file (it can be converted to CSV format), is there any way to load a Parquet table directly from it? Many tutorials show loading a CSV file into a text table, and then from the text table into a Parquet table. From an efficiency point of view, is it possible to load a Parquet table directly from a binary file like the one I already have, ideally using the CREATE EXTERNAL TABLE command? Or do I need to convert it to a CSV file first? Is there any file format restriction? Unfortunately it is not possible to read from a custom binary format in Impala. You should convert your files to CSV, then …
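As an alternative to the text-table-then-Parquet-table round trip, one option (assuming you can already get the binary data into CSV, as the question states) is to convert the CSV straight to Parquet with pyarrow and point an external table at the resulting directory. A hedged sketch; the paths are placeholders.

```python
import pyarrow.csv as pv
import pyarrow.parquet as pq

# Read the intermediate CSV produced from the binary file (placeholder path).
table = pv.read_csv("/tmp/converted/data.csv")

# Write it out as Parquet in a directory an external table can point at.
pq.write_table(table, "/tmp/parquet_table/data.parquet")

# The external table DDL (run in Impala or Hive) would then reference
# the '/tmp/parquet_table' location with STORED AS PARQUET.
```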

Parquet Data timestamp columns INT96 not yet implemented in Druid Overlord Hadoop task

Submitted by 我的未来我决定 on 2019-12-02 00:58:09
Context: I am able to submit a MapReduce job from the Druid overlord to an EMR cluster. My data source is in S3 in Parquet format. I have a timestamp column (INT96) in the Parquet data which is not supported by the Avro schema. The error occurs while parsing the timestamp. The stack trace is: Error: java.lang.IllegalArgumentException: INT96 not yet implemented. at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:279) at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:264) at org.apache.parquet.schema.PrimitiveType$PrimitiveTypeName$7.convert …
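If rewriting the source data is an option, one commonly used way to avoid INT96 altogether is to have Spark (2.3+) write timestamps as INT64, which the Avro converter can handle. A hedged PySpark sketch with placeholder paths; this is a workaround, not a Druid-side fix.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("rewrite-int96-timestamps")
    # Store timestamps as INT64 (millisecond precision) instead of the legacy INT96.
    .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
    .getOrCreate()
)

df = spark.read.parquet("s3a://bucket/input/")                     # placeholder input path
df.write.mode("overwrite").parquet("s3a://bucket/output_int64/")   # placeholder output path
```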

Notes on Using Parquet

Submitted by |▌冷眼眸甩不掉的悲伤 on 2019-12-01 22:54:17
1. Error: Exception in thread "main" org.apache.parquet.column.statistics.StatisticsClassException: Statistics comparator mismatched: SIGNED_INT32_COMPARATOR vs. SIGNED_INT32_COMPARATOR. The full error message is as follows: Exception in thread "main" org.apache.parquet.column.statistics.StatisticsClassException: Statistics comparator mismatched: SIGNED_INT32_COMPARATOR vs. SIGNED_INT32_COMPARATOR at org.apache.parquet.column.statistics.StatisticsClassException.create(StatisticsClassException.java:42) at org.apache.parquet.column.statistics.Statistics.mergeStatistics(Statistics.java:327) at org.apache.parquet.hadoop …
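This exception is raised while merging column statistics from different files, which usually suggests the inputs were written by different Parquet writer versions. A small diagnostic sketch (not a fix) that uses pyarrow to print each file's writer string and per-column statistics so mismatched inputs can be spotted before merging; the file list is a placeholder.

```python
import pyarrow.parquet as pq

# Placeholder list of the files being merged.
files = ["/data/part-0.parquet", "/data/part-1.parquet"]

for path in files:
    meta = pq.ParquetFile(path).metadata
    print(path, "written by:", meta.created_by)
    rg = meta.row_group(0)
    for i in range(rg.num_columns):
        col = rg.column(i)
        # Statistics may be None if the writer did not record them.
        print("  ", col.path_in_schema, col.physical_type, col.statistics)
```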

PySpark: org.apache.spark.sql.AnalysisException: Attribute name … contains invalid character(s) among “ ,;{}()\\n\\t=”. Please use alias to rename it [duplicate]

Submitted by 雨燕双飞 on 2019-12-01 21:28:19
This question already has an answer here: Spark Dataframe validating column names for parquet writes (scala) (4 answers). I'm trying to load Parquet data into PySpark, where a column has a space in its name: df = spark.read.parquet('my_parquet_dump') df.select(df['Foo Bar'].alias('foobar')) Even though I have aliased the column, I'm still getting this error, propagated from the JVM side of PySpark. I've attached the stack trace below. Is there a way I can load this Parquet file into PySpark without pre-processing the data in Scala and without modifying the source Parquet file?
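One workaround that stays in Python (a hedged sketch, not the accepted answer from the linked duplicate) is to read the file with pyarrow, rename the offending columns there, and only then hand the data to Spark; the path and replacement rule are placeholders.

```python
import pyarrow.parquet as pq
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rename-bad-columns").getOrCreate()

# Read with pyarrow, which does not enforce Spark's Parquet column-name rules.
table = pq.read_table("my_parquet_dump")

# Replace spaces (and any other rejected characters) in every column name.
table = table.rename_columns([name.replace(" ", "_") for name in table.column_names])

# Convert to pandas and then to a Spark DataFrame with the cleaned names.
df = spark.createDataFrame(table.to_pandas())
df.printSchema()
```

For data that does not fit in driver memory, writing the renamed table back out with pq.write_table and reading the cleaned copy in Spark avoids the pandas round trip.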

Spark Exception when converting a MySQL table to parquet

Submitted by 流过昼夜 on 2019-12-01 18:40:13
I'm trying to convert a remote MySQL table to a Parquet file using Spark 1.6.2. The process runs for 10 minutes, filling up memory, then starts printing these messages: WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@dac44da,BlockManagerId(driver, localhost, 46158))] in 1 attempts org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval. At the end it fails with this error: ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriverActorSystem-scheduler-1] shutting …
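The memory blow-up typically comes from pulling the whole MySQL table through a single JDBC partition. A hedged PySpark sketch of the usual mitigation: partition the JDBC read on a numeric column and, if needed, raise the heartbeat and network timeouts. The connection details and column bounds are placeholders, and the sketch uses the Spark 2.x SparkSession API; in 1.6 the same JDBC options are available via sqlContext.read.format("jdbc").

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("mysql-to-parquet")
    .config("spark.executor.heartbeatInterval", "60s")  # default is 10s
    .config("spark.network.timeout", "600s")            # must stay larger than the heartbeat interval
    .getOrCreate()
)

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/mydb")  # placeholder connection string
    .option("dbtable", "big_table")                   # placeholder table
    .option("user", "reader")
    .option("password", "secret")
    # Split the read into parallel partitions instead of one giant fetch.
    .option("partitionColumn", "id")
    .option("lowerBound", "1")
    .option("upperBound", "10000000")
    .option("numPartitions", "32")
    .load()
)

df.write.mode("overwrite").parquet("/data/big_table_parquet")  # placeholder output path
```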

How to write Parquet metadata with pyarrow?

Submitted by 回眸只為那壹抹淺笑 on 2019-12-01 16:40:28
I use pyarrow to create and analyse Parquet tables with biological information, and I need to store some metadata, e.g. which sample the data comes from and how it was obtained and processed. Parquet seems to support file-wide metadata, but I cannot find how to write it via pyarrow. The closest thing I could find is how to write row-group metadata, but this seems like overkill, since my metadata is the same for all row groups in the file. Is there any way to write file-wide Parquet metadata with pyarrow? Source: https://stackoverflow.com/questions/52122674/how-to-write-parquet-metadata-with
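A hedged sketch of the approach that is usually suggested: attach key/value metadata to the table's schema before writing, since schema-level metadata is stored in the Parquet file footer. The keys and values below are illustrative.

```python
import json
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"gene": ["BRCA1", "TP53"], "count": [12, 7]})

# File-wide metadata lives in the schema's key/value metadata; keys and values are bytes.
custom_meta = {
    b"sample_id": b"S-042",
    b"processing": json.dumps({"pipeline": "v1.3"}).encode("utf-8"),
}

# Merge with any metadata pyarrow already put on the schema (e.g. the pandas block).
merged = {**(table.schema.metadata or {}), **custom_meta}
table = table.replace_schema_metadata(merged)

pq.write_table(table, "annotated.parquet")

# Reading it back: the metadata is available on the file's schema.
print(pq.read_table("annotated.parquet").schema.metadata[b"sample_id"])
```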

Spark SQL unable to complete writing Parquet data with a large number of shards

Submitted by 天涯浪子 on 2019-12-01 16:03:42
I am trying to use Apache Spark SQL to ETL JSON log data in S3 into Parquet files, also on S3. My code is basically: import org.apache.spark._ val sqlContext = new sql.SQLContext(sc) val data = sqlContext.jsonFile("s3n://...", 10e-6) data.saveAsParquetFile("s3n://...") This code works when I have up to 2000 partitions and fails for 5000 or more, regardless of the volume of data. Normally one could just coalesce the partitions to an acceptable number, but this is a very large data set and at 2000 partitions I hit the problem described in this question. 14/10/10 00:34:32 INFO scheduler.DAGScheduler: …
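For reference, a hedged sketch in present-day PySpark of the "coalesce before writing" approach mentioned above, which caps the number of output shards regardless of how many partitions the JSON read produced; the paths and the target partition count are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

data = spark.read.json("s3a://bucket/logs/")  # placeholder input prefix

# coalesce() narrows the partition count without a full shuffle;
# use repartition() instead if the partitions also need rebalancing.
(data.coalesce(2000)
     .write.mode("overwrite")
     .parquet("s3a://bucket/logs-parquet/"))  # placeholder output prefix
```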