parquet

Are parquet files created with pyarrow vs pyspark compatible?

Submitted by 点点圈 on 2020-02-25 06:03:39
Question: I have to convert analytics data from JSON to Parquet in two steps. For the large amount of existing data I am writing a PySpark job and doing df.repartition(*partitionby).write.partitionBy(partitionby).mode("append").parquet(output, compression=codec). For the incremental data, however, I plan to use AWS Lambda. PySpark would probably be overkill for it, so I plan to use PyArrow instead (I am aware that it unnecessarily involves Pandas, but I couldn't find a better alternative). So,
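
The question is cut off above; as a rough sketch of what the PyArrow side of such a pipeline could look like (the input path, output directory, and partition column below are illustrative assumptions, not taken from the question):

```python
import json

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical incremental batch of JSON events, one JSON object per line.
with open("events.json") as f:                      # assumed input path
    records = [json.loads(line) for line in f]

# Go through Pandas, as the question mentions, to build an Arrow table.
table = pa.Table.from_pandas(pd.DataFrame(records))

# Write a partitioned Parquet dataset, roughly mirroring the PySpark
# partitionBy(...) call above; "event_date" is a made-up partition column.
# PyArrow's default Parquet compression is Snappy.
pq.write_to_dataset(
    table,
    root_path="output_parquet",                     # assumed output directory
    partition_cols=["event_date"],
)
```

As for the compatibility asked about in the title, the main practical concern is usually keeping the schema (timestamp types in particular) and compression codec consistent between the two writers.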

S3 Implementation for org.apache.parquet.io.InputFile?

Submitted by 自作多情 on 2020-02-24 12:07:55
Question: I am trying to write a Scala-based AWS Lambda to read Snappy-compressed Parquet files stored in S3. The process will write them back out as partitioned JSON files. I have been trying to use the org.apache.parquet.hadoop.ParquetFileReader class to read the files... the non-deprecated way to do this appears to be to pass it an implementation of the org.apache.parquet.io.InputFile interface. There is one for Hadoop (HadoopInputFile)... but I cannot find one for S3. I also tried some of the deprecated

Read local Parquet file without Hadoop Path API

Submitted by 与世无争的帅哥 on 2020-02-23 04:15:34
Question: I'm trying to read a local Parquet file; however, the only APIs I can find are tightly coupled to Hadoop and require a Hadoop Path as input (even for pointing to a local file). This has been asked several times, but quite long ago, and all answers are coupled to Hadoop. ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(file).build(); GenericRecord nextRecord = reader.read(); is the most popular answer in how to read a parquet file, in a standalone java code?,
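
For comparison with the Python entries elsewhere on this page, reading a local Parquet file with PyArrow involves no Hadoop dependency at all; a minimal sketch (the file name is a placeholder):

```python
import pyarrow.parquet as pq

# Reads the whole file into an in-memory Arrow table; no Hadoop Path,
# Configuration, or filesystem wiring is involved.
table = pq.read_table("example.parquet")   # placeholder local path
print(table.num_rows)
print(table.schema)
```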

Get Schema of Parquet file without loading file into spark data frame in python?

Submitted by 冷暖自知 on 2020-02-20 11:40:19
Question: Is there any Python library that can be used to get just the schema of a Parquet file? Currently we are loading the Parquet file into a dataframe in Spark and getting the schema from the dataframe to display in some UI of the application. But initializing a Spark context, loading the dataframe, and getting the schema from it is a time-consuming activity, so we are looking for an alternative way to get just the schema. Answer 1: This is supported by using pyarrow (https://github.com/apache/arrow/). from pyarrow
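
The answer snippet is cut off above; a minimal sketch of the pyarrow approach it points to (the file path is a placeholder) might look like:

```python
import pyarrow.parquet as pq

# read_schema only reads the Parquet footer metadata; the column data
# itself is never loaded, so this is fast even for large files.
schema = pq.read_schema("example.parquet")   # placeholder path
print(schema)         # full schema: column names and types
print(schema.names)   # just the column names
```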

Chapter 4: Spark SQL Data Sources

Submitted by 白昼怎懂夜的黑 on 2020-02-11 14:23:44
Previous post: Chapter 3, Creating a Spark SQL Program in IDEA. Generic load/save methods. 1. Manually specifying options. Spark SQL's DataFrame interface supports operations on many data sources. A DataFrame can be operated on in RDD style, or it can be registered as a temporary table; once registered, SQL queries can be run against it. Spark SQL's default data source is the Parquet format; when the data source is a Parquet file, Spark SQL can conveniently perform all of its operations on it. The default data source format can be changed via the configuration option spark.sql.sources.default.
Reading:
// Check the available file formats (tab completion in the shell):
scala> spark.read.
csv   format   jdbc   json   load   option   options   orc   parquet   schema   table   text   textFile
//
scala> spark.read.load("file:///usr/local/hadoop/module/datas/2.json")
Error message: (screenshot in the original post)
Checking the Spark installation files: (screenshot in the original post)
Trying to read the users.parquet file instead:
scala> spark.read.load("file:///usr/local/hadoop/Spark/spark-2.1.1-bin-hadoop2.7
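
As a sketch, the same generic load/save calls in PySpark (which the other entries on this page use) look like this; all paths below are placeholders, not taken from the excerpt:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("datasource-demo").getOrCreate()

# load() with no format uses the default data source, which is Parquet
# unless spark.sql.sources.default says otherwise.
users = spark.read.load("file:///path/to/users.parquet")            # placeholder path

# For a JSON file, name the format explicitly (or use spark.read.json).
events = spark.read.format("json").load("file:///path/to/2.json")   # placeholder path

# Saving mirrors reading: save() with no format writes Parquet by default.
events.write.mode("overwrite").save("file:///tmp/2_parquet")        # placeholder path
```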

Pitfalls Encountered While Setting Up Hive

Submitted by 江枫思渺然 on 2020-02-10 13:10:26
I. Basic functionality
1. Error when starting Hive:
java.lang.ExceptionInInitializerError
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:190)
    at org.apache.hadoop.hive.ql.stats.jdbc.JDBCStatsPublisher.init(JDBCStatsPublisher.java:265)
    at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:412)
Caused by: java.lang.SecurityException: sealing violation: package org.apache.derby.impl.jdbc.authentication is sealed
    at java.net.URLClassLoader.getAndVerifyPackage(URLClassLoader.java:388)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:417)
Solution: place the mysql-connector-java-5

Hadoop-Impala Study Notes: Getting Started

Submitted by 怎甘沉沦 on 2020-02-10 12:49:38
The CDH QuickStart VM contains the full single-node Hadoop service ecosystem and can be downloaded from https://www.cloudera.com/downloads/quickstart_vms/5-13.html (screenshot in the original post). The corresponding nodes are as follows (Cloudera Navigator is not included; screenshot in the original post). To study the complete Hadoop ecosystem, it is best to use a server with at least 8 cores and 32 GB of RAM; 4 cores and 16 GB can just barely run it (and ideally use two or more nodes). Impala is written in C++ (Spark is written in Scala) and uses an MPP architecture (similar to MariaDB ColumnStore, formerly InfiniDB). It is made up of the components below. Hue is a web-based query analyzer that offers syntax hints and can query Impala, HDFS, and HBase (screenshot in the original post). The Impala server itself consists of the Impala Daemon (executes SQL), the Impala Statestore (monitors daemon health), and the Impala Catalog (propagates DDL changes to the daemon nodes, which removes the need to run REFRESH/INVALIDATE METADATA when DDL is executed through Impala; it is still needed when the DDL goes through Hive). impala-shell is similar to the mysql client and is used to execute SQL. Impala uses the same metadata as Hive, which can be stored in MySQL or PostgreSQL and is known as the metastore.

14. Hive Compression and Storage: Principles and Practice

Submitted by 空扰寡人 on 2020-02-03 18:40:58
1. Hive compression
1.1 Data compression overview
Criteria for evaluating a compression scheme: (1) compression ratio; (2) compression time; (3) whether an already-compressed file can still be split. Splittable formats allow a single file to be processed by multiple mapper tasks, which gives better parallelism.
Hadoop codec classes: see the table further below.
1.2 Using data compression
A compression scheme can be evaluated against the following three criteria:
1. Compression ratio: the higher the ratio, the smaller the compressed file, so higher is better.
2. Compression time: the faster, the better.
3. Whether the compressed file can still be split: splittable formats allow a single file to be processed by multiple mapper tasks, giving better parallelism.
Common compression formats:
Format  | Compression ratio | Compression speed | Decompression speed | Splittable
gzip    | 13.4%             | 21 MB/s           | 118 MB/s            | No
bzip2   | 13.2%             | 2.4 MB/s          | 9.5 MB/s            | Yes
lzo     | 20.5%             | 135 MB/s          | 410 MB/s            | Yes
snappy  | 22.2%             | 172 MB/s          | 409 MB/s            | No
Hadoop codec classes:
Format  | Codec class
DEFLATE | org.apache.hadoop.io.compress.DefaultCodec
Gzip    | org.apache.hadoop.io.compress.GzipCodec
BZip2   | org.apache.hadoop.io.compress.BZip2Codec
LZO     | com.hadoop.compress
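
Tying this back to the Parquet questions above, a minimal PySpark sketch (all paths are placeholders) of how a compression codec is chosen when writing Parquet, which is how the compression=codec argument in the first question takes effect:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compression-demo").getOrCreate()

df = spark.read.json("file:///tmp/events.json")   # placeholder input path

# Snappy-compressed Parquet stays splittable because compression is applied
# per column chunk inside the file, not to the file as a whole.
df.write.mode("overwrite").parquet("file:///tmp/events_snappy", compression="snappy")

# gzip trades slower writes for smaller files.
df.write.mode("overwrite").option("compression", "gzip").parquet("file:///tmp/events_gzip")
```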