parquet

How to detect Parquet files?

家住魔仙堡 submitted on 2021-02-20 02:13:40
Question: I have a script that will work with either plain text or Parquet files. If the input is a Parquet file, the script reads it into a DataFrame and does a few other things. On the cluster I am working on, the easiest first solution was to check whether the file's extension was .parquet:

if (parquetD(1) == "parquet") { if (args.length != 2) { println(usage2) System.exit(1) println(args) } }

and, if so, read it in with the DataFrame. The problem is that I have a bunch of files some people have created with no …
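
Since some of the files carry no extension, one check that does not rely on the file name is to look for Parquet's 4-byte magic number PAR1, which appears at both the start and the end of every Parquet file. Below is a minimal sketch of that idea in Python (not from the original post); the path is hypothetical and assumes the files are reachable on a local or mounted filesystem — on HDFS or S3 you would read the same bytes through the corresponding filesystem API.

import os

def is_parquet(path):
    # A Parquet file starts and ends with the 4-byte magic number b"PAR1".
    if os.path.getsize(path) < 8:      # too small to hold header + footer magic
        return False
    with open(path, "rb") as f:
        head = f.read(4)               # first 4 bytes
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)               # last 4 bytes
    return head == b"PAR1" and tail == b"PAR1"

print(is_parquet("/data/incoming/some_file"))   # hypothetical path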

Spark compression when writing to external Hive table

情到浓时终转凉″ submitted on 2021-02-18 11:28:28
Question: I'm inserting into an external Hive Parquet table from Spark 2.1 (using df.write.insertInto(...)). By setting e.g. spark.sql("SET spark.sql.parquet.compression.codec=GZIP") I can switch between SNAPPY, GZIP and uncompressed. I can verify that the file size (and the file name ending) is influenced by these settings; I get a file named e.g. part-00000-5efbfc08-66fe-4fd1-bebb-944b34689e70.gz.parquet. However, if I work with a partitioned Hive table, this setting does not have any effect; the file size is …
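
A hedged sketch of one thing to check (an assumption, since the excerpt is cut off): when the insert into a partitioned Hive table is handled by Hive's Parquet SerDe rather than Spark's native Parquet writer, spark.sql.parquet.compression.codec is ignored and the Hive-side property parquet.compression controls the codec instead. Table names below are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Codec for Spark's native Parquet writer (applies when Spark converts the insert itself)
spark.sql("SET spark.sql.parquet.compression.codec=gzip")
# Codec for Hive's Parquet SerDe (applies when the insert goes through the Hive write path)
spark.sql("SET parquet.compression=GZIP")

df = spark.table("my_db.staging_table")                 # hypothetical source
df.write.insertInto("my_db.partitioned_parquet_table")  # hypothetical partitioned Hive table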

Improving Hive SQL query performance: implementing the Analyze approach

风格不统一 submitted on 2021-02-17 13:50:16
0. Introduction
Analyze, i.e. analyzing a table (also called computing statistics), is a built-in Hive operation that can be run to collect metadata about a table. It can greatly improve query times on the table, because it gathers the row count, file count and file size (in bytes) of the data that makes up the table and hands this information to the query planner before execution.
1. How do you analyze a table?
Basic analyze statement:

ANALYZE TABLE my_database_name.my_table_name COMPUTE STATISTICS;

This is the basic form; it works whether or not the table is partitioned, and if your table is partitioned you should run it regularly.
Analyze a specific partition:

ANALYZE TABLE my_database_name.my_table_name PARTITION (YEAR=2019, MONTH=5, DAY=12) COMPUTE STATISTICS;

This is the fine-grained form. It collects metadata for the specified partition and stores it in the Hive Metastore for query optimization. The information includes, per column: the number of distinct values, the number of NULL values, the average column size, the average or the sum of all values in the column (if the type is numeric), and percentiles of the values.
Analyze columns:

ANALYZE TABLE my_database_name.my_table_name COMPUTE STATISTICS FOR COLUMNS column1, column2, column3;

It collects metadata on the specified columns …
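
As a small addition to the article (a sketch, not part of the original): if you drive these statements through Spark SQL with Hive support, the collected statistics can be read back from the Metastore with DESCRIBE EXTENDED; in Hive itself, DESCRIBE FORMATTED serves the same purpose. Table and column names are the article's placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Run the table-level and column-level analysis from above through Spark SQL
spark.sql("ANALYZE TABLE my_database_name.my_table_name COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE my_database_name.my_table_name "
          "COMPUTE STATISTICS FOR COLUMNS column1, column2")

# Row count and size now appear under 'Statistics' in the extended description
spark.sql("DESCRIBE EXTENDED my_database_name.my_table_name").show(truncate=False)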

Spark Tuning | Spark SQL parameter tuning

|▌冷眼眸甩不掉的悲伤 submitted on 2021-02-14 07:38:43
Preface
Spark SQL has a great many parameters, and they are not clearly explained on the Spark website, probably because there are so many of them; you can list the parameters supported by your spark-sql version by running set -v in spark-sql. This article covers parameter tuning issues recently encountered while taking part in a Hive-to-Spark migration. It has two parts: the first covers parameters that had to be set to resolve exceptions, the second covers tuning done to improve performance.
Exception tuning: spark.sql.hive.convertMetastoreParquet
Parquet is a columnar storage format that can be used by both spark-sql and Hive. In Spark, creating a table with USING parquet produces a Spark DataSource table, while creating it with STORED AS parquet produces a Hive table. spark.sql.hive.convertMetastoreParquet defaults to true, which means Spark SQL's built-in Parquet reader and writer are used (for deserialization and serialization); this gives better performance. If it is set to false, Hive's serialization path is used instead. Sometimes, however, when it is set to true, querying the table with Hive returns data while querying it with Spark returns nothing. However, in some cases, when spark.sql.hive …
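
A minimal sketch of the toggle described above (not from the original article; the table name is hypothetical and assumes a Hive table created with STORED AS PARQUET):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Default (true): read the Hive Parquet table with Spark SQL's built-in Parquet reader
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")
spark.table("my_db.hive_parquet_table").show()

# Fall back to Hive's SerDe, e.g. when the built-in reader returns no rows
# for data that is visible when the table is queried through Hive
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
spark.table("my_db.hive_parquet_table").show()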

What is the difference between a dataframe created using SparkR and a dataframe created using sparklyr?

可紊 submitted on 2021-02-11 12:32:33
Question: I am reading a Parquet file in Azure Databricks: using SparkR, read.parquet(); using sparklyr, spark_read_parquet(). The two dataframes are different. Is there any way to convert a SparkR dataframe into a sparklyr dataframe, and vice versa?
Answer 1: sparklyr creates a tbl_spark. This is essentially just a lazy query written in Spark SQL. SparkR creates a SparkDataFrame, which is more a collection of data that is organized using a plan. In the same way you can't use a tbl as a normal data.frame …

Apache Spark + Parquet not Respecting Configuration to use “Partitioned” Staging S3A Committer

妖精的绣舞 submitted on 2021-02-11 12:31:30
Question: I am writing partitioned data (Parquet files) to AWS S3 using Apache Spark (3.0) from my local machine, without Hadoop installed on it. I was getting a FileNotFoundException while writing to S3 when I had a lot of files to write across around 50 partitions (partitionBy = date). Then I came across the new S3A committers, so I tried to configure the "partitioned" committer instead. But I can still see that Spark uses ParquetOutputCommitter instead of PartitionedStagingCommitter when the file …
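
One configuration that is often the missing piece here (offered as an assumption, since the question is cut off): besides selecting the committer, Spark's Parquet path has to be bound to the Hadoop PathOutputCommitter via the two classes shipped in the spark-hadoop-cloud module, otherwise ParquetOutputCommitter keeps being used. A hedged PySpark sketch, with the bucket name hypothetical:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Select the S3A "partitioned" staging committer
    .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
    .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
            "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
    # Bind Spark's commit protocol and Parquet committer to the Hadoop committer factory
    # (both classes come from the spark-hadoop-cloud module)
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "2021-02-01"), (2, "2021-02-02")], ["id", "date"])
df.write.partitionBy("date").parquet("s3a://my-bucket/output/")   # hypothetical bucket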

Azure Data Factory - MS Access as Source Database - Error

有些话、适合烂在心里 submitted on 2021-02-10 20:31:03
Question: My source is an Access database, and I dynamically generate the source query as 'Select * from <tableName>'. But the source table has field names containing spaces, and the destination is of type .parquet, so the Data Factory pipeline fails with the error below. For example, if table Employee has a column 'First Name':

{ "errorCode": "2200", "message": "Failure happened on 'Sink' side. ErrorCode=UserErrorJavaInvocationException,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=An error occurred …

How to manage many avsc files in Flink when consuming multiple topics gracefully

南笙酒味 submitted on 2021-02-10 18:26:20
Question: Here is my case: I use Flink to consume many Kafka topics with SimpleStringSchema. OutputTag is used since we need to bucket the data, as Parquet + Snappy, into directories by topic later. Then we go through all the topics, and each topic is processed with an AVSC schema file. Now I have to modify the avsc schema files when new columns are added, which gets me into trouble when ten or a hundred files need to be modified. So is there a more graceful way to avoid changing the avsc files, or how to manage …

How to use PyArrow to achieve a stream-writing effect

↘锁芯ラ submitted on 2021-02-10 17:48:02
Question: The data I have is a kind of streaming data, and I want to store it in a single Parquet file. But PyArrow overwrites the Parquet file every time, so what should I do? I tried not closing the writer, but that seems impossible: if I don't close it, I cannot read the file. Here is the code:

import pyarrow.parquet as pp
import pyarrow as pa
for name in ['LEE','LSY','asd','wer']:
    writer=pq.ParquetWriter('d:/test.parquet', table.schema)
    arrays=[pa.array([name]),pa.array([2])]
…
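
A minimal sketch of the usual fix (an assumption, since the post is truncated): build the schema explicitly, create one ParquetWriter, keep it open while appending each batch with write_table, and close it once at the end; each call appends a row group instead of overwriting the file. Note that the question's snippet imports pyarrow.parquet as pp but then uses pq, and it recreates the writer inside the loop.

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("name", pa.string()), ("value", pa.int64())])
writer = pq.ParquetWriter("d:/test.parquet", schema)       # path taken from the question

for name in ["LEE", "LSY", "asd", "wer"]:
    batch = pa.table({"name": [name], "value": [2]}, schema=schema)
    writer.write_table(batch)                              # appends one row group per batch

writer.close()                                             # the file is readable only after close()
print(pq.read_table("d:/test.parquet"))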