parquet

All About Parquet (Part 2): Schema Compatibility Issues in Spark

 ̄綄美尐妖づ submitted on 2020-03-16 17:49:23
Parquet is a storage format that is not tied to any particular language or platform, nor does it need to be bound to any specific data processing framework. But for an open-source technology to flourish it needs a suitable ecosystem behind it, and Spark is one of Parquet's core enablers. As an in-memory parallel computing engine, Spark is widely used for stream processing, batch (offline) processing, and other scenarios, and it has supported Parquet since version 1.0.0, making it convenient to work with this data. There are two ways to operate on Parquet files in Spark: loading the files directly, or reading the data through a Hive table. We will call these file loading and table loading. Both are very concise at the API level and hide underlying details such as schema inference and parallel data loading.

# By File
df = spark.read.parquet("s3://mydata/type=security")
# By Table
df = spark.read.table("data_mine.security_log")

In practice, we often run into schema compatibility problems. Their root cause is schema inconsistency, which mainly arises in two cases: Parquet files stored on HDFS/S3 have different schemas; the Hive Metastore schema does not match the schema embedded in the Parquet files. Whether because of changing requirements, product iteration, or other reasons, schemas inevitably change over time, leaving different Parquet files with different schemas.
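A common remedy for the first case (Parquet files with different schemas under one path) is Spark's schema-merging support. Below is a minimal PySpark sketch, assuming the same s3://mydata path used in the excerpt above; mergeSchema asks Spark to read every file's footer and union the schemas rather than trusting a single file (the equivalent session-wide setting is spark.sql.parquet.mergeSchema).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-schema-merge").getOrCreate()

# Union the schemas of all Parquet files under the path instead of sampling
# a single file; columns missing from older files come back as null.
df = (spark.read
      .option("mergeSchema", "true")
      .parquet("s3://mydata/type=security"))

df.printSchema()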

Python: save pandas data frame to parquet file

。_饼干妹妹 submitted on 2020-03-13 05:55:08
Question: Is it possible to save a pandas data frame directly to a parquet file? If not, what would be the suggested process? The aim is to be able to send the parquet file to another team, which they can then read/open using Scala code. Thanks! Answer 1: Pandas has a core function to_parquet(). Just write the dataframe to parquet format like this: df.to_parquet('myfile.parquet') You still need to install a parquet library such as fastparquet. If you have more than one parquet library installed, you also
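A minimal, self-contained sketch of the approach in the answer, with an illustrative file name; it assumes pyarrow (or fastparquet) is installed, and pins the engine explicitly so the behavior is unambiguous when more than one parquet library is present.

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "title": ["a", "b", "c"]})

# Write the frame to Parquet; the engine argument selects the backing library.
df.to_parquet("myfile.parquet", engine="pyarrow")

# Round-trip check; the same file can be opened from Scala/Spark on the other team's side.
print(pd.read_parquet("myfile.parquet", engine="pyarrow"))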

How to avoid small file problem while writing to hdfs & s3 from spark-sql-streaming

。_饼干妹妹 submitted on 2020-03-08 09:14:46
Question: I am using spark-sql-2.3.1v and Kafka with Java 8 in my project, with --driver-memory 4g \ --driver-cores 2 \ --num-executors 120 \ --executor-cores 1 \ --executor-memory 768m \ On the consumer side, I am trying to write the files to HDFS using something like the code below:
dataSet.writeStream()
  .format("parquet")
  .option("path", parqetFileName)
  .option("mergeSchema", true)
  .outputMode("Append")
  .partitionBy("company_id", "date")
  .option("checkpointLocation", checkPtLocation)
  .trigger(Trigger
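One common mitigation, not taken from the original post, is to make each micro-batch larger and reduce the number of write tasks per trigger, so each output file contains more rows. A hedged PySpark sketch under those assumptions (the rate source, paths, and column values are placeholders standing in for the Kafka pipeline above):

from pyspark.sql import SparkSession
from pyspark.sql.functions import current_date, lit

spark = SparkSession.builder.appName("stream-small-files").getOrCreate()

# Toy streaming source standing in for the parsed Kafka stream in the question.
events = (spark.readStream.format("rate").option("rowsPerSecond", 100).load()
          .withColumn("company_id", lit("acme"))
          .withColumn("date", current_date()))

query = (events
         .coalesce(4)                           # fewer write tasks per micro-batch -> fewer files
         .writeStream
         .format("parquet")
         .option("path", "/tmp/data/events")
         .option("checkpointLocation", "/tmp/checkpoints/events")
         .partitionBy("company_id", "date")
         .trigger(processingTime="10 minutes")  # longer trigger -> larger batches -> larger files
         .outputMode("append")
         .start())

query.awaitTermination()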

How to convert multiple parquet files into TFRecord files using Spark?

不羁岁月 submitted on 2020-02-28 17:24:08
Question: I would like to produce stratified TFRecord files from a large DataFrame based on a certain condition, for which I use write.partitionBy(). I'm also using the tensorflow-connector in Spark, but this apparently does not work together with a write.partitionBy() operation. Therefore, I have not found another way than to try to work in two steps: Repartition the dataframe according to my condition using partitionBy() and write the resulting partitions to parquet files. Read those parquet files to
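A hedged sketch of the two-step workaround described above, assuming the spark-tensorflow-connector package is on the classpath and provides the "tfrecords" data source; the DataFrame, the partition column ("label"), and the paths are illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-to-tfrecord").getOrCreate()

df = spark.createDataFrame(
    [(1, 0.5, "a"), (2, 1.5, "b"), (3, 2.5, "a")],
    ["id", "feature", "label"])

# Step 1: stratify on the condition column and persist as partitioned Parquet.
(df.repartition("label")
   .write.partitionBy("label")
   .mode("overwrite")
   .parquet("/tmp/strata"))

# Step 2: read each stratum back and convert it to TFRecord files.
for value in [r.label for r in df.select("label").distinct().collect()]:
    stratum = spark.read.parquet(f"/tmp/strata/label={value}")
    (stratum.write
        .format("tfrecords")              # data source from spark-tensorflow-connector
        .option("recordType", "Example")
        .mode("overwrite")
        .save(f"/tmp/tfrecords/label={value}"))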

Spark DataFrame Repartition and Parquet Partition

痴心易碎 submitted on 2020-02-26 10:17:37
Question: I am using repartition on columns to store the data in parquet. But I see that the number of partitioned parquet files is not the same as the number of RDD partitions. Is there no correlation between RDD partitions and parquet partitions? When I write the data to a parquet partition using RDD repartition and then read the data back from that parquet partition, is there any condition under which the RDD partition count will be the same during read and write? How is bucketing a dataframe using a column id and
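The usual relationship, sketched below rather than stated in the original question, is that write.partitionBy() only controls the directory layout; the number of files inside each key=value directory follows from how many in-memory partitions contain rows for that key. Repartitioning on the same column before the write therefore typically collapses each directory to a single file. The data and output path are illustrative.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("repartition-vs-partitionBy").getOrCreate()

df = spark.range(0, 1000).withColumn("country", (col("id") % 3).cast("string"))

# Without the repartition, each original task may write a file into every
# country=... directory; with it, rows for one country land in one task.
(df.repartition("country")
   .write
   .partitionBy("country")
   .mode("overwrite")
   .parquet("/tmp/by_country"))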

How to read parquet file with a condition using pyarrow in Python

时间秒杀一切 submitted on 2020-02-26 10:04:46
Question: I have created a parquet file with three columns (id, author, title) from a database and want to read the parquet file with a condition (title='Learn Python'). Below is the Python code I am using for this POC:
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import pyodbc

def write_to_parquet(df, out_path, compression='SNAPPY'):
    arrow_table = pa.Table.from_pandas(df)
    if compression == 'UNCOMPRESSED':
        compression = None
    pq.write_table(arrow_table, out_path,
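A hedged sketch of one way to apply such a condition at read time, using the filters argument of pyarrow.parquet.read_table; the file name mirrors the question but is illustrative.

import pyarrow.parquet as pq

# Push the predicate into the read; only matching row groups/rows are loaded.
table = pq.read_table(
    "books.parquet",
    columns=["id", "author", "title"],
    filters=[("title", "=", "Learn Python")])

print(table.to_pandas())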

Spark: what options can be passed with DataFrame.saveAsTable or DataFrameWriter.options?

时光总嘲笑我的痴心妄想 submitted on 2020-02-26 06:54:54
Question: Neither the developer documentation nor the API documentation includes any reference to what options can be passed in DataFrame.saveAsTable or DataFrameWriter.options, or how they would affect the saving of a Hive table. My hope is that in the answers to this question we can aggregate information that would be helpful to Spark developers who want more control over how Spark saves tables and, perhaps, provide a foundation for improving Spark's documentation. Answer 1: The reason you don't see options documented
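As a hedged illustration of options known to be honored when saving a Parquet-backed table (the table name and location below are made up): "path" controls where the data lands and effectively makes the table external, and "compression" is a standard Parquet data-source option.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("save-as-table-options")
         .enableHiveSupport()
         .getOrCreate())

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

(df.write
   .format("parquet")
   .option("path", "/tmp/warehouse/people")   # store data at an explicit location
   .option("compression", "snappy")           # Parquet codec option
   .mode("overwrite")
   .saveAsTable("default.people"))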

Are parquet files created with pyarrow vs pyspark compatible?

…衆ロ難τιáo~ submitted on 2020-02-25 06:03:40
Question: I have to convert analytics data in JSON to parquet in two steps. For the large amount of existing data I am writing a PySpark job and doing df.repartition(*partitionby).write.partitionBy(partitionby).mode("append").parquet(output, compression=codec). However, for incremental data I plan to use AWS Lambda. PySpark would probably be overkill for it, and hence I plan to use PyArrow instead (I am aware that it unnecessarily involves Pandas, but I couldn't find a better alternative). So,
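A minimal PyArrow sketch, under the assumption that matching Spark's directory layout is the goal: pyarrow.parquet.write_to_dataset partitions the output into key=value directories just like Spark's partitionBy, so Spark can read the Lambda-written data alongside the PySpark-written data. The columns, values, and output path are illustrative.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

records = pd.DataFrame({
    "event": ["click", "view", "click"],
    "dt": ["2020-02-25", "2020-02-25", "2020-02-26"],
})

table = pa.Table.from_pandas(records, preserve_index=False)

# Produces dt=2020-02-25/part-....parquet etc., mirroring Spark's partitionBy("dt").
pq.write_to_dataset(table, root_path="analytics_parquet", partition_cols=["dt"])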