parquet

How to handle the small file problem in Spark Structured Streaming?

拥有回忆 submitted on 2021-02-06 02:59:49
Question: I have a scenario in my project where I am reading Kafka topic messages using spark-sql-2.4.1. I am able to process the data using Structured Streaming. Once the data is received and processed, I need to save it into the respective parquet files in the HDFS store. I am able to store and read parquet files, and I kept a trigger time of 15 seconds to 1 minute. These files are very small in size, hence resulting in many files. These parquet files need to be read later by Hive…
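One common mitigation (a sketch under assumed broker, topic, and path names, not the poster's exact setup) is to use a longer trigger interval and coalesce each micro-batch before writing, so every batch lands as a small, fixed number of parquet files; a separate periodic compaction job over the output directory is another option. A minimal PySpark sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

# hypothetical Kafka source; broker and topic names are placeholders
stream_df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "my_topic")
    .load())

def write_batch(batch_df, batch_id):
    # coalesce each micro-batch so it lands as one parquet file instead of many tiny ones
    batch_df.coalesce(1).write.mode("append").parquet("hdfs:///data/events")  # hypothetical path

query = (stream_df.writeStream
    .foreachBatch(write_batch)
    .trigger(processingTime="5 minutes")                          # longer trigger => fewer, larger files
    .option("checkpointLocation", "hdfs:///checkpoints/events")   # hypothetical path
    .start())

query.awaitTermination()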

Partition column is moved to end of row when saving a file to Parquet

我只是一个虾纸丫 submitted on 2021-02-04 18:17:13
Question: For a given DataFrame, just before it is saved to parquet, here is the schema: notice that centroid0 is the first column and is StringType. However, when saving the file using:
df.write.partitionBy(dfHolder.metadata.partitionCols: _*).format("parquet").mode("overwrite").save(fpath)
with the partitionCols as centroid0, there is a (to me) surprising result: the centroid0 partition column has been moved to the end of the row, and the data type has been changed to Integer. I confirmed the…
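For context (a sketch, not a statement about the poster's exact data): partitionBy stores the partition column in the directory names rather than inside the parquet files, so on read Spark appends it after the data columns and infers its type from the path values, which is why a string like "123" can come back as an integer. One way to keep the string type and restore the column order from PySpark, assuming a hypothetical output path:

# keep partition values as strings instead of letting Spark infer numeric types
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

df = spark.read.parquet("/data/centroids")  # hypothetical path written with partitionBy("centroid0")

# move the partition column back to the front if the original ordering matters
df = df.select("centroid0", *[c for c in df.columns if c != "centroid0"])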

GUI or CLI to create parquet file

你。 submitted on 2021-01-29 13:14:38
Question: I want to provide the people I work with a tool to create parquet files to be used for unit tests of modules that read and process such files. I use ParquetViewer to view the content of parquet files, but I would like to have a tool to make (sample) parquet files. Is there such a tool to create parquet files with a GUI, or some practical CLI otherwise? Note: I would prefer a cross-platform solution, but if not, I am looking for a Windows/MinGW solution in order to use it at work, where I cannot…
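Not a GUI, but as a practical scriptable alternative, a few lines of pandas with the pyarrow engine are often enough to produce sample parquet files for unit tests on any platform. A minimal sketch with hypothetical columns and values:

import pandas as pd  # requires pandas and pyarrow to be installed

# build a small sample frame; columns and values are placeholders
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["alice", "bob", "carol"],
    "amount": [10.5, 20.0, 0.25],
})

df.to_parquet("sample.parquet", engine="pyarrow", index=False)

# quick round-trip check of the generated file
print(pd.read_parquet("sample.parquet"))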

Spark writing Parquet array<string> converts to a different datatype when loading into BigQuery

拥有回忆 submitted on 2021-01-29 07:37:03
Question: Spark DataFrame schema:
StructType([StructField("a", StringType(), False), StructField("b", StringType(), True), StructField("c", BinaryType(), False), StructField("d", ArrayType(StringType(), False), True), StructField("e", TimestampType(), True)])
When I write the data frame to parquet and load it into BigQuery, it interprets the schema differently. It is a simple load from JSON and a write to parquet using a Spark DataFrame. BigQuery schema:
[ { "type": "STRING", "name": "a", "mode":…
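For context, Spark writes array columns using parquet's three-level LIST encoding, which BigQuery surfaces as a nested record (e.g. d.list.element) unless list inference is enabled on the load job. A minimal sketch of enabling it with the google-cloud-bigquery client; this assumes a recent client version and hypothetical GCS paths and table names:

from google.cloud import bigquery
from google.cloud.bigquery.format_options import ParquetOptions

client = bigquery.Client()  # assumes credentials are already configured

parquet_options = ParquetOptions()
parquet_options.enable_list_inference = True  # map LIST-encoded columns to repeated fields

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    parquet_options=parquet_options,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/spark-output/*.parquet",  # hypothetical path written by Spark
    "my_project.my_dataset.my_table",         # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # wait for the load to complete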

How to read HDFS files with a wildcard character using PySpark

这一生的挚爱 submitted on 2021-01-29 05:18:32
Question: There are some parquet file paths:
/a/b/c='str1'/d='str'
/a/b/c='str2'/d='str'
/a/b/c='str3'/d='str'
I want to read the parquet files like this:
df = spark.read.parquet('/a/b/c='*'/d='str')
but it doesn't work when using the "*" wildcard character. How can I do that? Thank you for helping.
Answer 1: You need to escape the single quotes:
df = spark.read.parquet('/a/b/c=\'*\'/d=\'str\'')
... or just use double quotes:
df = spark.read.parquet("/a/b/c='*'/d='str'")
Source: https://stackoverflow.com/questions

Error while loading parquet format file into Amazon Redshift using copy command and manifest file

…衆ロ難τιáo~ submitted on 2021-01-28 19:57:01
Question: I'm trying to load a parquet file using a manifest file and getting the error below:
query: 124138 failed due to an internal error. File 'https://s3.amazonaws.com/sbredshift-east/data/000002_0' has an invalid version number: )
Here is my copy command:
copy testtable from 's3://sbredshift-east/manifest/supplier.manifest' IAM_ROLE 'arn:aws:iam::123456789:role/MyRedshiftRole123' FORMAT AS PARQUET manifest;
Here is my manifest file:
{ "entries":[ { "url":"s3://sbredshift-east/data/000002_0", "mandatory"…
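Two things worth checking here (as general guidance, not a diagnosis of this exact file): that 000002_0 really is a parquet file (that naming pattern is also produced by Hive for ORC or text output, which can trigger a version-number error), and that each manifest entry carries the meta.content_length key that COPY requires for columnar formats. A minimal boto3 sketch that regenerates the manifest with content lengths, reusing the bucket and prefix from the question as placeholders:

import json
import boto3

s3 = boto3.client("s3")
bucket, prefix = "sbredshift-east", "data/"

entries = []
for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
    entries.append({
        "url": f"s3://{bucket}/{obj['Key']}",
        "mandatory": True,
        # COPY ... FORMAT AS PARQUET requires content_length for every entry
        "meta": {"content_length": obj["Size"]},
    })

s3.put_object(
    Bucket=bucket,
    Key="manifest/supplier.manifest",
    Body=json.dumps({"entries": entries}).encode("utf-8"),
)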

Rename Column in Athena

时光总嘲笑我的痴心妄想 submitted on 2021-01-28 14:27:05
Question: The Athena table "organization" reads data from parquet files in S3. I need to change a column name from "cost" to "fee". The data files go back to Jan 2018. If I just rename the column in Athena, the table won't be able to find data for the new column in the parquet files. Please let me know if there are ways to resolve it.
Answer 1: You have to change the schema and point to the new column "fee". But it depends on your situation. If you have two data sets, in one dataset it is called "cost" and in another dataset it is…
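One low-risk workaround (a sketch, not necessarily what the answer above has in mind) is to leave the table schema matching the files and expose the rename through a view, so the old parquet files keep resolving the column by its stored name. Issued here through boto3, with hypothetical database, result location, and remaining columns:

import boto3

athena = boto3.client("athena", region_name="us-east-1")  # hypothetical region

ddl = """
CREATE OR REPLACE VIEW organization_v AS
SELECT cost AS fee,   -- expose the old column under the new name
       org_id         -- hypothetical remaining columns
FROM organization
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "my_db"},                        # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical bucket
)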

How to rename AWS Athena columns with parquet file source?

|▌冷眼眸甩不掉的悲伤 submitted on 2021-01-28 09:28:59
Question: I have data loaded in my S3 bucket folder as multiple parquet files. After loading them into Athena I can query the data successfully. What are the ways to rename the Athena table columns for a parquet file source and still be able to see the data under the renamed column after querying? Note: I checked the edit schema option; the column gets renamed, but after querying you will not see data under that column.
Answer 1: There is, as far as I know, no way to create a table with different names for the…
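Because Athena resolves parquet columns by name by default, a renamed table column no longer matches the name stored in the files and comes back empty. Athena's schema-update guidance describes an index-based access mode for parquet that lets renames work as long as the column order is unchanged; the sketch below shows one possible table definition using that SerDe property (treat the exact property placement as an assumption and check the Athena docs), with hypothetical names, types, and locations:

import boto3

athena = boto3.client("athena", region_name="us-east-1")  # hypothetical region

ddl = """
CREATE EXTERNAL TABLE my_table_renamed (      -- hypothetical table and columns
  id string,
  fee double                                  -- renamed; matched to the file column by position
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('parquet.column.index.access'='true')
STORED AS PARQUET
LOCATION 's3://my-bucket/data/'               -- hypothetical data location
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "my_db"},                        # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical bucket
)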

Do Spark/Parquet partitions maintain ordering?

流过昼夜 submitted on 2021-01-28 03:09:44
Question: If I partition a data set, will it be in the correct order when I read it back? For example, consider the following pyspark code:
# read a csv
df = sql_context.read.csv(input_filename)
# add a hash column
hash_udf = udf(lambda customer_id: hash(customer_id) % 4, IntegerType())
df = df.withColumn('hash', hash_udf(df['customer_id']))
# write out to parquet
df.write.parquet(output_path, partitionBy=['hash'])
# read back the file
df2 = sql_context.read.parquet(output_path)
I am partitioning on a…
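In general Spark does not guarantee row order across files or partitions when reading parquet back, so the safer pattern is to carry an explicit ordering column and sort on it after the read. A minimal sketch of that approach, reusing the df, output_path, and sql_context names from the snippet above:

from pyspark.sql import functions as F

# tag each row with a monotonically increasing id before writing
df = df.withColumn("row_order", F.monotonically_increasing_id())
df.write.parquet(output_path, partitionBy=["hash"])

# read back and restore the original order explicitly
df2 = sql_context.read.parquet(output_path).orderBy("row_order")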