parquet

Spark: Parquet DataFrame operations fail when forcing schema on read

感情迁移 Submitted on 2019-12-10 10:53:22
Question: (Spark 2.0.2) The problem here arises when you have parquet files with different schemas and force a schema during read. Even though you can print the schema and run show() fine, you cannot apply any filtering logic to the missing columns. Here are the two example schemata:

// assuming you are running this code in a Spark REPL
import spark.implicits._
case class Foo(i: Int)
case class Bar(i: Int, j: Int)

So Bar includes all the fields of Foo and adds one more (j). In real life this arises
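
A minimal PySpark sketch of the setup described above (the Scala case classes from the question are mirrored here with ad-hoc DataFrames; the /tmp paths and LongType columns are illustrative assumptions, not from the original post):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType

spark = SparkSession.builder.getOrCreate()

# Two frames with different schemas, written to separate parquet locations
spark.createDataFrame([(1,)], ["i"]).write.mode("overwrite").parquet("/tmp/foo_parquet")
spark.createDataFrame([(1, 2)], ["i", "j"]).write.mode("overwrite").parquet("/tmp/bar_parquet")

# Force the wider (Bar-like) schema while reading both locations
bar_schema = StructType([StructField("i", LongType()), StructField("j", LongType())])
df = spark.read.schema(bar_schema).parquet("/tmp/foo_parquet", "/tmp/bar_parquet")

df.printSchema()                    # shows both i and j
df.show()                           # works; j is null for the rows that lack it
df.filter("j is not null").show()   # the kind of filter the question reports failing on Spark 2.0.2

Forcing the wider schema lets printSchema() and show() succeed on both locations, which matches the behaviour the question describes before the filtering step.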

Pandas cannot read parquet files created in PySpark

╄→尐↘猪︶ㄣ Submitted on 2019-12-10 10:48:59
Question: I am writing a parquet file from a Spark DataFrame the following way:

df.write.parquet("path/myfile.parquet", mode="overwrite", compression="gzip")

This creates a folder with multiple files in it. When I try to read this into pandas, I get the following errors, depending on which parser I use:

import pandas as pd
df = pd.read_parquet("path/myfile.parquet", engine="pyarrow")

PyArrow: File "pyarrow\error.pxi", line 83, in pyarrow.lib.check_status ArrowIOError: Invalid parquet file. Corrupt
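
One hedged thing to try, sticking with pyarrow but bypassing pd.read_parquet: point pyarrow's ParquetDataset at the whole folder Spark wrote (the path below is the one from the question):

import pyarrow.parquet as pq

# The directory Spark wrote, containing the part-*.parquet files
dataset = pq.ParquetDataset("path/myfile.parquet")
table = dataset.read()
df = table.to_pandas()
print(df.head())

ParquetDataset assembles the folder's part files into a single table before handing it to pandas.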

Impala: How to query against multiple parquet files with different schemata

♀尐吖头ヾ Submitted on 2019-12-10 10:23:04
Question: In Spark 2.1 I often use something like

df = spark.read.parquet("/path/to/my/files/*.parquet")

to load a folder of parquet files, even with different schemata. Then I run some SQL queries against the DataFrame using Spark SQL. Now I want to try Impala, because I read the wiki article, which contains sentences like: Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop [...]. Reads Hadoop file formats,
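
For context, a minimal sketch of the Spark-side workflow the question describes, with the mergeSchema option added so that part files with different schemas are unioned into one DataFrame; the path and table name are illustrative, and this does not answer the Impala part:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# mergeSchema asks Spark to union the columns of all part files it finds
df = spark.read.option("mergeSchema", "true").parquet("/path/to/my/files/")
df.createOrReplaceTempView("my_table")
spark.sql("SELECT COUNT(*) FROM my_table").show()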

How to Query parquet data from Amazon Athena?

↘锁芯ラ Submitted on 2019-12-10 09:15:57
Question: Athena creates a temporary table using fields in an S3 table. I have done this using JSON data. Could you help me with how to create a table using parquet data? I have tried the following:

1. Converted sample JSON data to parquet data.
2. Uploaded the parquet data to S3.
3. Created a temporary table using the columns of the JSON data.

By doing this I am able to execute a query, but the result is empty. Is this approach right, or is there another approach to be followed for parquet data? Sample JSON data: {"_id":
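
A hedged sketch of the first step (converting sample JSON to parquet) using pandas and pyarrow; the file names are illustrative and the JSON is assumed to be one object per line. As for the empty result, a common cause is a table DDL that still uses the JSON SerDe; for parquet data the Athena table generally needs to be declared with STORED AS PARQUET.

import pandas as pd

# Assumes one JSON object per line in the sample file
df = pd.read_json("sample.json", lines=True)

# Columnar file to upload to S3 for Athena to query
df.to_parquet("sample.parquet", engine="pyarrow")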

Saving empty DataFrame with known schema (Spark 2.2.1)

不羁岁月 Submitted on 2019-12-10 07:24:37
Question: Is it possible to save an empty DataFrame with a known schema, such that the schema is written to the file even though it has 0 records?

def example(spark: SparkSession, path: String, schema: StructType) = {
  val dataframe = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
  val dataframeWriter = dataframe.write.mode(SaveMode.Overwrite).format("parquet")
  dataframeWriter.save(path)
  spark.read.load(path) // ERROR!! No files to read, so schema unknown
}

Answer 1: This is the answer I
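
A hedged PySpark sketch of the same idea; the column names and /tmp path are illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

schema = StructType([StructField("name", StringType()), StructField("age", IntegerType())])
empty = spark.createDataFrame([], schema)   # zero rows, known schema

empty.write.mode("overwrite").parquet("/tmp/empty_table")

# Recent Spark versions write an empty part file that still carries the schema,
# so this read succeeds; the question reports it failing on 2.2.1
spark.read.parquet("/tmp/empty_table").printSchema()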

Read parquet files from HDFS using PyArrow

情到浓时终转凉″ Submitted on 2019-12-09 20:42:51
Question: I know I can connect to an HDFS cluster via pyarrow using pyarrow.hdfs.connect(). I also know I can read a parquet file using pyarrow.parquet's read_table(). However, read_table() accepts a file path, whereas hdfs.connect() gives me a HadoopFileSystem instance. Is it somehow possible to use just pyarrow (with libhdfs3 installed) to get hold of a parquet file/folder residing in an HDFS cluster? What I wish to get to is the to_pydict() function, so I can pass the data along. Answer 1: Try fs = pa
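
A hedged sketch using pyarrow's legacy hdfs API, as in the question (newer pyarrow releases move HDFS support to pyarrow.fs.HadoopFileSystem); the host, port, and path are placeholders:

import pyarrow as pa
import pyarrow.parquet as pq

fs = pa.hdfs.connect(host="namenode", port=8020)

# ParquetDataset accepts a filesystem, so it can point at an HDFS file or folder
dataset = pq.ParquetDataset("/data/myfile.parquet", filesystem=fs)
table = dataset.read()
records = table.to_pydict()   # the dict-of-columns the question wants to reach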

How to convert a 500GB SQL table into Apache Parquet?

巧了我就是萌 Submitted on 2019-12-09 10:13:52
Question: Perhaps this is well documented, but I am getting very confused about how to do this (there are many Apache tools). When I create an SQL table, I create it using the following commands:

CREATE TABLE table_name(
  column1 datatype,
  column2 datatype,
  column3 datatype,
  .....
  columnN datatype,
  PRIMARY KEY( one or more columns )
);

How does one convert this existing table into Parquet? Is the file written to disk? If the original data is several GB, how long does one have to wait? Could I format the
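
One common route is to read the table over JDBC with Spark and write it back out as parquet. The sketch below is hedged: the JDBC URL, credentials, partitioning column, and output path are all illustrative assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")   # placeholder connection string
      .option("dbtable", "table_name")
      .option("user", "user")
      .option("password", "secret")
      .option("partitionColumn", "column1")                  # must be a numeric/date column
      .option("lowerBound", "0")
      .option("upperBound", "100000000")
      .option("numPartitions", "32")                         # parallel range reads for a large table
      .load())

# Parquet files are written to disk under this directory, in parallel
df.write.mode("overwrite").parquet("/data/table_name_parquet")

The partitionColumn/lowerBound/upperBound/numPartitions options split the large read into parallel range queries, which is what keeps the wait tolerable for a table of this size.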

How do I get schema / column names from parquet file?

余生长醉 Submitted on 2019-12-09 07:50:07
Question: I have a file stored in HDFS as part-m-00000.gz.parquet. I've tried to run hdfs dfs -text dir/part-m-00000.gz.parquet, but it's compressed, so I ran gunzip part-m-00000.gz.parquet, but it doesn't uncompress the file since it doesn't recognise the .parquet extension. How do I get the schema / column names for this file? Answer 1: You won't be able to "open" the file using hdfs dfs -text because it's not a text file. Parquet files are written to disk very differently compared to text files. And for the
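
A hedged Python alternative to the usual parquet-tools answer, assuming the file has been copied out of HDFS first (e.g. with hdfs dfs -get). The .gz in the name refers to parquet's internal gzip compression of column chunks, not to the file as a whole, which is why gunzip refuses it.

import pyarrow.parquet as pq

schema = pq.read_schema("part-m-00000.gz.parquet")   # a local copy of the file
print(schema)          # field names and types
print(schema.names)    # just the column names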

Spark & file compression

核能气质少年 Submitted on 2019-12-08 19:24:43
Files stored in HDFS are generally kept in multiple replicas. Compressing them not only saves a great deal of space; a suitable storage format can also give a very large boost to read performance.

Text file compression

bzip2: highest compression ratio, relatively slow compression and decompression, supports split.

import org.apache.hadoop.io.compress.BZip2Codec
rdd.saveAsTextFile("codec/bzip2", classOf[BZip2Codec])

snappy: 38.2% compression ratio on JSON text, short compression and decompression times.

import org.apache.hadoop.io.compress.SnappyCodec
rdd.saveAsTextFile("codec/snappy", classOf[SnappyCodec])

gzip: high compression ratio, fairly fast compression and decompression, does not support split; if file sizes are not kept under control, later analysis may suffer from poor efficiency. 23.5% compression ratio on JSON text, suitable for rarely used files kept in long-term storage.

import org.apache.hadoop.io.compress.GzipCodec
rdd.saveAsTextFile("codec/gzip", classOf[GzipCodec])

Parquet file compression

Parquet provides files with columnar storage
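
Since the excerpt breaks off as it turns to parquet, here is a hedged PySpark sketch of choosing a parquet compression codec; the path and the stand-in data are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)   # stand-in data

# Per-write codec; spark.sql.parquet.compression.codec sets the session-wide default
df.write.option("compression", "snappy").parquet("codec/parquet_snappy")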

Scala Spark - overwriting a parquet file fails to delete file or dir

瘦欲@ Submitted on 2019-12-08 10:30:28
Question: I'm trying to create parquet files for several days locally. The first time I run the code, everything works fine. The second time it fails to delete a file. The third time it fails to delete another file. It's totally random which file cannot be deleted. The reason I need this to work is that I want to create parquet files every day for the last seven days, so the parquet files that are already there should be overwritten with the updated data. I use Project SDK 1.8, Scala version 2.11.8
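
A hedged PySpark sketch of the daily write pattern described in the question, with each day written to its own folder so that overwrite only replaces that day's files; the dates, path layout, and stand-in data are illustrative:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import datetime

spark = SparkSession.builder.getOrCreate()

for offset in range(7):
    day = (datetime.date.today() - datetime.timedelta(days=offset)).isoformat()
    daily_df = spark.range(100).withColumn("day", F.lit(day))   # stand-in for the real daily data
    # Each day goes to its own folder, so overwrite only touches that day's files
    daily_df.write.mode("overwrite").parquet(f"output/day={day}")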