parquet

pySpark: java.lang.UnsupportedOperationException: Unimplemented type: StringType

泄露秘密 submitted on 2019-12-11 09:05:47
Question: While reading a group of parquet files written with inconsistent schemas, we ran into an issue with schema merging. After switching to manually specifying the schema, I get the following error. Any pointer would be helpful. java.lang.UnsupportedOperationException: Unimplemented type: StringType at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readDoubleBatch(VectorizedColumnReader.java:389) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch
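A minimal PySpark sketch of two commonly suggested workarounds, assuming the failing column was physically written as DOUBLE in some files while the manually specified schema declares it as StringType; the path, column name and schema below are placeholders, and neither option is guaranteed to cover every type combination:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.getOrCreate()
    my_schema = StructType([StructField("some_col", StringType(), True)])  # placeholder for the manually specified schema

    # Option 1: let Spark reconcile the differing file schemas itself
    # instead of forcing a single manual schema onto all files.
    df = spark.read.option("mergeSchema", "true").parquet("/path/to/parquet")

    # Option 2: fall back to the non-vectorized parquet reader, since the stack
    # trace points at VectorizedColumnReader as the code path that rejects
    # this DOUBLE-to-StringType request.
    spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
    df = spark.read.schema(my_schema).parquet("/path/to/parquet")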

How to read a parquet bytes object in python

蹲街弑〆低调 submitted on 2019-12-11 06:20:05
Question: I have a Python object which I know is a parquet file loaded into the object. (I do not have the possibility to actually read it from a file.) The object var_1 contains b'PAR1\x15\x....1\x00PAR1 and when I check the type: type(var_1) the result is bytes. Is there a way to read this, say into a pandas DataFrame? I have tried: 1) from fastparquet import ParquetFile pf = ParquetFile(var_1) and got: TypeError: a bytes-like object is required, not 'str' 2) import pyarrow.parquet as pq
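A minimal sketch of reading those bytes, assuming var_1 holds a complete parquet file: wrap the bytes in an in-memory file object, since pyarrow expects a path or file-like object rather than raw bytes (the open(...) line is only a stand-in for however var_1 was obtained):

    import io
    import pyarrow.parquet as pq

    var_1 = open("example.parquet", "rb").read()  # stand-in for the bytes object from the question

    buf = io.BytesIO(var_1)        # present the bytes as a seekable file-like object
    table = pq.read_table(buf)     # pyarrow reads from file-like objects
    df = table.to_pandas()         # pandas DataFrame, as asked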

Using parquet-mr in Scala without Spark

爱⌒轻易说出口 submitted on 2019-12-11 05:35:17
Question: I'm trying to read a .parquet file in Scala without using Spark. I found this SO post, but so far I have been unable to work out how to use the parquet-mr library to actually read from a file (including getting the schema). There are classes like RecordReader.java and RecordReaderImplementation.java (which extends RecordReader), but I'm struggling to understand how to use them from my Scala code. I'm very new to Scala and the Parquet format, but would like to accomplish this without using Spark. What

DataFrame.write.parquet - Parquet-file cannot be read by HIVE or Impala

情到浓时终转凉″ submitted on 2019-12-11 04:17:57
Question: I wrote a DataFrame with pySpark into HDFS with this command: df.repartition(col("year"))\ .write.option("maxRecordsPerFile", 1000000)\ .parquet('/path/tablename', mode='overwrite', partitionBy=["year"], compression='snappy') When I take a look into HDFS I can see that the files are properly laid out there. However, when I try to read the table with HIVE or Impala, the table cannot be found. What's going wrong here, am I missing something? Interestingly, df.write.format('parquet').saveAsTable
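For context: df.write.parquet only drops files into HDFS and registers nothing in the Hive metastore, which is why Hive and Impala see no table. A hedged sketch of one way to expose the files, assuming a Hive-enabled SparkSession; the table name, column list and location are placeholders based on the question's path:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Create an external table over the directory the DataFrame was written to.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS tablename (col1 STRING)
        PARTITIONED BY (year INT)
        STORED AS PARQUET
        LOCATION '/path/tablename'
    """)

    # Register the partition directories that df.write.parquet created.
    spark.sql("MSCK REPAIR TABLE tablename")

    # Impala keeps its own metadata cache, so an INVALIDATE METADATA tablename
    # (or REFRESH tablename) is typically needed on the Impala side as well.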

Collecting Parquet data from HDFS to local file system

梦想的初衷 submitted on 2019-12-11 04:03:31
Question: Given a Parquet dataset distributed on HDFS (a metadata file + many .parquet parts), how do I correctly merge the parts and collect the data onto the local file system? dfs -getmerge ... doesn't work - it merges the metadata with the actual parquet files. Answer 1: There is a way involving the Apache Spark APIs - it provides a solution, but a more efficient method without third-party tools may exist. spark> val parquetData = sqlContext.parquetFile("pathToMultipartParquetHDFS") spark> parquetData.repartition(1)
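A PySpark equivalent of the answer above, offered as a sketch with placeholder paths: byte-concatenating parquet parts (as getmerge does) never yields a valid parquet file, so repartition to a single part file on HDFS first and then copy that one file down:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.parquet("hdfs:///pathToMultipartParquetHDFS")
    df.coalesce(1).write.parquet("hdfs:///tmp/singlePartParquet")

    # Then pull the single part file to the local file system, e.g.:
    #   hdfs dfs -get /tmp/singlePartParquet/part-*.parquet /local/target/dir/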

External Table not getting updated from parquet files written by spark streaming

空扰寡人 submitted on 2019-12-11 01:26:33
Question: I am using Spark Streaming to write the aggregated output as parquet files to HDFS using SaveMode.Append. I have an external table created like: CREATE TABLE if not exists rolluptable USING org.apache.spark.sql.parquet OPTIONS ( path "hdfs:////" ); I was under the impression that with an external table the queries should also fetch the data from newly added parquet files. But it seems the newly written files are not being picked up. Dropping and recreating the table every time works fine
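One likely explanation, offered as a hedged sketch rather than a confirmed fix: Spark caches the file listing for parquet-backed tables, so queries keep serving the file set seen when the table was first read; refreshing the table forces a re-listing:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Force Spark to re-list the files backing the external table.
    spark.catalog.refreshTable("rolluptable")

    # or the SQL equivalent:
    spark.sql("REFRESH TABLE rolluptable")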

Error opening a parquet file on Amazon S3 using pyarrow

房东的猫 submitted on 2019-12-11 01:23:56
Question: I have this code, which is supposed to read a single column of data from a parquet file stored on S3: fs = s3fs.S3FileSystem() data_set = pq.ParquetDataset(f"s3://{bucket}/{key}", filesystem=fs) column_data = data_set.read(columns=[col_name]) and I get this exception: validate_schemas self.schema = self.pieces[0].get_metadata(open_file).schema IndexError: list index out of range I upgraded to the latest version of pyarrow but it did not help. Source: https://stackoverflow.com/questions/52057964/error
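The IndexError means self.pieces is empty, i.e. dataset discovery found no parquet pieces under that S3 path. A minimal sketch of a workaround, assuming the key points at a single parquet object: open the object yourself and hand the file handle to pyarrow, bypassing dataset discovery (bucket, key and col_name below are placeholders standing in for the question's values):

    import s3fs
    import pyarrow.parquet as pq

    bucket, key, col_name = "my-bucket", "path/to/file.parquet", "some_column"  # placeholders

    fs = s3fs.S3FileSystem()
    with fs.open(f"{bucket}/{key}", "rb") as f:
        table = pq.read_table(f, columns=[col_name])
    column_data = table.to_pandas()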

Is it possible to remove files from Spark Streaming folder?

别等时光非礼了梦想. submitted on 2019-12-10 23:58:59
Question: Spark 2.1; an ETL process converts files from source systems into parquet and puts small parquet files in folder1. Spark streaming on folder1 is working OK, but the parquet files in folder1 are too small for HDFS. We have to merge the small parquet files into bigger ones, but when I try to remove files from folder1, the spark streaming process raises an exception: 17/07/26 17:16:23 ERROR StreamExecution: Query [id = f29783ea-bdfb-4b59-a6f6-b77f79509a5a, runId = cbcce2b2-7d7b-4e31-a15a-7efed420f974] terminated with error java
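A hedged sketch of one commonly suggested mitigation, assuming the exception comes from the streaming query trying to re-read files that its checkpoint still references after they were deleted. Whether this setting exists and behaves this way in a given Spark build should be verified rather than assumed, and the cleaner pattern is usually to compact into a separate output folder and leave the streaming source folder untouched:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Ask Spark to skip files that are still listed but no longer present.
    spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")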

Generating parquet files - differences between R and Python

给你一囗甜甜゛ submitted on 2019-12-10 21:48:44
Question: We have generated a parquet file in Dask (Python) and with Drill (R, using the sergeant package). We have noticed a few issues: The Dask (i.e. fastparquet) output has _metadata and _common_metadata files, while the parquet file in R / Drill does not have these files and has parquet.crc files instead (which can be deleted). What is the difference between these parquet implementations? Answer 1: (Only answering point 1; please post separate questions to make them easier to answer.) _metadata
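For context, a small sketch of how the _metadata files arise on the Python side, under the assumption that the R/Drill writer simply never emits these optional summary files and that the .crc files are Hadoop-style checksums rather than part of the parquet format itself. It is fastparquet's multi-file "hive" layout that writes _metadata and _common_metadata; readers that work from the individual part-file footers do not need them:

    import pandas as pd
    from fastparquet import write

    df = pd.DataFrame({"x": range(10), "y": list("abcdefghij")})
    # file_scheme="hive" writes a directory of part files plus the optional
    # _metadata / _common_metadata summary files.
    write("out_dir", df, file_scheme="hive")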

Can pyarrow write multiple parquet files to a folder like fastparquet's file_scheme='hive' option?

天大地大妈咪最大 submitted on 2019-12-10 17:53:16
Question: I have a multi-million record SQL table that I'm planning to write out to many parquet files in a folder, using the pyarrow library. The data content seems too large to store in a single parquet file. However, I can't seem to find an API or parameter in the pyarrow library that allows me to specify something like: file_scheme="hive" as is supported by the fastparquet Python library. Here's my sample code: #!/usr/bin/python import pyodbc import pandas as pd import pyarrow as pa import
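A sketch of what is probably the closest pyarrow analogue to fastparquet's file_scheme="hive": pq.write_to_dataset, which writes a directory of part files and can partition on columns. The DataFrame, output path and partition column below are placeholders standing in for the data pulled from the SQL table:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Placeholder for one chunk of the SQL table read via pyodbc/pandas.
    df = pd.DataFrame({"year": [2018, 2019], "value": [1.0, 2.0]})

    table = pa.Table.from_pandas(df)
    # Writes out_dir/year=2018/..., out_dir/year=2019/... as separate parquet files.
    pq.write_to_dataset(table, root_path="out_dir", partition_cols=["year"])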