parquet

pySpark: java.lang.UnsupportedOperationException: Unimplemented type: StringType

泄露秘密 submitted on 2019-12-11 09:05:47
Question: While reading a group of parquet files written with inconsistent schemas, we ran into an issue with schema merging. After switching to manually specifying the schema, I get the following error. Any pointer would be helpful. java.lang.UnsupportedOperationException: Unimplemented type: StringType at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readDoubleBatch(VectorizedColumnReader.java:389) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch
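A minimal PySpark sketch of two commonly suggested workarounds, assuming the failing column was physically written as DOUBLE in some files while the manually specified schema declares it as StringType; the path, column name and schema below are placeholders, and neither option is guaranteed to cover every type combination:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.getOrCreate()
    my_schema = StructType([StructField("some_col", StringType(), True)])  # placeholder for the manually specified schema

    # Option 1: let Spark reconcile the differing file schemas itself
    # instead of forcing a single manual schema onto all files.
    df = spark.read.option("mergeSchema", "true").parquet("/path/to/parquet")

    # Option 2: fall back to the non-vectorized parquet reader, since the stack
    # trace points at VectorizedColumnReader as the code path that rejects
    # this DOUBLE-to-StringType request.
    spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
    df = spark.read.schema(my_schema).parquet("/path/to/parquet")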

How to read a parquet bytes object in python

蹲街弑〆低调 submitted on 2019-12-11 06:20:05
Question: I have a Python object which I know is a parquet file loaded into the object. (I do not have the possibility to actually read it from a file.) The object var_1 contains b'PAR1\x15\x....1\x00PAR1 and when I check the type: type(var_1) the result is bytes. Is there a way to read this, say into a pandas DataFrame? I have tried: 1) from fastparquet import ParquetFile pf = ParquetFile(var_1) and got: TypeError: a bytes-like object is required, not 'str' 2) import pyarrow.parquet as pq
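A minimal sketch of reading those bytes, assuming var_1 holds a complete parquet file: wrap the bytes in an in-memory file object, since pyarrow expects a path or file-like object rather than raw bytes (the open(...) line is only a stand-in for however var_1 was obtained):

    import io
    import pyarrow.parquet as pq

    var_1 = open("example.parquet", "rb").read()  # stand-in for the bytes object from the question

    buf = io.BytesIO(var_1)        # present the bytes as a seekable file-like object
    table = pq.read_table(buf)     # pyarrow reads from file-like objects
    df = table.to_pandas()         # pandas DataFrame, as asked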

Using parquet-mr in Scala without Spark

爱⌒轻易说出口 submitted on 2019-12-11 05:35:17
Question: I'm trying to read a .parquet file in Scala without using Spark. I found this SO post, but so far I have been unable to work out how to use the parquet-mr library to actually read from a file (including getting the schema). There are classes like RecordReader.java and RecordReaderImplementation.java (which extends RecordReader), but I'm struggling to understand how to use them from my Scala code. I'm very new to Scala and the Parquet format, but would like to accomplish this without using Spark. What

DataFrame.write.parquet - Parquet-file cannot be read by HIVE or Impala

情到浓时终转凉″ submitted on 2019-12-11 04:17:57
Question: I wrote a DataFrame with pySpark into HDFS with this command: df.repartition(col("year"))\ .write.option("maxRecordsPerFile", 1000000)\ .parquet('/path/tablename', mode='overwrite', partitionBy=["year"], compression='snappy') When I take a look into HDFS I can see that the files are properly laid out there. However, when I try to read the table with HIVE or Impala, the table cannot be found. What's going wrong here, am I missing something? Interestingly, df.write.format('parquet').saveAsTable
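For context: df.write.parquet only drops files into HDFS and registers nothing in the Hive metastore, which is why Hive and Impala see no table. A hedged sketch of one way to expose the files, assuming a Hive-enabled SparkSession; the table name, column list and location are placeholders based on the question's path:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Create an external table over the directory the DataFrame was written to.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS tablename (col1 STRING)
        PARTITIONED BY (year INT)
        STORED AS PARQUET
        LOCATION '/path/tablename'
    """)

    # Register the partition directories that df.write.parquet created.
    spark.sql("MSCK REPAIR TABLE tablename")

    # Impala keeps its own metadata cache, so an INVALIDATE METADATA tablename
    # (or REFRESH tablename) is typically needed on the Impala side as well.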

Collecting Parquet data from HDFS to local file system

梦想的初衷 submitted on 2019-12-11 04:03:31
Question: Given a Parquet dataset distributed on HDFS (a metadata file + many .parquet parts), how do I correctly merge the parts and collect the data onto the local file system? dfs -getmerge ... doesn't work - it merges the metadata with the actual parquet files. Answer 1: There is a way involving the Apache Spark APIs - it provides a solution, but a more efficient method without third-party tools may exist. spark> val parquetData = sqlContext.parquetFile("pathToMultipartParquetHDFS") spark> parquetData.repartition(1)
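A PySpark equivalent of the answer above, offered as a sketch with placeholder paths: byte-concatenating parquet parts (as getmerge does) never yields a valid parquet file, so repartition to a single part file on HDFS first and then copy that one file down:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.parquet("hdfs:///pathToMultipartParquetHDFS")
    df.coalesce(1).write.parquet("hdfs:///tmp/singlePartParquet")

    # Then pull the single part file to the local file system, e.g.:
    #   hdfs dfs -get /tmp/singlePartParquet/part-*.parquet /local/target/dir/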

External Table not getting updated from parquet files written by spark streaming

空扰寡人 submitted on 2019-12-11 01:26:33
Question: I am using Spark Streaming to write the aggregated output as parquet files to HDFS using SaveMode.Append. I have an external table created like: CREATE TABLE if not exists rolluptable USING org.apache.spark.sql.parquet OPTIONS ( path "hdfs:////" ); I was under the impression that with an external table the queries should also fetch the data from newly added parquet files. But it seems the newly written files are not being picked up. Dropping and recreating the table every time works fine
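One likely explanation, offered as a hedged sketch rather than a confirmed fix: Spark caches the file listing for parquet-backed tables, so queries keep serving the file set seen when the table was first read; refreshing the table forces a re-listing:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Force Spark to re-list the files backing the external table.
    spark.catalog.refreshTable("rolluptable")

    # or the SQL equivalent:
    spark.sql("REFRESH TABLE rolluptable")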

Error opening a parquet file on Amazon S3 using pyarrow

房东的猫 submitted on 2019-12-11 01:23:56
Question: I have this code, which is supposed to read a single column of data from a parquet file stored on S3: fs = s3fs.S3FileSystem() data_set = pq.ParquetDataset(f"s3://{bucket}/{key}", filesystem=fs) column_data = data_set.read(columns=[col_name]) and I get this exception: validate_schemas self.schema = self.pieces[0].get_metadata(open_file).schema IndexError: list index out of range I upgraded to the latest version of pyarrow but it did not help. Source: https://stackoverflow.com/questions/52057964/error
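The IndexError means self.pieces is empty, i.e. dataset discovery found no parquet pieces under that S3 path. A minimal sketch of a workaround, assuming the key points at a single parquet object: open the object yourself and hand the file handle to pyarrow, bypassing dataset discovery (bucket, key and col_name below are placeholders standing in for the question's values):

    import s3fs
    import pyarrow.parquet as pq

    bucket, key, col_name = "my-bucket", "path/to/file.parquet", "some_column"  # placeholders

    fs = s3fs.S3FileSystem()
    with fs.open(f"{bucket}/{key}", "rb") as f:
        table = pq.read_table(f, columns=[col_name])
    column_data = table.to_pandas()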

Is it possible to remove files from Spark Streaming folder?

别等时光非礼了梦想. submitted on 2019-12-10 23:58:59
Question: Spark 2.1; an ETL process converts files from source systems into parquet and puts small parquet files in folder1. Spark streaming on folder1 is working OK, but the parquet files in folder1 are too small for HDFS. We have to merge the small parquet files into bigger ones, but when I try to remove files from folder1, the spark streaming process raises an exception: 17/07/26 17:16:23 ERROR StreamExecution: Query [id = f29783ea-bdfb-4b59-a6f6-b77f79509a5a, runId = cbcce2b2-7d7b-4e31-a15a-7efed420f974] terminated with error java
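A hedged sketch of one commonly suggested mitigation, assuming the exception comes from the streaming query trying to re-read files that its checkpoint still references after they were deleted. Whether this setting exists and behaves this way in a given Spark build should be verified rather than assumed, and the cleaner pattern is usually to compact into a separate output folder and leave the streaming source folder untouched:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Ask Spark to skip files that are still listed but no longer present.
    spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")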

Generating parquet files - differences between R and Python

给你一囗甜甜゛ submitted on 2019-12-10 21:48:44
Question: We have generated a parquet file in Dask (Python) and with Drill (R, using the sergeant package). We have noticed a few issues: The Dask (i.e. fastparquet) output has _metadata and _common_metadata files, while the parquet file in R / Drill does not have these files and has parquet.crc files instead (which can be deleted). What is the difference between these parquet implementations? Answer 1: (Only answering point 1; please post separate questions to make them easier to answer.) _metadata
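For context, a small sketch of how the _metadata files arise on the Python side, under the assumption that the R/Drill writer simply never emits these optional summary files and that the .crc files are Hadoop-style checksums rather than part of the parquet format itself. It is fastparquet's multi-file "hive" layout that writes _metadata and _common_metadata; readers that work from the individual part-file footers do not need them:

    import pandas as pd
    from fastparquet import write

    df = pd.DataFrame({"x": range(10), "y": list("abcdefghij")})
    # file_scheme="hive" writes a directory of part files plus the optional
    # _metadata / _common_metadata summary files.
    write("out_dir", df, file_scheme="hive")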

Can pyarrow write multiple parquet files to a folder like fastparquet's file_scheme='hive' option?

天大地大妈咪最大 submitted on 2019-12-10 17:53:16
Question: I have a multi-million record SQL table that I'm planning to write out to many parquet files in a folder, using the pyarrow library. The data content seems too large to store in a single parquet file. However, I can't seem to find an API or parameter in the pyarrow library that allows me to specify something like: file_scheme="hive" as is supported by the fastparquet Python library. Here's my sample code: #!/usr/bin/python import pyodbc import pandas as pd import pyarrow as pa import
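A sketch of what is probably the closest pyarrow analogue to fastparquet's file_scheme="hive": pq.write_to_dataset, which writes a directory of part files and can partition on columns. The DataFrame, output path and partition column below are placeholders standing in for the data pulled from the SQL table:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Placeholder for one chunk of the SQL table read via pyodbc/pandas.
    df = pd.DataFrame({"year": [2018, 2019], "value": [1.0, 2.0]})

    table = pa.Table.from_pandas(df)
    # Writes out_dir/year=2018/..., out_dir/year=2019/... as separate parquet files.
    pq.write_to_dataset(table, root_path="out_dir", partition_cols=["year"])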