parquet

Tensorflow Dataset API: input pipeline with parquet files

三世轮回 submitted on 2019-12-12 11:18:27
Question: I am trying to design an input pipeline with the Dataset API. I am working with parquet files. What is a good way to add them to my pipeline?

Answer 1: We have released Petastorm, an open source library that allows you to use Apache Parquet files directly via the TensorFlow Dataset API. Here is a small example:

    with Reader('hdfs://.../some/hdfs/path') as reader:
        dataset = make_petastorm_dataset(reader)
        iterator = dataset.make_one_shot_iterator()
        tensor = iterator.get_next()
        with tf.Session() as sess:
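
The answer's example is cut off above. Below is a minimal sketch of how such a pipeline might be wired end to end; it uses the TF 1.x Session style shown in the answer, the petastorm import paths are assumptions that should be checked against the installed petastorm version, and the HDFS path is the answer's placeholder, not a real location.

    # Sketch only: completes the truncated Petastorm example (TF 1.x style).
    # Import locations and the HDFS path are assumptions, not from the answer.
    import tensorflow as tf
    from petastorm.reader import Reader
    from petastorm.tf_utils import make_petastorm_dataset

    with Reader('hdfs://.../some/hdfs/path') as reader:   # placeholder path
        dataset = make_petastorm_dataset(reader)
        iterator = dataset.make_one_shot_iterator()
        tensor = iterator.get_next()
        with tf.Session() as sess:
            # Each run() call pulls one batch of rows decoded from Parquet.
            sample = sess.run(tensor)
            print(sample)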

Is it better for Spark to select from hive or select from file

无人久伴 submitted on 2019-12-12 08:42:03
Question: I was just wondering what people's thoughts were on reading from Hive vs reading from a .csv file, a .txt file, an .ORC file, or a .parquet file. Assuming the underlying Hive table is an external table that has the same file format, would you rather read from a Hive table or from the underlying file itself, and why? Mike

Answer 1: tl;dr: I would read it straight from the parquet files. I am using Spark 1.5.2 and Hive 1.2.1. For a 5-million-row x 100-column table, some timings I've recorded are
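
The timings in the answer are cut off above, but the comparison itself is easy to reproduce with a small harness. The sketch below is in PySpark with the modern SparkSession API rather than the answer's Spark 1.5 setup, and the table name and path are placeholders, not the poster's.

    # Sketch: time a full scan via the Hive metastore vs. via the Parquet path.
    # "db.events" and the warehouse path are placeholders.
    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    def timed_count(df, label):
        start = time.time()
        n = df.count()                       # force a full scan
        print(label, n, "rows in", round(time.time() - start, 2), "s")

    # Route 1: go through the Hive table definition.
    timed_count(spark.table("db.events"), "via Hive table:")

    # Route 2: read the underlying Parquet files directly.
    timed_count(spark.read.parquet("/warehouse/db.db/events"), "via Parquet path:")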

Spark 1.6 on EMR writing to S3 as Parquet hangs and fails

拥有回忆 submitted on 2019-12-12 07:38:06
Question: I'm creating an uber-jar Spark application that I'm submitting with spark-submit to an EMR 4.3 cluster. I'm provisioning 4 r3.xlarge instances, one to be the master and the other three as the cores. I have Hadoop 2.7.1, Ganglia 3.7.2, Spark 1.6, and Hive 1.0.0 pre-installed from the console. I'm running the following command:

    spark-submit \
    --deploy-mode cluster \
    --executor-memory 4g \
    --executor-cores 2 \
    --num-executors 4 --driver-memory 4g --driver-cores 2 --conf "spark.driver.maxResultSize=2g" --conf
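
The command above ends mid-flag, so the sketch below is not the poster's code. It only illustrates one write-side setting that was commonly tried for slow or hanging Parquet writes to S3 in the Spark 1.6 / Hadoop 2.7 era: the FileOutputCommitter "algorithm version 2", which skips the second rename pass that is especially expensive on S3. Paths are placeholders; verify the setting against your own Hadoop/Spark build.

    # Sketch only (assumed mitigation, not from the question): enable
    # mapreduce.fileoutputcommitter.algorithm.version=2 and write Parquet to S3.
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    conf = SparkConf().set(
        "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)

    # Trivial DataFrame so the sketch is self-contained.
    df = sqlContext.range(0, 1000)
    df.write.mode("overwrite").parquet("s3://my-bucket/output/")  # placeholder path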

sqoop create impala parquet table

情到浓时终转凉″ submitted on 2019-12-12 03:56:36
Question: I'm relatively new to the process of sqooping, so pardon any ignorance. I have been trying to sqoop a table from a data source as a parquet file and create an Impala table (also as parquet) into which I will insert the sqooped data. The code runs without an issue, but when I try to select a couple of rows for testing I get the error: .../EWT_CALL_PROF_DIM_SQOOP/ec2fe2b0-c9fa-4ef9-91f8-46cf0e12e272.parquet' has an incompatible Parquet schema for column 'dru_id.test_ewt_call_prof_dim_parquet.call_prof
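
An "incompatible Parquet schema" error from Impala usually means the column types written by Sqoop do not match the types in the Impala table's DDL. One quick diagnostic is to print the schema of one of the Sqoop-produced files and compare it against the table definition. The sketch below uses pyarrow, which is not part of the original question, and the file path is a placeholder.

    # Diagnostic sketch (pyarrow is an assumption): dump the schema that Sqoop
    # actually wrote so it can be compared with the Impala CREATE TABLE statement.
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("/path/to/EWT_CALL_PROF_DIM_SQOOP/part-file.parquet")  # placeholder
    print(pf.schema_arrow)   # Arrow-level (logical) view of the schema
    print(pf.schema)         # Parquet-level (physical) schema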

AWS Glue Bookmark produces duplicates

心不动则不痛 submitted on 2019-12-11 18:33:16
Question: I am submitting a Python script (PySpark, actually) to a Glue Job to process parquet files and extract some analytics from this data source. These parquet files live in an S3 folder and continuously grow with new data. I was happy with the bookmarking logic provided by AWS Glue because it helps a lot: it basically allows us to process only new data without reprocessing data that has already been processed. Unfortunately, in this scenario I notice that duplicates are produced each time, and it looks
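
The question is cut off above. For context, Glue bookmarks only track sources that are read with a transformation_ctx, and the bookmark state is only persisted when job.commit() runs; if either is missing, every run re-reads everything. The skeleton below is a generic sketch, not the poster's script, and the bucket path is a placeholder.

    # Generic Glue-job skeleton (sketch, not the poster's code). Bookmarks need:
    #   1) the job run with --job-bookmark-option job-bookmark-enable,
    #   2) a transformation_ctx on the source read,
    #   3) job.commit() at the end so the bookmark state is saved.
    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext.getOrCreate())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    frame = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://my-bucket/parquet-input/"]},  # placeholder
        format="parquet",
        transformation_ctx="source_ctx",   # required for bookmarking
    )

    # ... analytics / transformations on `frame` would go here ...

    job.commit()   # persists the bookmark so the next run skips processed files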

Cross read parquet files between R and Python

自闭症网瘾萝莉.ら submitted on 2019-12-11 17:06:23
Question: We have generated parquet files, one in Dask (Python) and another with R Drill (using the sergeant package). They use different implementations of Parquet; see my other parquet question. We are not able to cross-read the files (Python can't read the R file and vice versa). When reading the Python parquet file in the R environment we receive the following error: system error: IllegalStateException: UTF8 can only annotate binary field. When reading the R/Drill parquet file in Dask we get
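
Cross-compatibility problems like this are often specific to the Parquet engine that wrote or reads the file, so a practical first check is to try both Python engines against the Drill-produced file and see which one fails. The sketch below uses Dask; the paths are placeholders, not the files from the question.

    # Sketch: try both Parquet engines available to Dask against the Drill file,
    # to see whether the UTF8-annotation issue is engine-specific. Paths are placeholders.
    import dask.dataframe as dd

    for engine in ("pyarrow", "fastparquet"):
        try:
            df = dd.read_parquet("/data/from_drill.parquet", engine=engine)
            print(engine, "ok:", df.dtypes.to_dict())
        except Exception as exc:   # diagnostic sketch: report and keep going
            print(engine, "failed:", exc)

    # Writing from the Python side with an explicit engine, for Drill/R to read back:
    # dd.read_csv("/data/source.csv").to_parquet("/data/for_drill", engine="pyarrow")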

sparklyr spark_read_parquet Reading String Fields as Lists

北城以北 submitted on 2019-12-11 17:01:00
Question: I have a number of Hive files in parquet format that contain both string and double columns. I can read most of them into a Spark Data Frame with sparklyr using the syntax below:

    spark_read_parquet(sc, name = "name", path = "path", memory = FALSE)

However, I have one file where all of the string values get converted to unrecognizable lists that look like this when collected into an R Data Frame and printed:

    s_df <- spark_read_parquet(sc, name = "s_df", path = "hdfs:/

Effectively merge big parquet files

早过忘川 submitted on 2019-12-11 16:05:18
Question: I'm using parquet-tools to merge parquet files, but it seems that parquet-tools needs an amount of memory as big as the merged file. Do we have other ways, or configurable options in parquet-tools, to use memory more effectively? I ask because I run the merge job as a map task in a Hadoop environment, and the container gets killed every time because it uses more memory than it is given. Thank you.

Answer 1: I wouldn't recommend using parquet-tools merge, since it just places row groups one after another, so
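
If the immediate problem is memory rather than row-group layout, one alternative is to stream the inputs through a single writer so that only one row group is held in memory at a time. The sketch below uses pyarrow, which is not mentioned in the question, and the paths are placeholders.

    # Sketch of a low-memory concatenation: copy inputs one row group at a time
    # through a single ParquetWriter instead of loading whole files.
    import glob
    import pyarrow.parquet as pq

    inputs = sorted(glob.glob("/data/parts/*.parquet"))   # placeholder pattern
    first = pq.ParquetFile(inputs[0])

    with pq.ParquetWriter("/data/merged.parquet", first.schema_arrow) as writer:
        for path in inputs:
            pf = pq.ParquetFile(path)
            for i in range(pf.num_row_groups):
                writer.write_table(pf.read_row_group(i))  # one row group in memory at a time

Note this carries the same caveat as the answer: copying row groups as-is does not consolidate them, so truly merging many small row groups still requires rewriting the data, for example with a Spark job that repartitions before writing.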

Apache Beam Java SDK SparkRunner write to parquet error

你说的曾经没有我的故事 submitted on 2019-12-11 15:45:57
Question: I'm using Apache Beam with Java. I'm trying to read a CSV file and write it to parquet format using the SparkRunner on a pre-deployed Spark environment, using local mode. Everything worked fine with the DirectRunner, but the SparkRunner simply won't work. I'm using the Maven Shade plugin to build a fat jar. The code is as below:

Java:

    public class ImportCSVToParquet{
    -- omitted
    File csv = new File(filePath);
    PCollection<String> vals = pipeline.apply(TextIO.read().from(filePath));
    String parquetFilename = csv

Slow or incomplete saveAsParquetFile from EMR Spark to S3

若如初见. submitted on 2019-12-11 12:18:54
Question: I have a piece of code that creates a DataFrame and persists it to S3. The code below creates a DataFrame of 1000 rows and 100 columns, populated by math.Random. I'm running this on a cluster with 4 x r3.8xlarge worker nodes and configuring plenty of memory. I've tried with the maximum number of executors, and with one executor per node.

    // create some random data for performance and scalability testing
    val df = sqlContext.range(0,1000).map(x => Row.fromSeq((1 to 100).map(y => math.Random)))
    df
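
The poster's code is Scala and is cut off above. For reference only, a PySpark analogue of the same synthetic workload (1000 rows of 100 random doubles written to S3 as Parquet) might look like the sketch below; it uses the modern DataFrame writer API rather than the old saveAsParquetFile, and the S3 path is a placeholder.

    # Sketch, not the poster's code: 1000 rows x 100 columns of random doubles,
    # written to S3 as Parquet. The bucket path is a placeholder.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.range(0, 1000).select(
        *[F.rand().alias("c%d" % i) for i in range(100)]
    )

    df.write.mode("overwrite").parquet("s3://my-bucket/perf-test/")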