parquet

Tensorflow Dataset API: input pipeline with parquet files

三世轮回 submitted on 2019-12-12 11:18:27
Question: I am trying to design an input pipeline with the Dataset API. I am working with parquet files. What is a good way to add them to my pipeline?

Answer 1: We have released Petastorm, an open source library that allows you to use Apache Parquet files directly via the TensorFlow Dataset API. Here is a small example:

    with Reader('hdfs://.../some/hdfs/path') as reader:
        dataset = make_petastorm_dataset(reader)
        iterator = dataset.make_one_shot_iterator()
        tensor = iterator.get_next()
        with tf.Session() as sess:
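
The answer's example is cut off above. Below is a minimal sketch of how such a pipeline might be wired end to end; it uses the TF 1.x Session style shown in the answer, the petastorm import paths are assumptions that should be checked against the installed petastorm version, and the HDFS path is the answer's placeholder, not a real location.

    # Sketch only: completes the truncated Petastorm example (TF 1.x style).
    # Import locations and the HDFS path are assumptions, not from the answer.
    import tensorflow as tf
    from petastorm.reader import Reader
    from petastorm.tf_utils import make_petastorm_dataset

    with Reader('hdfs://.../some/hdfs/path') as reader:   # placeholder path
        dataset = make_petastorm_dataset(reader)
        iterator = dataset.make_one_shot_iterator()
        tensor = iterator.get_next()
        with tf.Session() as sess:
            # Each run() call pulls one batch of rows decoded from Parquet.
            sample = sess.run(tensor)
            print(sample)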

Is it better for Spark to select from hive or select from file

无人久伴 submitted on 2019-12-12 08:42:03
Question: I was just wondering what people's thoughts were on reading from Hive vs reading from a .csv file, a .txt file, an .ORC file, or a .parquet file. Assuming the underlying Hive table is an external table that has the same file format, would you rather read from a Hive table or from the underlying file itself, and why? Mike

Answer 1: tl;dr: I would read it straight from the parquet files. I am using Spark 1.5.2 and Hive 1.2.1. For a 5-million-row x 100-column table, some timings I've recorded are
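
The timings in the answer are cut off above, but the comparison itself is easy to reproduce with a small harness. The sketch below is in PySpark with the modern SparkSession API rather than the answer's Spark 1.5 setup, and the table name and path are placeholders, not the poster's.

    # Sketch: time a full scan via the Hive metastore vs. via the Parquet path.
    # "db.events" and the warehouse path are placeholders.
    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    def timed_count(df, label):
        start = time.time()
        n = df.count()                       # force a full scan
        print(label, n, "rows in", round(time.time() - start, 2), "s")

    # Route 1: go through the Hive table definition.
    timed_count(spark.table("db.events"), "via Hive table:")

    # Route 2: read the underlying Parquet files directly.
    timed_count(spark.read.parquet("/warehouse/db.db/events"), "via Parquet path:")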

Spark 1.6 on EMR writing to S3 as Parquet hangs and fails

拥有回忆 submitted on 2019-12-12 07:38:06
Question: I'm creating an uber-jar Spark application that I'm submitting with spark-submit to an EMR 4.3 cluster. I'm provisioning 4 r3.xlarge instances, one to be the master and the other three as the cores. I have Hadoop 2.7.1, Ganglia 3.7.2, Spark 1.6, and Hive 1.0.0 pre-installed from the console. I'm running the following command:

    spark-submit \
    --deploy-mode cluster \
    --executor-memory 4g \
    --executor-cores 2 \
    --num-executors 4 --driver-memory 4g --driver-cores 2 --conf "spark.driver.maxResultSize=2g" --conf
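
The command above ends mid-flag, so the sketch below is not the poster's code. It only illustrates one write-side setting that was commonly tried for slow or hanging Parquet writes to S3 in the Spark 1.6 / Hadoop 2.7 era: the FileOutputCommitter "algorithm version 2", which skips the second rename pass that is especially expensive on S3. Paths are placeholders; verify the setting against your own Hadoop/Spark build.

    # Sketch only (assumed mitigation, not from the question): enable
    # mapreduce.fileoutputcommitter.algorithm.version=2 and write Parquet to S3.
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    conf = SparkConf().set(
        "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)

    # Trivial DataFrame so the sketch is self-contained.
    df = sqlContext.range(0, 1000)
    df.write.mode("overwrite").parquet("s3://my-bucket/output/")  # placeholder path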

sqoop create impala parquet table

情到浓时终转凉″ submitted on 2019-12-12 03:56:36
Question: I'm relatively new to the process of sqooping, so pardon any ignorance. I have been trying to sqoop a table from a data source as a parquet file and create an Impala table (also as parquet) into which I will insert the sqooped data. The code runs without an issue, but when I try to select a couple of rows for testing I get the error: .../EWT_CALL_PROF_DIM_SQOOP/ec2fe2b0-c9fa-4ef9-91f8-46cf0e12e272.parquet' has an incompatible Parquet schema for column 'dru_id.test_ewt_call_prof_dim_parquet.call_prof
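
An "incompatible Parquet schema" error from Impala usually means the column types written by Sqoop do not match the types in the Impala table's DDL. One quick diagnostic is to print the schema of one of the Sqoop-produced files and compare it against the table definition. The sketch below uses pyarrow, which is not part of the original question, and the file path is a placeholder.

    # Diagnostic sketch (pyarrow is an assumption): dump the schema that Sqoop
    # actually wrote so it can be compared with the Impala CREATE TABLE statement.
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("/path/to/EWT_CALL_PROF_DIM_SQOOP/part-file.parquet")  # placeholder
    print(pf.schema_arrow)   # Arrow-level (logical) view of the schema
    print(pf.schema)         # Parquet-level (physical) schema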

AWS Glue Bookmark produces duplicates

心不动则不痛 submitted on 2019-12-11 18:33:16
Question: I am submitting a Python script (PySpark, actually) to a Glue Job to process parquet files and extract some analytics from this data source. These parquet files live in an S3 folder and continuously grow with new data. I was happy with the bookmarking logic provided by AWS Glue because it helps a lot: it basically allows us to process only new data without reprocessing data that has already been processed. Unfortunately, in this scenario I notice that duplicates are produced each time, and it looks
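
The question is cut off above. For context, Glue bookmarks only track sources that are read with a transformation_ctx, and the bookmark state is only persisted when job.commit() runs; if either is missing, every run re-reads everything. The skeleton below is a generic sketch, not the poster's script, and the bucket path is a placeholder.

    # Generic Glue-job skeleton (sketch, not the poster's code). Bookmarks need:
    #   1) the job run with --job-bookmark-option job-bookmark-enable,
    #   2) a transformation_ctx on the source read,
    #   3) job.commit() at the end so the bookmark state is saved.
    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext.getOrCreate())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    frame = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://my-bucket/parquet-input/"]},  # placeholder
        format="parquet",
        transformation_ctx="source_ctx",   # required for bookmarking
    )

    # ... analytics / transformations on `frame` would go here ...

    job.commit()   # persists the bookmark so the next run skips processed files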

Cross read parquet files between R and Python

自闭症网瘾萝莉.ら submitted on 2019-12-11 17:06:23
Question: We have generated parquet files, one in Dask (Python) and another with R Drill (using the sergeant package). They use different implementations of Parquet; see my other parquet question. We are not able to cross-read the files (Python can't read the R file and vice versa). When reading the Python parquet file in the R environment we receive the following error: system error: IllegalStateException: UTF8 can only annotate binary field. When reading the R/Drill parquet file in Dask we get
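
Cross-compatibility problems like this are often specific to the Parquet engine that wrote or reads the file, so a practical first check is to try both Python engines against the Drill-produced file and see which one fails. The sketch below uses Dask; the paths are placeholders, not the files from the question.

    # Sketch: try both Parquet engines available to Dask against the Drill file,
    # to see whether the UTF8-annotation issue is engine-specific. Paths are placeholders.
    import dask.dataframe as dd

    for engine in ("pyarrow", "fastparquet"):
        try:
            df = dd.read_parquet("/data/from_drill.parquet", engine=engine)
            print(engine, "ok:", df.dtypes.to_dict())
        except Exception as exc:   # diagnostic sketch: report and keep going
            print(engine, "failed:", exc)

    # Writing from the Python side with an explicit engine, for Drill/R to read back:
    # dd.read_csv("/data/source.csv").to_parquet("/data/for_drill", engine="pyarrow")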

sparklyr spark_read_parquet Reading String Fields as Lists

北城以北 submitted on 2019-12-11 17:01:00
Question: I have a number of Hive files in parquet format that contain both string and double columns. I can read most of them into a Spark Data Frame with sparklyr using the syntax below:

    spark_read_parquet(sc, name = "name", path = "path", memory = FALSE)

However, I have one file where all of the string values get converted to unrecognizable lists that look like this when collected into an R Data Frame and printed:

    s_df <- spark_read_parquet(sc, name = "s_df", path = "hdfs:/

Effectively merge big parquet files

早过忘川 submitted on 2019-12-11 16:05:18
Question: I'm using parquet-tools to merge parquet files, but it seems that parquet-tools needs an amount of memory as big as the merged file. Do we have other ways, or configurable options in parquet-tools, to use memory more effectively? I ask because I run the merge job as a map task in a Hadoop environment, and the container gets killed every time because it uses more memory than it is given. Thank you.

Answer 1: I wouldn't recommend using parquet-tools merge, since it just places row groups one after another, so
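
If the immediate problem is memory rather than row-group layout, one alternative is to stream the inputs through a single writer so that only one row group is held in memory at a time. The sketch below uses pyarrow, which is not mentioned in the question, and the paths are placeholders.

    # Sketch of a low-memory concatenation: copy inputs one row group at a time
    # through a single ParquetWriter instead of loading whole files.
    import glob
    import pyarrow.parquet as pq

    inputs = sorted(glob.glob("/data/parts/*.parquet"))   # placeholder pattern
    first = pq.ParquetFile(inputs[0])

    with pq.ParquetWriter("/data/merged.parquet", first.schema_arrow) as writer:
        for path in inputs:
            pf = pq.ParquetFile(path)
            for i in range(pf.num_row_groups):
                writer.write_table(pf.read_row_group(i))  # one row group in memory at a time

Note this carries the same caveat as the answer: copying row groups as-is does not consolidate them, so truly merging many small row groups still requires rewriting the data, for example with a Spark job that repartitions before writing.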

Apache Beam Java SDK SparkRunner write to parquet error

你说的曾经没有我的故事 submitted on 2019-12-11 15:45:57
Question: I'm using Apache Beam with Java. I'm trying to read a CSV file and write it to parquet format using the SparkRunner on a pre-deployed Spark environment, using local mode. Everything worked fine with the DirectRunner, but the SparkRunner simply won't work. I'm using the Maven Shade plugin to build a fat jar. The code is as below:

Java:

    public class ImportCSVToParquet{
    -- omitted
    File csv = new File(filePath);
    PCollection<String> vals = pipeline.apply(TextIO.read().from(filePath));
    String parquetFilename = csv

Slow or incomplete saveAsParquetFile from EMR Spark to S3

若如初见. submitted on 2019-12-11 12:18:54
Question: I have a piece of code that creates a DataFrame and persists it to S3. The code below creates a DataFrame of 1000 rows and 100 columns, populated by math.Random. I'm running this on a cluster with 4 x r3.8xlarge worker nodes and configuring plenty of memory. I've tried with the maximum number of executors, and with one executor per node.

    // create some random data for performance and scalability testing
    val df = sqlContext.range(0,1000).map(x => Row.fromSeq((1 to 100).map(y => math.Random)))
    df
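
The poster's code is Scala and is cut off above. For reference only, a PySpark analogue of the same synthetic workload (1000 rows of 100 random doubles written to S3 as Parquet) might look like the sketch below; it uses the modern DataFrame writer API rather than the old saveAsParquetFile, and the S3 path is a placeholder.

    # Sketch, not the poster's code: 1000 rows x 100 columns of random doubles,
    # written to S3 as Parquet. The bucket path is a placeholder.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.range(0, 1000).select(
        *[F.rand().alias("c%d" % i) for i in range(100)]
    )

    df.write.mode("overwrite").parquet("s3://my-bucket/perf-test/")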