parquet

Read a few Parquet files at the same time in Spark

拟墨画扇 submitted on 2019-12-03 13:48:44
Question: I can read a few JSON files at the same time using * (star): sqlContext.jsonFile('/path/to/dir/*.json') Is there any way to do the same thing for Parquet? The star doesn't work. Answer 1: See this issue on the Spark JIRA. It is supported from 1.4 onwards. Without upgrading to 1.4, you could either point at the top-level directory: sqlContext.parquetFile('/path/to/dir/') which will load all files in the directory. Alternatively, you could use the HDFS API to find the files you want, and pass them to
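For reference, a minimal sketch of the two approaches described in the answer, assuming a SQLContext named sqlContext as in the question and a hypothetical /path/to/dir layout:

    // Spark 1.4+: read.parquet accepts glob patterns and multiple explicit paths
    val byGlob  = sqlContext.read.parquet("/path/to/dir/*.parquet")
    val byPaths = sqlContext.read.parquet("/path/to/dir/a.parquet", "/path/to/dir/b.parquet")

    // Pre-1.4 workaround: point the loader at the top-level directory instead
    val byDir = sqlContext.parquetFile("/path/to/dir/")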

How to convert a 500GB SQL table into Apache Parquet?

断了今生、忘了曾经 submitted on 2019-12-03 13:10:10
Perhaps this is well documented, but I am getting very confused about how to do this (there are many Apache tools). When I create an SQL table, I use the following commands: CREATE TABLE table_name( column1 datatype, column2 datatype, column3 datatype, ..... columnN datatype, PRIMARY KEY( one or more columns ) ); How does one convert this existing table into Parquet? Is the file written to disk? If the original data is several GB, how long does one have to wait? Could I format the original raw data into Parquet format instead? Apache Spark can be used to do this: 1. load your table
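As a rough illustration of that Spark route, here is a minimal sketch that loads the table over JDBC and writes it out as Parquet; the connection string, credentials, and output path are placeholders, and a Spark 2.x SparkSession is assumed:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("sql-to-parquet").getOrCreate()

    // Load the existing SQL table over JDBC (placeholder connection details)
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")
      .option("dbtable", "table_name")
      .option("user", "user")
      .option("password", "password")
      .load()

    // Write the result to disk as Parquet files; Spark streams partitions through,
    // so the source table does not need to fit in memory.
    df.write.mode("overwrite").parquet("/path/to/table_name.parquet")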

How to save a partitioned parquet file in Spark 2.1?

≡放荡痞女 submitted on 2019-12-03 12:48:33
I am trying to test how to write data to HDFS 2.7 using Spark 2.1. My data is a simple sequence of dummy values, and the output should be partitioned by the attributes id and key. // Simple case class to cast the data case class SimpleTest(id:String, value1:Int, value2:Float, key:Int) // Actual data to be stored val testData = Seq( SimpleTest("test", 12, 13.5.toFloat, 1), SimpleTest("test", 12, 13.5.toFloat, 2), SimpleTest("test", 12, 13.5.toFloat, 3), SimpleTest("simple", 12, 13.5.toFloat, 1), SimpleTest("simple", 12, 13.5.toFloat, 2), SimpleTest("simple", 12, 13.5.toFloat, 3) ) // Spark's
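The truncated snippet would typically continue with a partitioned write along these lines; this is only a sketch, assuming a SparkSession named spark and the testData sequence above, with a placeholder HDFS path:

    import spark.implicits._

    val df = testData.toDF()

    df.write
      .partitionBy("id", "key")            // creates id=.../key=... subdirectories
      .mode("overwrite")
      .parquet("hdfs:///tmp/simple_test")  // placeholder output path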

Parquet without Hadoop?

余生颓废 submitted on 2019-12-03 10:10:58
I want to use Parquet in one of my projects as columnar storage, but I don't want to depend on the Hadoop/HDFS libs. Is it possible to use Parquet outside of HDFS? Or what is the minimum dependency? Investigating the same question, I found that apparently it's not possible for the moment. I found this git issue, which proposes decoupling Parquet from the Hadoop API. Apparently it has not been done yet. In the Apache JIRA I found an issue, which asks for a way to read a Parquet file outside Hadoop. It is unresolved at the time of writing. EDIT: Issues are not tracked on GitHub anymore (first link

Apache Spark Structured Streaming (DataStreamWriter) write to Hive table

Anonymous (unverified) submitted on 2019-12-03 10:03:01
Question: I am looking to use Spark Structured Streaming to read data from Kafka, process it, and write it to a Hive table. val spark = SparkSession .builder .appName("Kafka Test") .config("spark.sql.streaming.metricsEnabled", true) .config("spark.streaming.backpressure.enabled", "true") .enableHiveSupport() .getOrCreate() val events = spark .readStream .format("kafka") .option("kafka.bootstrap.servers", "xxxxxxx") .option("startingOffsets", "latest") .option("subscribe", "yyyyyy") .load val data = events.select(.....some columns...) data.writeStream
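One commonly suggested workaround for the truncated writeStream call above is the Parquet file sink pointed at the table's warehouse directory; the sketch below assumes that approach, with placeholder paths and trigger interval:

    import org.apache.spark.sql.streaming.Trigger

    val query = data.writeStream
      .format("parquet")
      .option("path", "/user/hive/warehouse/events")            // placeholder table location
      .option("checkpointLocation", "/tmp/checkpoints/events")  // placeholder checkpoint dir
      .trigger(Trigger.ProcessingTime("1 minute"))
      .start()

    query.awaitTermination()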

Spark 1.6 on EMR writing to S3 as Parquet hangs and fails

有些话、适合烂在心里 submitted on 2019-12-03 09:14:12
I'm creating an uber-jar Spark application that I'm submitting with spark-submit to an EMR 4.3 cluster. I'm provisioning 4 r3.xlarge instances, one as the master and the other three as core nodes. I have Hadoop 2.7.1, Ganglia 3.7.2, Spark 1.6, and Hive 1.0.0 pre-installed from the console. I'm running the following command: spark-submit \ --deploy-mode cluster \ --executor-memory 4g \ --executor-cores 2 \ --num-executors 4 --driver-memory 4g --driver-cores 2 --conf "spark.driver.maxResultSize=2g" --conf "spark.hadoop.spark.sql.parquet.output.committer.class=org.apache.spark.sql.parquet
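For context, the Parquet-to-S3 write at the heart of such a job is usually no more than the following; a minimal sketch using the Spark 1.6 API, with a hypothetical input path and S3 bucket:

    // Read some input and write it back out to S3 as Parquet
    val df = sqlContext.read.parquet("hdfs:///input/data")  // hypothetical input
    df.coalesce(16)                                          // fewer, larger output files
      .write
      .mode("overwrite")
      .parquet("s3://my-bucket/output/")                     // hypothetical destination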

How to deal with tasks running too long (compared to others in the job) in yarn-client?

久未见 submitted on 2019-12-03 09:08:52
Question: We use a Spark cluster in yarn-client mode for several business calculations, but sometimes a task runs for too long: We don't set a timeout, but I think the default timeout for a Spark task is not as long as this (1.7h). Can anyone give me an idea to work around this issue? Answer 1: There is no way for Spark to kill its tasks if they are taking too long. But I figured out a way to handle this using speculation. This means that if one or more tasks are running slowly in a stage, they will be re-launched. spark
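A minimal sketch of enabling speculation as the answer describes; the values shown are Spark's documented defaults and are only illustrative:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.speculation", "true")            // re-launch slow tasks speculatively
      .set("spark.speculation.interval", "100ms")  // how often to check for stragglers
      .set("spark.speculation.multiplier", "1.5")  // "slow" = 1.5x the median task time
      .set("spark.speculation.quantile", "0.75")   // fraction of tasks finished before checking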

Apache Spark Parquet: Cannot build an empty group

Anonymous (unverified) submitted on 2019-12-03 08:59:04
Question: I use Apache Spark 2.1.1 (I used 2.1.0 and it was the same; I switched today). I have a dataset: root |-- muons: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- reco::Candidate: struct (nullable = true) | | |-- qx3_: integer (nullable = true) | | |-- pt_: float (nullable = true) | | |-- eta_: float (nullable = true) | | |-- phi_: float (nullable = true) | | |-- mass_: float (nullable = true) | | |-- vertex_: struct (nullable = true) | | | |-- fCoordinates: struct (nullable = true) | | | | |-- fX: float (nullable =

Reading/writing with Avro schemas AND Parquet format in SparkSQL

Anonymous (unverified) submitted on 2019-12-03 08:56:10
Question: I'm trying to write and read Parquet files from SparkSQL. For reasons of schema evolution, I would like to use Avro schemas for my writes and reads. My understanding is that this is possible outside of Spark (or manually within Spark) using e.g. AvroParquetWriter and Avro's Generic API. However, I would like to use SparkSQL's write() and read() methods (which work with DataFrameWriter and DataFrameReader) and which integrate well with SparkSQL (I will be writing and reading Datasets). I can't for the life of me figure out how to
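For the "outside of Spark" route the question mentions (AvroParquetWriter plus Avro's Generic API rather than DataFrameWriter), a minimal sketch looks roughly like this; the schema and output path are placeholders:

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericRecord}
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.avro.AvroParquetWriter

    // Hypothetical Avro schema used for writing (and later reading)
    val schema = new Schema.Parser().parse(
      """{"type":"record","name":"User","fields":[
        |  {"name":"id","type":"long"},
        |  {"name":"name","type":"string"}
        |]}""".stripMargin)

    val writer = AvroParquetWriter.builder[GenericRecord](new Path("/tmp/users.parquet"))
      .withSchema(schema)
      .build()

    val record = new GenericData.Record(schema)
    record.put("id", 1L)
    record.put("name", "alice")
    writer.write(record)
    writer.close()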

Why is Parquet slower for me than the text file format in Hive?

Anonymous (unverified) submitted on 2019-12-03 08:52:47
Question: OK! So I decided to use Parquet as the storage format for Hive tables, and before I actually implement it in my cluster, I decided to run some tests. Surprisingly, Parquet was slower in my tests, contrary to the general notion that it is faster than plain text files. Please note that I am using Hive 0.13 on MapR. The flow of my operations follows: Table A Format - Text Format Table size - 2.5 Gb Table B Format - Parquet Table size - 1.9 Gb [Create table B stored as parquet as select * from A] Table C Format - Parquet with snappy