parquet

Using Spark to write a parquet file to s3 over s3a is very slow

Question: I'm trying to write a Parquet file out to Amazon S3 using Spark 1.6.1. The small Parquet file that I'm generating is ~2 GB once written, so it's not that much data. I'm trying to prove out Spark as a platform that I can use. Basically, what I'm doing is setting up a star schema with DataFrames, then writing those tables out to Parquet. The data comes in from CSV files provided by a vendor, and I'm using Spark as an ETL platform. I currently have a 3-node cluster in EC2 (r3.2xlarge). So …
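
One common culprit on S3 is the rename-based job commit, which copies every output file again at commit time. Below is a minimal sketch of a write that enables the v2 commit algorithm and disables speculative execution before writing; the bucket name and paths are placeholders, and the settings assume Spark 1.6 running on Hadoop 2.7 or later.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf()
  .setAppName("parquet-to-s3a")
  .set("spark.speculation", "false")  // speculative tasks can double-write output files on S3
val sc = new SparkContext(conf)
// The v2 commit algorithm moves task output in parallel instead of doing one
// serial rename pass at job commit, which is the usual bottleneck on S3.
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
val sqlContext = new SQLContext(sc)

val fact = sqlContext.read.parquet("s3a://my-bucket/staging/fact")      // placeholder input
fact.write.mode("overwrite").parquet("s3a://my-bucket/warehouse/fact")  // placeholder output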

How do you control the size of the output file?

Question: In Spark, what is the best way to control the file size of the output? For example, in log4j we can specify a maximum file size, after which the file rotates. I am looking for a similar solution for Parquet files. Is there a max file size option available when writing a file? I have a few workarounds, but none is good. If I want to limit files to 64 MB, one option is to repartition the data and write to a temp location, and then merge the files together using the file sizes in the temp location. But …
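
For reference, a minimal sketch of the two approaches usually suggested: capping rows per file (the maxRecordsPerFile write option, available from Spark 2.2 onwards) and repartitioning to a count derived from the data size. The paths, row count, and partition count below are assumed placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sized-output").getOrCreate()
val df = spark.read.parquet("/data/input")          // placeholder input

// Option 1 (Spark 2.2+): cap the number of rows per output file; pick the
// count so that rows * average row size lands near the 64 MB target.
df.write
  .option("maxRecordsPerFile", 500000L)             // assumed row count for ~64 MB
  .parquet("/data/output_capped")

// Option 2: repartition to roughly (total input size / 64 MB) partitions,
// so each task writes about one 64 MB file.
val targetFiles = 32                                // assumed: ~2 GB of data / 64 MB
df.repartition(targetFiles).write.parquet("/data/output_repartitioned")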

Spark SQL - loading csv/psv files with some malformed records

Question: We are loading hierarchies of directories of files with Spark and converting them to Parquet. There are tens of gigabytes in hundreds of pipe-separated files, and some are pretty big themselves. Every 100th file or so has a row or two with an extra delimiter that makes the whole process (or the file) abort. We are loading using:

sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", format("header"))
  .option("delimiter", format("delimeter"))
  .option("quote", format("quote"))
  …
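
A minimal sketch of the drop-bad-rows approach, assuming the spark-csv parse modes are acceptable here; the delimiter and paths are placeholders rather than the question's format() helper values.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("psv-load"))
val sqlContext = new SQLContext(sc)

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", "|")
  .option("mode", "DROPMALFORMED")   // drop rows whose column count doesn't match the header
  .load("/data/vendor/*.psv")        // placeholder path

df.write.parquet("/data/warehouse/vendor")   // placeholder output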

Reading parquet files from multiple directories in Pyspark

Question: I need to read Parquet files from multiple paths that are not parent or child directories. For example:

dir1 ---
       | ------- dir1_1
       | ------- dir1_2
dir2 ---
       | ------- dir2_1
       | ------- dir2_2

sqlContext.read.parquet(dir1) reads Parquet files from dir1_1 and dir1_2. Right now I'm reading each directory and merging the DataFrames using "unionAll". Is there a way to read Parquet files from dir1_2 and dir2_1 without using unionAll, or is there any fancy way using unionAll? Thanks.

Answer 1: A little late, but I found …
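
For what it's worth, DataFrameReader.parquet accepts a variable number of paths, so the two leaf directories can be read in a single call with no unionAll; the PySpark call is analogous (spark.read.parquet(path1, path2)). A short sketch in Scala, using the example directory names from the question:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("multi-dir-read").getOrCreate()

// Read only the two leaf directories, skipping dir1_1 and dir2_2.
val df = spark.read.parquet("dir1/dir1_2", "dir2/dir2_1")

// A collection of paths also works via varargs expansion.
val paths = Seq("dir1/dir1_2", "dir2/dir2_1")
val df2 = spark.read.parquet(paths: _*)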

Why are Spark Parquet files for an aggregate larger than the original?

Question: I am trying to create an aggregate file for end users to utilize, to avoid having them process multiple sources with much larger files. To do that I: A) iterate through all source folders, stripping out the 12 most commonly requested fields and spinning out Parquet files in a new location where these results are co-located; B) go back through the files created in step A and re-aggregate them by grouping on the 12 fields, to reduce the data to one summary row per unique combination. What …
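
A likely cause worth checking first: the groupBy shuffle randomizes row order, and Parquet's run-length and dictionary encodings (plus compression) work much better on sorted, low-cardinality data, so a summary can come out larger than the detail it was built from. Below is a hedged sketch of step B with a sort restored before the write; the field and measure names are placeholders standing in for the 12 grouping fields.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("re-aggregate").getOrCreate()
val detail = spark.read.parquet("/data/step_a_output")         // placeholder input

val summary = detail
  .groupBy("field1", "field2", "field3")                       // placeholder grouping fields
  .agg(sum("amount").as("amount"))                             // placeholder measure

summary
  .sortWithinPartitions("field1", "field2", "field3")          // restore an encoding-friendly order
  .write.parquet("/data/step_b_summary")                       // placeholder output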

Four Common Spark SQL Data Sources (Detailed)

Generic load/write methods: manually specifying options

Spark SQL's DataFrame interface supports operations on a variety of data sources. A DataFrame can be operated on in RDD style and can also be registered as a temporary table; once a DataFrame is registered as a temporary table, SQL queries can be run against it. Spark SQL's default data source is the Parquet format. When the data source is a Parquet file, Spark SQL can conveniently perform all of its operations on it. The default data source format can be changed via the configuration option spark.sql.sources.default.

scala> val df = spark.read.load("hdfs://hadoop001:9000/namesAndAges.parquet")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.select("name").write.save("names.parquet")

When the data source is not a Parquet file, the format must be specified manually. The data source format needs to be given by its full name (for example, org.apache.spark.sql.parquet); for built-in formats, the short name is enough: json, parquet, jdbc, orc, libsvm, csv, text.
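
A short sketch of the manual-format variant described above; the JSON input path is assumed rather than taken from the article.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("load-write-formats").getOrCreate()

// Read a non-Parquet source by naming its format explicitly (short names work for built-ins).
val people = spark.read.format("json").load("hdfs://hadoop001:9000/people.json")

// Write it back out, again naming the target format explicitly.
people.select("name").write.format("parquet").save("people.parquet")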