parquet

Hadoop MapReduce + HDFS file formats and compression formats + the relationship between splits and map tasks + a WordCount walkthrough + understanding shuffle

Posted by 最后都变了- on 2019-12-15 01:47:13
I. The MapReduce-on-YARN flow

1. What is MapReduce? MapReduce is a computation framework whose core idea is "divide and conquer": it takes an input, processes it through a predefined computation model, and produces an output, which is the result we need. A MapReduce job runs in two phases, the map phase and the reduce phase, and each phase uses key/value pairs as its input and output. The programmer's job is to define the functions for these two phases: the map function and the reduce function.
Map: the mapping step, which maps a set of data to new data through a map function. Each input line is parsed into a <k,v> pair; the map function is called once per pair and emits new <k,v> pairs.
Shuffle: the sorting, grouping, and copying of the mapped data.
Reduce: the reduction step, which aggregates the groups of mapped results and writes the output.

2. What YARN does. YARN consists of the ResourceManager and the NodeManagers. The RM contains an applications manager and a resource (memory + CPU) scheduler. ResourceManager: responsible for resource management; while the system is running there is exactly one RM, and it schedules and manages all of the system's resources.
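To make the map, shuffle, and reduce phases concrete, here is a minimal WordCount sketch in plain Scala collections rather than the Hadoop API; the phase boundaries are marked in comments and the input lines are made up for illustration.

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val lines = Seq("hello world", "hello hadoop")          // made-up input

    // Map phase: each line is split into <word, 1> key/value pairs.
    val mapped: Seq[(String, Int)] =
      lines.flatMap(_.split("\\s+").map(word => (word, 1)))

    // Shuffle phase: pairs are grouped by key.
    val shuffled: Map[String, Seq[Int]] =
      mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2)) }

    // Reduce phase: the values for each key are summed and emitted.
    val reduced: Map[String, Int] =
      shuffled.map { case (word, counts) => (word, counts.sum) }

    reduced.foreach { case (word, count) => println(s"$word\t$count") }
  }
}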

Dataframe state before save and after load - what's different?

Posted by 巧了我就是萌 on 2019-12-14 03:53:17
Question: I have a DF that contains some SQL expressions (coalesce, case/when etc.). I later try to map/flatMap this DF, where I get a Task not serializable error due to the fields that contain the SQL expressions. (Why I need to map/flatMap this DF is a separate question.) When I save this DF to a Parquet file and load it afterwards, the error is gone and I can convert to RDD and do transformations no problem! How is the DF different before saving and after loading? In some way, the SQL expressions
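A minimal sketch of the round-trip the question describes, assuming a SparkSession named spark, a hypothetical checkpoint path, and a hypothetical column name; writing the DataFrame out materializes its rows, so the reloaded DataFrame no longer drags the original expression objects into the task closure.

// df is the DataFrame built with coalesce/case-when expressions.
df.write.mode("overwrite").parquet("/tmp/df_checkpoint")   // path is an assumption
val reloaded = spark.read.parquet("/tmp/df_checkpoint")

// After the round trip, converting to an RDD and mapping works without
// the "Task not serializable" error the in-memory plan triggered.
val ids = reloaded.rdd.map(_.getAs[String]("id"))          // column name is hypothetical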

MapReduce Job to Collect All Unique Fields in HDFS Directory of JSON

Posted by 帅比萌擦擦* on 2019-12-14 02:38:38
Question: My question is in essence the application of this referenced question: Convert JSON to Parquet. I find myself in the rather unique position of having to semi-manually curate an Avro schema for the superset of fields contained in JSON files (composed of arbitrary combinations of known resources) in an HDFS directory. This is part of an ETL pipeline I am trying to develop to convert these files to Parquet for much more efficient/easier processing in Spark. I have never written a MapReduce program
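Before writing a full MapReduce job, it may be worth noting that Spark's JSON reader already merges the fields it sees across a directory. A hedged sketch (Spark, not MapReduce), assuming newline-delimited JSON, a SparkSession named spark, and a made-up path:

// Read every JSON file under the directory; the inferred schema is the
// superset (union) of all fields encountered, nested structures included.
val df = spark.read.json("hdfs:///data/json-input")   // path is an assumption

// Print the merged schema; it can serve as a starting point for the Avro schema.
df.schema.printTreeString()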

Sqoop + S3 + Parquet results in Wrong FS error

Posted by 冷暖自知 on 2019-12-13 04:49:58
Question: When trying to import data to S3 in Parquet format using Sqoop, as follows: bin/sqoop import --connect 'jdbc:[conn_string]' --table [table] --target-dir s3a://bucket-name/ --hive-drop-import-delims --as-parquetfile ... I get the following error: ERROR tool.ImportTool: Imported Failed: Wrong FS: s3a://bucket-name/, expected: hdfs://localhost:9000 I have no problem importing non-Parquet data or working with s3a directly through HDFS. It seems like this issue, but it was supposedly fixed many

How to handle keys in Json with special characters in spark parquet?

Posted by 懵懂的女人 on 2019-12-13 03:54:12
Question: I am trying to create a data frame from JSON and write it out in Parquet format. I am getting the following exception: Exception in thread "main" org.apache.spark.sql.AnalysisException: Attribute name "d?G?@4???[[l?~?N!^w1 ?X!8??ingSuccessful" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.; I know that some JSON keys having special characters is the reason for the above exception. However, I do not know how many keys have special characters. Also, one possible solution is to replace
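One hedged sketch of the renaming approach, assuming only top-level keys are affected (nested fields would need the schema rewritten recursively); the underscore replacement and the output path are arbitrary choices:

// Characters Parquet rejects in attribute names, per the error message.
val invalidChars = "[ ,;{}()\\n\\t=]"

// Rename every top-level column, replacing invalid characters with underscores.
val sanitized = df.columns.foldLeft(df) { (acc, name) =>
  acc.withColumnRenamed(name, name.replaceAll(invalidChars, "_"))
}

sanitized.write.parquet("hdfs:///data/output.parquet")   // path is an assumption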

Print parquet schema using Spark Streaming

Posted by 独自空忆成欢 on 2019-12-13 03:33:09
Question: Following is an extract of the Scala code written to read Parquet files and print the schema and the first few records from the Parquet file. But nothing is getting printed. val batchDuration = 2 val inputDir = "file:///home/samplefiles" val conf = new SparkConf().setAppName("gpParquetStreaming").setMaster("local[*]") val sc = new SparkContext(conf) sc.hadoopConfiguration.set("spark.streaming.fileStream.minRememberDuration", "600000") val ssc = new StreamingContext(sc, Seconds(batchDuration))
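The DStream fileStream API has no built-in Parquet record reader, so one alternative worth sketching is Structured Streaming, where a Parquet source only needs an explicit schema. This is a different technique from the question's DStream code, and it assumes a SparkSession named spark already exists:

val inputDir = "file:///home/samplefiles"

// A streaming Parquet source requires a schema up front; borrow it from a one-off batch read.
val schema = spark.read.parquet(inputDir).schema
schema.printTreeString()

// Stream new Parquet files from the directory and print the first few rows of each batch.
val query = spark.readStream
  .schema(schema)
  .parquet(inputDir)
  .writeStream
  .format("console")
  .option("numRows", 5)
  .start()

query.awaitTermination()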

Spark Sql - Insert Into External Hive Table Error

Posted by 僤鯓⒐⒋嵵緔 on 2019-12-13 03:23:51
Question: I am trying to insert data into an external Hive table through Spark SQL. My Hive table is bucketed by a column. The query to create the external Hive table is this: create external table tab1 ( col1 type,col2 type,col3 type) clustered by (col1,col2) sorted by (col1) into 8 buckets stored as parquet Now I tried to store data from a Parquet file (stored in HDFS) into the table. This is my code: SparkSession session = SparkSession.builder().appName("ParquetReadWrite"). config("hive.exec.dynamic
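A hedged Scala sketch of the usual insert path, assuming Hive support is enabled on the session; the config keys and paths below are assumptions, since the question's snippet is cut off. Note that Spark 2.x commonly cannot populate Hive-compatible bucketed output, so inserts into tables declared CLUSTERED BY frequently fail for exactly that reason.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ParquetReadWrite")
  .config("hive.exec.dynamic.partition", "true")            // assumed config
  .config("hive.exec.dynamic.partition.mode", "nonstrict")  // assumed config
  .enableHiveSupport()
  .getOrCreate()

// Read the source Parquet file from HDFS and append it into the external table.
spark.read.parquet("hdfs:///data/source.parquet")           // path is an assumption
  .write
  .mode("append")
  .insertInto("tab1")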

s3 parquet write - too many partitions, slow writing

Posted by 白昼怎懂夜的黑 on 2019-12-13 02:47:41
Question: I have a Scala Spark job that writes to S3 as Parquet files. It's 6 billion records so far, and it will keep growing daily. As per the use case, our API will query the Parquet data based on id. So, to make the query results faster, I am writing the Parquet with partitions on id. However, we have 1330360 unique ids, so this creates 1330360 Parquet files while writing; the writing step is very slow, it has been writing for the past 9 hours and it's still running. output.write.mode("append").partitionBy("id")
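A hedged sketch of one common workaround: partition on a coarser derived key so the number of output directories stays bounded, and have the API filter on both the bucket and the id. The bucket count of 1000 and the output path are assumptions:

import org.apache.spark.sql.functions.{col, hash, lit, pmod}

// Derive a bounded bucket key from the high-cardinality id.
val bucketed = output.withColumn("id_bucket", pmod(hash(col("id")), lit(1000)))

// 1000 directories instead of 1330360; queries filter on id_bucket first, then id.
bucketed.write
  .mode("append")
  .partitionBy("id_bucket")
  .parquet("s3a://bucket-name/path/")   // path is an assumption

Readers would compute the same pmod(hash(id), 1000) on the lookup id to hit only the matching directory.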

Firehose JSON -> S3 Parquet -> ETL Spark, error: Unable to infer schema for Parquet

Posted by 我与影子孤独终老i on 2019-12-12 17:11:51
Question: It seems like this should be easy, like it's a core use case of this set of features, but it's been problem after problem. The latest is in trying to run commands via a Glue dev endpoint (both the PySpark and Scala endpoints). Following the instructions here: https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-repl.html import sys from pyspark.context import SparkContext from awsglue.context import GlueContext from awsglue.transforms import * glueContext = GlueContext

Specify Parquet properties pyspark

Posted by 亡梦爱人 on 2019-12-12 14:27:18
Question: How to specify the Parquet block size and page size in PySpark? I have searched everywhere but cannot find any documentation for the function calls or the import libraries. Answer 1: According to the spark-user archives: sc.hadoopConfiguration.setInt("dfs.blocksize", some_value) sc.hadoopConfiguration.setInt("parquet.block.size", some_value) so in PySpark: sc._jsc.hadoopConfiguration().setInt("dfs.blocksize", some_value) sc._jsc.hadoopConfiguration().setInt("parquet.block.size", some_value) Source: https:/