parquet

How to read a file using Spark Streaming and write it out as a file using Scala?

限于喜欢 submitted on 2019-12-25 09:18:35
Question: I'm trying to read a file using a Scala Spark Streaming program. The file is stored in a directory on my local machine, and I am trying to write it out as a new file on my local machine as well. But whenever I write my stream and store it as parquet, I end up with blank folders. This is my code:

Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession
  .builder()
  .master("local[*]")
  .appName("StreamAFile")
  .config("spark.sql.warehouse.dir", "file:///C:/temp")
  .getOrCreate()
import spark
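The question is cut off above. As a minimal, hedged sketch (not the asker's original code; paths are illustrative), a Structured Streaming query that writes parquet needs both an output path and a checkpoint location, and the query has to be kept running:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("StreamAFile")
  .getOrCreate()

// The file source watches a directory and reads files as they appear
val lines = spark.readStream.text("file:///C:/temp/in")

val query = lines.writeStream
  .format("parquet")
  .option("path", "file:///C:/temp/out")               // output directory
  .option("checkpointLocation", "file:///C:/temp/chk") // required by the file sink
  .start()

query.awaitTermination()

If the program returns right after start(), the query never gets time to run, which is one common way to end up with empty output folders.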

Merge two parquet files in HDFS

爷,独闯天下 submitted on 2019-12-25 08:37:18
Question: I have some files in HDFS in parquet format. I would like to merge these files into one single large file. How can I do that? I have done something like the following, but only for text files:

hadoop fs -cat /input_hdfs_dir/* | hadoop fs -put - /output_hdfs_file

I am unable to achieve the desired result for the parquet format. How can I achieve this?

Answer 1: It's not possible to merge parquet files with plain HDFS commands, because simply concatenating the bytes breaks the parquet footer metadata. The parquet-tools library can help you merge parquet files.
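The answer above points to parquet-tools. As an alternative sketch (not part of the original answer; paths are illustrative), the same result can be obtained by reading the files with Spark and rewriting them, which re-encodes the data rather than stitching the files together:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("MergeParquet").getOrCreate()

// Read every parquet file under the input directory
val df = spark.read.parquet("hdfs:///input_hdfs_dir")

// coalesce(1) forces a single output part file; it funnels all the data
// through one task, so it is only reasonable for modest data sizes
df.coalesce(1)
  .write
  .parquet("hdfs:///output_hdfs_dir")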

Flattening nested values in a Spark parquet file

☆樱花仙子☆ submitted on 2019-12-25 06:49:56
Question: I have a parquet file that I loaded using Spark, and one of the values is a nested set of key/value pairs. How do I flatten it?

df.printSchema
root
|-- location: string (nullable = true)
|-- properties: string (nullable = true)

Sample row: texas,{"key":{"key1":"value1","key2":"value2"}}

Thanks.

Answer 1: You can use explode on your DataFrame and pass it a function that reads the JSON column using json4s. json4s has an easy parsing API; for your case it will look like:

val list = for { JArray(keys) <- parse(json) \\ "key" json @
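The answer above is truncated. As an alternative sketch that stays inside Spark SQL rather than json4s (df and the column names come from the question's schema; the rest is illustrative), the JSON string can be parsed with from_json and the inner map exploded into one row per key:

import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{MapType, StringType}

// properties holds JSON like {"key":{"key1":"value1","key2":"value2"}}
val parsed = df.withColumn(
  "parsed",
  from_json(col("properties"), MapType(StringType, MapType(StringType, StringType)))
)

// explode on a map yields one row per entry, with columns key and value
val flat = parsed
  .select(col("location"), explode(col("parsed")("key")))
  .toDF("location", "flat_key", "flat_value")

flat.show(false)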

How do I read a gzipped parquet file from S3 into Python using Boto3?

微笑、不失礼 submitted on 2019-12-25 01:43:50
Question: I have a file called data.parquet.gzip in my S3 bucket, and I can't figure out why I cannot read it. Normally I've worked with StringIO, but I don't know how to fix this. I want to load it from S3 into my Python Jupyter notebook session using pandas and boto3.

Answer 1: The solution is actually quite straightforward.

import boto3            # For reading from / pushing to the S3 bucket
import pandas as pd     # Reading parquet files
from io import BytesIO  # Wrapping raw bytes as a file-like input
import pyarrow          # Fast reading of

Basic usage of Spark SQL

对着背影说爱祢 submitted on 2019-12-24 15:32:07
I. Getting to know Spark SQL

1. What is Spark SQL?
Spark SQL is a Spark module mainly used for processing structured data. The core abstraction it provides is the DataFrame.

2. What is Spark SQL for?
It provides a programming abstraction (the DataFrame) and acts as a distributed SQL query engine. A DataFrame can be built from many sources, including structured data files, Hive tables, external relational databases, and RDDs.

3. How it works
Spark SQL is translated into RDD operations, which are then submitted to the cluster for execution.

4. Characteristics
Easy to integrate, a unified way of accessing data, Hive compatibility, and standard data connectivity.

5. SparkSession
SparkSession is a new concept introduced in Spark 2.0. It gives users a unified entry point from which to use all of Spark's functionality. In earlier versions of Spark, SparkContext was the main entry point: since the RDD was the primary API, we created and manipulated RDDs through the SparkContext, and every other API needed its own context. For example, Streaming required a StreamingContext, SQL a SQLContext, and Hive a HiveContext. As the Dataset and DataFrame APIs gradually became the standard, they needed an entry point of their own. So, in Spark 2.0,
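The article breaks off above. As a minimal sketch of the point it is making about SparkSession (illustrative code, not taken from the original post), a single SparkSession now acts as the entry point for SQL, DataFrames and Datasets:

import org.apache.spark.sql.SparkSession

// One entry point replacing the separate SQLContext and HiveContext
val spark = SparkSession.builder()
  .appName("SparkSqlIntro")
  .master("local[*]")
  .getOrCreate()

// Build a DataFrame from a structured data file (path is illustrative)
val df = spark.read.json("examples/people.json")

// Register a temporary view and query it with SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people").show()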

Spark (Part 12): Basic usage of Spark SQL

谁说胖子不能爱 submitted on 2019-12-24 15:31:49
I. The evolution of Spark SQL

Before 1.0: Shark
From 1.1.x: Spark SQL (experimental only)
1.3.x: Spark SQL (official release) + DataFrame
1.5.x: Spark SQL with Project Tungsten
1.6.x: Spark SQL + DataFrame + Dataset (experimental)
2.x: Spark SQL + DataFrame + Dataset (official release), further optimizations, Structured Streaming (built on Dataset)

Spark on Hive and Hive on Spark
Spark on Hive: Hive acts only as storage; Spark is responsible for SQL parsing, optimization, and execution.
Hive on Spark: Hive handles both storage and SQL parsing/optimization, while Spark is responsible for execution.

II. Getting to know Spark SQL

2.1 What is Spark SQL?
Spark SQL is a Spark module mainly used for processing structured data. The core programming abstraction it provides is the DataFrame.

2.2 What Spark SQL is for
It provides a programming abstraction (the DataFrame) and acts as a distributed SQL query engine. A DataFrame can be built from many sources, including structured data files, Hive tables, external relational databases, and RDDs.

2.3 How it works
Spark SQL is translated into RDD operations,
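As a small sketch of the "Spark on Hive" setup described above (illustrative, not from the original article), enabling Hive support lets Spark parse, optimize and execute SQL over tables whose storage is managed by Hive:

import org.apache.spark.sql.SparkSession

// Spark on Hive: Hive provides the storage, Spark does parsing, optimization and execution
val spark = SparkSession.builder()
  .appName("SparkOnHive")
  .enableHiveSupport() // expects a hive-site.xml on the classpath
  .getOrCreate()

// Table name is illustrative
spark.sql("SELECT count(*) FROM my_hive_db.my_table").show()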

Spark SQL data sources

若如初见. submitted on 2019-12-24 15:31:08
Contents: Background; Data sources (SparkSession, parquet, csv, json, jdbc, table: preparing the table, reading, writing, connecting to an existing Hive; text: format known in advance, format determined at runtime); Summary

Background

Spark SQL is a Spark module for processing structured data.

++++++++++++++   +++++++++++++++++++++
|     SQL    |   |    Dataset API    |
++++++++++++++   +++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++
|              Spark SQL              |
+++++++++++++++++++++++++++++++++++++++

There are two ways to use Spark SQL: through SQL or through the Dataset API; both are covered in this article. The SQL interface can itself be used in three ways: executing SQL inside a program, using the command line, or via JDBC/ODBC. Only the first way is covered here.

Spark's original abstraction for distributed datasets was the RDD, and the Dataset is its upgraded version. A DataFrame is a special Dataset whose elements are constrained to be organized into named columns, which makes it comparable to a table in a relational database. A DataFrame is equivalent to Dataset[Row], and the DataFrame is the core of this article.

DataFrames support a rich set of data sources:
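The article breaks off at its list of data sources. As a brief sketch of the reader/writer API it goes on to describe (paths, table names and connection settings are illustrative), the same DataFrameReader and DataFrameWriter calls cover the formats listed in the contents:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DataSources").getOrCreate()

// parquet is the default format
val parquetDf = spark.read.parquet("data/users.parquet")
parquetDf.write.parquet("out/users_parquet")

// csv and json use the same reader with a different method
val csvDf = spark.read.option("header", "true").csv("data/users.csv")
val jsonDf = spark.read.json("data/users.json")

// jdbc reads a table from an external relational database
val jdbcDf = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/mydb")
  .option("dbtable", "public.users")
  .option("user", "spark")
  .option("password", "secret")
  .load()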

How to write a parquet file with partitions in Java, similar to PySpark?

梦想与她 submitted on 2019-12-24 05:58:59
Question: I can write a parquet file into partitions in PySpark like this:

rdd.write
  .partitionBy("created_year", "created_month")
  .parquet("hdfs:///my_file")

The parquet output is automatically partitioned by created_year and created_month. How do I do the same in Java? I don't see an option in the ParquetWriter class. Is there another class that can do that? Thanks.

Answer 1: You have to convert your RDD into a DataFrame and then call its write parquet function.

df = sql_context.createDataFrame(rdd)
df.write.parquet("hdfs:///my_file"
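The answer above is truncated and written for PySpark; on the JVM side the DataFrameWriter exposes the same partitionBy and parquet methods, so the Java calls look the same apart from syntax. A hedged Scala sketch (the Event case class and its fields are hypothetical stand-ins for the RDD's element type):

import org.apache.spark.sql.SparkSession

case class Event(id: Long, created_year: Int, created_month: Int)

val spark = SparkSession.builder().appName("PartitionedWrite").getOrCreate()

// A stand-in RDD; in the question this comes from elsewhere
val rdd = spark.sparkContext.parallelize(Seq(Event(1L, 2019, 12)))

// Convert the RDD to a DataFrame, then let the writer lay out the partition directories
val df = spark.createDataFrame(rdd)
df.write
  .partitionBy("created_year", "created_month")
  .parquet("hdfs:///my_file")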

How can I open a .snappy.parquet file in Python?

我与影子孤独终老i submitted on 2019-12-24 03:44:19
Question: How can I open a .snappy.parquet file in Python 3.5? So far I used this code:

import numpy
import pyarrow
filename = "/Users/T/Desktop/data.snappy.parquet"
df = pyarrow.parquet.read_table(filename).to_pandas()

But it gives this error:

AttributeError: module 'pyarrow' has no attribute 'compat'

P.S. I installed pyarrow this way: pip install pyarrow

Answer 1: The error AttributeError: module 'pyarrow' has no attribute 'compat' is sadly a bit misleading. To execute the to_pandas() function on a

Error when writing a repartitioned SchemaRDD to Parquet with Spark SQL

烂漫一生 submitted on 2019-12-23 20:41:46
Question: I am trying to save Spark SQL tables to Parquet files. Because of other issues I need to reduce the number of partitions before writing. My code is:

data.coalesce(1000, shuffle = true).saveAsParquetFile("s3n://...")

This throws:

java.lang.NullPointerException
at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:927)
at org.apache.spark.rdd.PartitionCoalescer.currPrefLocs(CoalescedRDD.scala:174)
at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun
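The stack trace above is truncated and no answer is attached to this entry. As a hedged aside (not from the original question; paths are illustrative), on the Spark 2.x DataFrame API the same goal of fewer output files is normally expressed through the writer:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CoalesceBeforeWrite").getOrCreate()

// Stands in for the original SchemaRDD / table
val data = spark.read.parquet("s3a://bucket/input")

// repartition shuffles, like coalesce(1000, shuffle = true) on an RDD;
// coalesce(1000) would avoid the shuffle but can only lower the partition count
data.repartition(1000)
  .write
  .parquet("s3a://bucket/output")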