parquet

How to read a file using Spark Streaming and write it out as a file using Scala?

限于喜欢 submitted on 2019-12-25 09:18:35
Question: I'm trying to read a file using a Scala Spark Streaming program. The file is stored in a directory on my local machine, and I am trying to write it out as a new file on my local machine as well. But whenever I write my stream and store it as parquet, I end up with blank folders. This is my code:

Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession
  .builder()
  .master("local[*]")
  .appName("StreamAFile")
  .config("spark.sql.warehouse.dir", "file:///C:/temp")
  .getOrCreate()
import spark
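The question is cut off above. As a minimal, hedged sketch (not the asker's original code; paths are illustrative), a Structured Streaming query that writes parquet needs both an output path and a checkpoint location, and the query has to be kept running:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("StreamAFile")
  .getOrCreate()

// The file source watches a directory and reads files as they appear
val lines = spark.readStream.text("file:///C:/temp/in")

val query = lines.writeStream
  .format("parquet")
  .option("path", "file:///C:/temp/out")               // output directory
  .option("checkpointLocation", "file:///C:/temp/chk") // required by the file sink
  .start()

query.awaitTermination()

If the program returns right after start(), the query never gets time to run, which is one common way to end up with empty output folders.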

Merge two parquet files in HDFS

爷,独闯天下 submitted on 2019-12-25 08:37:18
Question: I have some files in HDFS in parquet format. I would like to merge these files into one single large file. How can I do that? I have done something like the following, but only for text files:

hadoop fs -cat /input_hdfs_dir/* | hadoop fs -put - /output_hdfs_file

I am unable to achieve the desired result for the parquet format. How can I achieve this?

Answer 1: It's not possible to merge parquet files with plain HDFS commands, because simply concatenating the bytes breaks the parquet footer metadata. The parquet-tools library can help you merge parquet files.
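The answer above points to parquet-tools. As an alternative sketch (not part of the original answer; paths are illustrative), the same result can be obtained by reading the files with Spark and rewriting them, which re-encodes the data rather than stitching the files together:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("MergeParquet").getOrCreate()

// Read every parquet file under the input directory
val df = spark.read.parquet("hdfs:///input_hdfs_dir")

// coalesce(1) forces a single output part file; it funnels all the data
// through one task, so it is only reasonable for modest data sizes
df.coalesce(1)
  .write
  .parquet("hdfs:///output_hdfs_dir")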

Flattening nested values in a Spark parquet file

☆樱花仙子☆ submitted on 2019-12-25 06:49:56
Question: I have a parquet file that I loaded using Spark, and one of the values is a nested set of key/value pairs. How do I flatten it?

df.printSchema
root
|-- location: string (nullable = true)
|-- properties: string (nullable = true)

Sample row: texas,{"key":{"key1":"value1","key2":"value2"}}

Thanks.

Answer 1: You can use explode on your DataFrame and pass it a function that reads the JSON column using json4s. json4s has an easy parsing API; for your case it will look like:

val list = for { JArray(keys) <- parse(json) \\ "key" json @
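The answer above is truncated. As an alternative sketch that stays inside Spark SQL rather than json4s (df and the column names come from the question's schema; the rest is illustrative), the JSON string can be parsed with from_json and the inner map exploded into one row per key:

import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{MapType, StringType}

// properties holds JSON like {"key":{"key1":"value1","key2":"value2"}}
val parsed = df.withColumn(
  "parsed",
  from_json(col("properties"), MapType(StringType, MapType(StringType, StringType)))
)

// explode on a map yields one row per entry, with columns key and value
val flat = parsed
  .select(col("location"), explode(col("parsed")("key")))
  .toDF("location", "flat_key", "flat_value")

flat.show(false)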

How do I read a gzipped parquet file from S3 into Python using Boto3?

微笑、不失礼 submitted on 2019-12-25 01:43:50
Question: I have a file called data.parquet.gzip in my S3 bucket, and I can't figure out why I cannot read it. Normally I've worked with StringIO, but I don't know how to fix this. I want to load it from S3 into my Python Jupyter notebook session using pandas and boto3.

Answer 1: The solution is actually quite straightforward.

import boto3            # For reading from / pushing to the S3 bucket
import pandas as pd     # Reading parquet files
from io import BytesIO  # Wrapping raw bytes as a file-like input
import pyarrow          # Fast reading of

Basic usage of Spark SQL

对着背影说爱祢 submitted on 2019-12-24 15:32:07
I. Getting to know Spark SQL

1. What is Spark SQL?
Spark SQL is a Spark module mainly used for processing structured data. The core abstraction it provides is the DataFrame.

2. What is Spark SQL for?
It provides a programming abstraction (the DataFrame) and acts as a distributed SQL query engine. A DataFrame can be built from many sources, including structured data files, Hive tables, external relational databases, and RDDs.

3. How it works
Spark SQL is translated into RDD operations, which are then submitted to the cluster for execution.

4. Characteristics
Easy to integrate, a unified way of accessing data, Hive compatibility, and standard data connectivity.

5. SparkSession
SparkSession is a new concept introduced in Spark 2.0. It gives users a unified entry point from which to use all of Spark's functionality. In earlier versions of Spark, SparkContext was the main entry point: since the RDD was the primary API, we created and manipulated RDDs through the SparkContext, and every other API needed its own context. For example, Streaming required a StreamingContext, SQL a SQLContext, and Hive a HiveContext. As the Dataset and DataFrame APIs gradually became the standard, they needed an entry point of their own. So, in Spark 2.0,
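The article breaks off above. As a minimal sketch of the point it is making about SparkSession (illustrative code, not taken from the original post), a single SparkSession now acts as the entry point for SQL, DataFrames and Datasets:

import org.apache.spark.sql.SparkSession

// One entry point replacing the separate SQLContext and HiveContext
val spark = SparkSession.builder()
  .appName("SparkSqlIntro")
  .master("local[*]")
  .getOrCreate()

// Build a DataFrame from a structured data file (path is illustrative)
val df = spark.read.json("examples/people.json")

// Register a temporary view and query it with SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people").show()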

Spark (Part 12): Basic usage of Spark SQL

谁说胖子不能爱 submitted on 2019-12-24 15:31:49
I. The evolution of Spark SQL

Before 1.0: Shark
From 1.1.x: Spark SQL (experimental only)
1.3.x: Spark SQL (official release) + DataFrame
1.5.x: Spark SQL with Project Tungsten
1.6.x: Spark SQL + DataFrame + Dataset (experimental)
2.x: Spark SQL + DataFrame + Dataset (official release), further optimizations, Structured Streaming (built on Dataset)

Spark on Hive and Hive on Spark
Spark on Hive: Hive acts only as storage; Spark is responsible for SQL parsing, optimization, and execution.
Hive on Spark: Hive handles both storage and SQL parsing/optimization, while Spark is responsible for execution.

II. Getting to know Spark SQL

2.1 What is Spark SQL?
Spark SQL is a Spark module mainly used for processing structured data. The core programming abstraction it provides is the DataFrame.

2.2 What Spark SQL is for
It provides a programming abstraction (the DataFrame) and acts as a distributed SQL query engine. A DataFrame can be built from many sources, including structured data files, Hive tables, external relational databases, and RDDs.

2.3 How it works
Spark SQL is translated into RDD operations,
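As a small sketch of the "Spark on Hive" setup described above (illustrative, not from the original article), enabling Hive support lets Spark parse, optimize and execute SQL over tables whose storage is managed by Hive:

import org.apache.spark.sql.SparkSession

// Spark on Hive: Hive provides the storage, Spark does parsing, optimization and execution
val spark = SparkSession.builder()
  .appName("SparkOnHive")
  .enableHiveSupport() // expects a hive-site.xml on the classpath
  .getOrCreate()

// Table name is illustrative
spark.sql("SELECT count(*) FROM my_hive_db.my_table").show()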

Spark SQL data sources

若如初见. submitted on 2019-12-24 15:31:08
Contents: Background; Data sources (SparkSession, parquet, csv, json, jdbc, table: preparing the table, reading, writing, connecting to an existing Hive; text: format known in advance, format determined at runtime); Summary

Background

Spark SQL is a Spark module for processing structured data.

++++++++++++++   +++++++++++++++++++++
|     SQL    |   |    Dataset API    |
++++++++++++++   +++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++
|              Spark SQL              |
+++++++++++++++++++++++++++++++++++++++

There are two ways to use Spark SQL: through SQL or through the Dataset API; both are covered in this article. The SQL interface can itself be used in three ways: executing SQL inside a program, using the command line, or via JDBC/ODBC. Only the first way is covered here.

Spark's original abstraction for distributed datasets was the RDD, and the Dataset is its upgraded version. A DataFrame is a special Dataset whose elements are constrained to be organized into named columns, which makes it comparable to a table in a relational database. A DataFrame is equivalent to Dataset[Row], and the DataFrame is the core of this article.

DataFrames support a rich set of data sources:
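The article breaks off at its list of data sources. As a brief sketch of the reader/writer API it goes on to describe (paths, table names and connection settings are illustrative), the same DataFrameReader and DataFrameWriter calls cover the formats listed in the contents:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DataSources").getOrCreate()

// parquet is the default format
val parquetDf = spark.read.parquet("data/users.parquet")
parquetDf.write.parquet("out/users_parquet")

// csv and json use the same reader with a different method
val csvDf = spark.read.option("header", "true").csv("data/users.csv")
val jsonDf = spark.read.json("data/users.json")

// jdbc reads a table from an external relational database
val jdbcDf = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/mydb")
  .option("dbtable", "public.users")
  .option("user", "spark")
  .option("password", "secret")
  .load()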

How to write a parquet file with partitions in Java, similar to PySpark?

梦想与她 submitted on 2019-12-24 05:58:59
Question: I can write a parquet file into partitions in PySpark like this:

rdd.write
  .partitionBy("created_year", "created_month")
  .parquet("hdfs:///my_file")

The parquet output is automatically partitioned by created_year and created_month. How do I do the same in Java? I don't see an option in the ParquetWriter class. Is there another class that can do that? Thanks.

Answer 1: You have to convert your RDD into a DataFrame and then call its write parquet function.

df = sql_context.createDataFrame(rdd)
df.write.parquet("hdfs:///my_file"
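The answer above is truncated and written for PySpark; on the JVM side the DataFrameWriter exposes the same partitionBy and parquet methods, so the Java calls look the same apart from syntax. A hedged Scala sketch (the Event case class and its fields are hypothetical stand-ins for the RDD's element type):

import org.apache.spark.sql.SparkSession

case class Event(id: Long, created_year: Int, created_month: Int)

val spark = SparkSession.builder().appName("PartitionedWrite").getOrCreate()

// A stand-in RDD; in the question this comes from elsewhere
val rdd = spark.sparkContext.parallelize(Seq(Event(1L, 2019, 12)))

// Convert the RDD to a DataFrame, then let the writer lay out the partition directories
val df = spark.createDataFrame(rdd)
df.write
  .partitionBy("created_year", "created_month")
  .parquet("hdfs:///my_file")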

How can I open a .snappy.parquet file in Python?

我与影子孤独终老i submitted on 2019-12-24 03:44:19
Question: How can I open a .snappy.parquet file in Python 3.5? So far I used this code:

import numpy
import pyarrow
filename = "/Users/T/Desktop/data.snappy.parquet"
df = pyarrow.parquet.read_table(filename).to_pandas()

But it gives this error:

AttributeError: module 'pyarrow' has no attribute 'compat'

P.S. I installed pyarrow this way: pip install pyarrow

Answer 1: The error AttributeError: module 'pyarrow' has no attribute 'compat' is sadly a bit misleading. To execute the to_pandas() function on a

Error when writing a repartitioned SchemaRDD to Parquet with Spark SQL

烂漫一生 submitted on 2019-12-23 20:41:46
Question: I am trying to save Spark SQL tables to Parquet files. Because of other issues I need to reduce the number of partitions before writing. My code is:

data.coalesce(1000, shuffle = true).saveAsParquetFile("s3n://...")

This throws:

java.lang.NullPointerException
at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:927)
at org.apache.spark.rdd.PartitionCoalescer.currPrefLocs(CoalescedRDD.scala:174)
at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun
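The stack trace above is truncated and no answer is attached to this entry. As a hedged aside (not from the original question; paths are illustrative), on the Spark 2.x DataFrame API the same goal of fewer output files is normally expressed through the writer:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CoalesceBeforeWrite").getOrCreate()

// Stands in for the original SchemaRDD / table
val data = spark.read.parquet("s3a://bucket/input")

// repartition shuffles, like coalesce(1000, shuffle = true) on an RDD;
// coalesce(1000) would avoid the shuffle but can only lower the partition count
data.repartition(1000)
  .write
  .parquet("s3a://bucket/output")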