parquet

Passing a Paramiko connection SFTPFile as input to a dask.dataframe.read_parquet

你说的曾经没有我的故事 submitted on 2019-12-06 22:10:39
I tried to pass a paramiko.sftp_file.SFTPFile object instead of a file URL to pandas.read_parquet and it worked fine. But when I tried the same with Dask, it threw an error. Below is the code I ran and the error I get. How can I make this work?

    import dask.dataframe as dd
    import paramiko

    ssh = paramiko.SSHClient()
    sftp_client = ssh.open_sftp()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    source_file = sftp_client.open(str(parquet_file), 'rb')
    full_df = dd.read_parquet(source_file, engine='pyarrow')
    print(len(full_df))

Traceback (most recent call last): File "C:\Users\rrrrr\Documents
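One approach worth noting (not from the original post) is to let Dask open the remote file itself via fsspec's SFTP support instead of passing an already-open paramiko handle. This is a minimal sketch, assuming paramiko is installed so fsspec can resolve the sftp:// scheme; the host, credentials, and path are hypothetical placeholders.

    import dask.dataframe as dd

    # fsspec resolves the sftp:// URL and opens the file with paramiko under the hood.
    # Host, credentials, and path below are placeholders, not values from the question.
    full_df = dd.read_parquet(
        "sftp://example-host/remote/path/data.parquet",
        engine="pyarrow",
        storage_options={"username": "user", "password": "secret"},
    )
    print(len(full_df))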

Reading External Data Sources

风格不统一 submitted on 2019-12-06 16:20:27
See the official documentation: http://spark.apache.org/docs/2.2.0/sql-programming-guide.html

How to read external data:
Read: spark.read.format(format). Supported formats — built-in: json, parquet, jdbc, csv (2.x); external: see https://spark-packages.org/, which lists many external data sources.
Write: people.write.format("parquet").save("path")

Working with Parquet files. I tested this locally, downloading the file from the server's Spark directory to my machine: /home/hadoop/app/spark-2.2.0-bin-hadoop2.6/examples/src/main/resources/users.parquet. The code is as follows:

    package com.yy.spark

    import org.apache.spark.sql.SparkSession

    /**
     * Read a parquet file
     */
    object ParquetApp extends App {
      val path = "file:///D:\\data\\users.parquet"
      var spark = SparkSession.builder().appName("ParquetApp").master(
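For readers working in PySpark rather than Scala, here is a roughly equivalent sketch of reading and writing Parquet; the paths and app/master settings are placeholders, not taken from the excerpt above.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ParquetApp").master("local[2]").getOrCreate()

    # Read a parquet file (placeholder path)
    users = spark.read.format("parquet").load("file:///tmp/users.parquet")
    users.printSchema()

    # Write the DataFrame back out as parquet (placeholder path)
    users.write.format("parquet").save("file:///tmp/users_out.parquet")

    spark.stop()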

How to handle TIMESTAMP_MICROS parquet fields in Presto/Athena

柔情痞子 submitted on 2019-12-06 16:00:44
Presently, we have a DMS task that takes the contents of a MySQL DB and dumps files to S3 in parquet format. The timestamps in the parquet files end up as TIMESTAMP_MICROS. This is a problem because Presto (the underlying implementation of Athena) does not support microsecond-precision timestamps and assumes all timestamps are in millisecond precision. This does not cause errors directly, but it makes the times display as an extreme future date, since the microsecond count is interpreted as a millisecond count. We are currently working around this by
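One way to sidestep the precision mismatch, offered here only as an illustrative sketch rather than the workaround the poster describes, is to rewrite the files with pyarrow and coerce the timestamps to millisecond precision; the input and output paths are placeholders.

    import pyarrow.parquet as pq

    # Placeholder paths for illustration
    table = pq.read_table("input_micros.parquet")

    # coerce_timestamps='ms' downcasts TIMESTAMP_MICROS columns to milliseconds;
    # allow_truncated_timestamps avoids errors when sub-millisecond detail is dropped.
    pq.write_table(
        table,
        "output_millis.parquet",
        coerce_timestamps="ms",
        allow_truncated_timestamps=True,
    )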

How to deal with large number of parquet files

℡╲_俬逩灬. submitted on 2019-12-06 14:18:06
Question: I'm using Apache Parquet on Hadoop and after a while I have one concern. When I generate parquet files with Spark on Hadoop, things can get pretty messy. By messy I mean that the Spark job generates a large number of parquet files, and when I try to query them the queries take a long time because Spark has to merge all the files together. Can you show me the right way to deal with this, or am I perhaps misusing them? Have you already dealt with this, and how did you resolve it? UPDATE 1: Is some "side job"
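A common mitigation, given here as a hedged sketch rather than the thread's accepted answer, is to reduce the number of output files before writing, e.g. with coalesce or repartition; the paths and partition count are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("compact-parquet").getOrCreate()

    df = spark.read.parquet("/data/messy_parquet")   # placeholder input path

    # Rewrite into a handful of larger files instead of thousands of tiny ones.
    # coalesce(16) only merges existing partitions; repartition(16) forces a
    # shuffle and yields more evenly sized files.
    df.coalesce(16).write.mode("overwrite").parquet("/data/compacted_parquet")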

Spark issues reading parquet files

怎甘沉沦 submitted on 2019-12-06 12:25:13
I have two parquet part files, part-00043-0bfd7e28-6469-4849-8692-e625c25485e2-c000.snappy.parquet (from the 2017 Nov 14th run) and part-00199-64714828-8a9e-4ae1-8735-c5102c0a834d-c000.snappy.parquet (from the 2017 Nov 16th run), and both have the same schema (which I verified by printing the schema). My problem is that I have, say, 10 columns which come through properly if I read the two files separately with Spark. But if I put the files in one folder and try to read them together, the total row count is correct (the sum of the rows in the two files), yet most of the columns from the 2nd file are null. Only
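If the two runs wrote slightly different footers, asking Spark to reconcile them is one thing to try; this is an illustrative sketch with a placeholder folder, not the confirmed fix from the thread.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-parts").getOrCreate()

    # mergeSchema makes Spark merge the footers of all part files instead of
    # taking the schema from whichever single file it samples first.
    df = (
        spark.read
        .option("mergeSchema", "true")
        .parquet("/data/combined_runs")   # placeholder folder holding both part files
    )
    df.show(5)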

How to execute a spark sql query from a map function (Python)?

帅比萌擦擦* submitted on 2019-12-06 10:49:16
Question: How does one execute Spark SQL queries from routines that are not the driver portion of the program?

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.types import *

    def doWork(rec):
        data = SQLContext.sql("select * from zip_data where STATEFP ='{sfp}' and COUNTYFP = '{cfp}' ".format(sfp=rec[0], cfp=rec[1]))
        for item in data.collect():
            print(item)
        # do something
        return (rec[0], rec[1])

    if __name__ == "__main__":
        sc = SparkContext(appName="Some app")
        print(
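SparkContext and SQLContext objects cannot be used inside executor-side functions such as map. A common restructuring, sketched here with hypothetical key values and assuming the zip_data table is registered, is to express the per-record lookup as a join instead.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("Some app").getOrCreate()

    zip_data = spark.table("zip_data")                   # assumes the table is registered
    keys = spark.createDataFrame(
        [("06", "037"), ("36", "061")],                  # hypothetical (STATEFP, COUNTYFP) pairs
        ["STATEFP", "COUNTYFP"],
    )

    # Join driver-defined DataFrames instead of issuing SQL inside map();
    # broadcast() hints that the small key set should be shipped to every executor.
    matched = zip_data.join(broadcast(keys), ["STATEFP", "COUNTYFP"])
    matched.show()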

Hive Formats: Parquet vs. ORC Performance Test Report

北城以北 submitted on 2019-12-06 09:53:39
1. Environment
Hadoop cluster: a test Hadoop cluster with nodes hadoop230, hadoop231, hadoop232, and hadoop233. These machines share the same configuration: 2 CPUs, 32 CPU threads, 128 GB of memory, and 48 TB of disk. The tests use a single queue on the test cluster with the full cluster's resources, and all queries run without concurrency. Hive is the official 1.2.1 release, started via hiveserver2, with metadata stored in a local MySQL instance.

2. Generating the test data
The test data comes from the TPC-DS benchmark (official documentation: http://www.tpc.org/information/current_specifications.asp). The data set contains 24 tables: 7 fact tables and 17 dimension tables, where each fact table together with most of the dimension tables forms a snowflake schema. scale_factor is set to 100, i.e. 100 GB of data is generated.
2.1 Download hive-testbench: git clone https://github.com/hortonworks/hive-testbench — this project generates the TPC-DS data set and loads it into Hive; before using it, make sure the hive and hadoop commands are already on your PATH.
2.2 Build: enter the directory and run ./tpcds-build.sh, which downloads the TPC-DS source code and compiles it

Spark Streaming appends to S3 as Parquet format, too many small partitions

心不动则不痛 submitted on 2019-12-06 08:47:21
I am building an app that uses Spark Streaming to receive data from Kinesis streams on AWS EMR. One of the goals is to persist the data into S3 (EMRFS), and for this I am using a 2-minute non-overlapping window. My approach: Kinesis Stream -> Spark Streaming with a batch duration of about 60 seconds, using a non-overlapping window of 120s, saving the streamed data into S3 as:

    val rdd1 = kinesisStream.map(rdd => /* decode the data */)
    rdd1.window(Seconds(120), Seconds(120)).foreachRDD { rdd =>
      val spark = SparkSession...
      import spark.implicits._
      // convert rdd to df
      val df = rdd.toDF(columnNames: _
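To cut down on the number of tiny S3 objects each window produces, one commonly suggested adjustment, shown here only as a hedged PySpark-style sketch with placeholder names rather than the poster's Scala code, is to coalesce each window's DataFrame before writing.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kinesis-to-s3").getOrCreate()

    # Stand-in for the DataFrame produced from one 120s window of the stream.
    window_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    # Collapse the window's many partitions into a few larger files before writing,
    # so each window appends a handful of parquet objects instead of hundreds.
    (window_df
        .coalesce(4)                                        # 4 is an arbitrary placeholder
        .write
        .mode("append")
        .parquet("s3://example-bucket/streaming-output/"))  # placeholder bucket/prefix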

Avoid losing data type for the partitioned data when writing from Spark

会有一股神秘感。 submitted on 2019-12-06 07:02:14
I have a dataframe like below:

    itemName, itemCategory
    Name1, C0
    Name2, C1
    Name3, C0

I would like to save this dataframe as a partitioned parquet file:

    df.write.mode("overwrite").partitionBy("itemCategory").parquet(path)

For this dataframe, when I read the data back, itemCategory will have the data type String. However, at times I have dataframes from other tenants, as below:

    itemName, itemCategory
    Name1, 0
    Name2, 1
    Name3, 0

In this case, after being written as partitions and read back, the resulting dataframe will have Int as the data type of itemCategory. The Parquet file has the metadata
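Spark infers the types of partition columns from the directory names when reading; disabling that inference keeps them as strings for every tenant. A minimal sketch, assuming a Spark session and a placeholder path:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-types").getOrCreate()

    # With inference disabled, partition columns such as itemCategory are always
    # read back as strings, regardless of whether the values look numeric.
    spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

    df = spark.read.parquet("/data/items_partitioned")   # placeholder path
    df.printSchema()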

Unable to get parquet-tools working from the command-line

ⅰ亾dé卋堺 submitted on 2019-12-06 04:55:58
Question: I'm attempting to get the newest version of parquet-tools running, but I'm having some issues. For some reason org.apache.hadoop.conf.Configuration isn't in the shaded jar (I have the same issue with v1.6.0 as well). Is there something beyond mvn package or mvn install that I should be doing? The actual mvn invocation I'm using is:

    mvn install -DskipTests -pl \!parquet-thrift,\!parquet-cascading,\!parquet-pig-bundle,\!parquet-pig,\!parquet-scrooge,\!parquet-hive,\!parquet-protobuf

This
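If the immediate goal is just to inspect a parquet file while the shaded-jar build is being sorted out, a hedged alternative (not a fix for the Maven build itself) is to read the footer with pyarrow; the file name below is a placeholder.

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("example.parquet")   # placeholder file name

    # Rough equivalents of `parquet-tools schema` and `parquet-tools meta`
    print(pf.schema_arrow)   # column names and types
    print(pf.metadata)       # row groups, row counts, created_by, etc.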