parquet

dask dataframe read parquet schema difference

余生颓废 submitted on 2019-11-29 12:16:05
I do the following:

    import dask.dataframe as dd
    from dask.distributed import Client
    client = Client()
    raw_data_df = dd.read_csv('dataset/nyctaxi/nyctaxi/*.csv', assume_missing=True, parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])

The dataset is taken from a presentation Matthew Rocklin made and was used as a dask dataframe demo. Then I try to write it to parquet using pyarrow:

    raw_data_df.to_parquet(path='dataset/parquet/2015.parquet/')  # only pyarrow is installed

Trying to read it back:

    raw_data_df = dd.read_parquet(path='dataset/parquet/2015.parquet/')

I get the following
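The excerpt cuts off before the error message. As a point of reference (a sketch, not the poster's exact fix), a minimal dask round trip that pins the same parquet engine on both the write and the read looks like this:

    import dask.dataframe as dd

    # Read the CSVs, forcing floats for sparsely populated columns and parsing the timestamps.
    df = dd.read_csv(
        'dataset/nyctaxi/nyctaxi/*.csv',
        assume_missing=True,
        parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
    )

    # Write and read back with the same engine so both sides agree on the schema.
    df.to_parquet('dataset/parquet/2015.parquet/', engine='pyarrow')
    df2 = dd.read_parquet('dataset/parquet/2015.parquet/', engine='pyarrow')
    print(df2.dtypes)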

Read parquet data from AWS s3 bucket

我的未来我决定 submitted on 2019-11-29 09:34:33
I need to read parquet data from an AWS S3 bucket. If I use the AWS SDK for this I can get an input stream like this:

    S3Object object = s3Client.getObject(new GetObjectRequest(bucketName, bucketKey));
    InputStream inputStream = object.getObjectContent();

But the Apache parquet reader only uses a local file, like this:

    ParquetReader<Group> reader = ParquetReader.builder(new GroupReadSupport(), new Path(file.getAbsolutePath()))
        .withConf(conf)
        .build();
    reader.read()

So I don't know how to parse an input stream for a parquet file. For example, for CSV files there is CSVParser, which uses an input stream. I know a solution is to use spark
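The question asks for a Java solution and the excerpt is truncated before any answer. Purely as a comparison sketch (not the poster's approach), pyarrow in Python can read a parquet object straight out of S3 through s3fs; the bucket and key below are invented for illustration:

    import pyarrow.parquet as pq
    import s3fs

    # Credentials are picked up from the usual AWS config/environment variables.
    fs = s3fs.S3FileSystem()

    # Hypothetical bucket and key.
    with fs.open('my-bucket/path/to/data.parquet', 'rb') as f:
        table = pq.read_table(f)

    print(table.schema)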

Apache Drill has bad performance against SQL Server

十年热恋 submitted on 2019-11-29 09:16:53
I tried using apache-drill to run a simple join-aggregate query and the speed wasn't really good. My test query was:

    SELECT p.Product_Category, SUM(f.sales)
    FROM facts f
    JOIN Product p on f.pkey = p.pkey
    GROUP BY p.Product_Category

where facts has about 422,000 rows and product has 600 rows. The grouping comes back with 4 rows. First I tested this query on SQL Server and got a result back in about 150 ms. With Drill I first tried to connect directly to SQL Server and run the query, but that was slow (about 5 sec). Then I tried saving the tables into json files and reading from them, but that was

Is it better to have one large parquet file or lots of smaller parquet files?

萝らか妹 submitted on 2019-11-29 06:55:16
I understand HDFS will split files into something like 64 MB chunks. We have data coming in as a stream and we can store it in large files or medium-sized files. What is the optimum size for columnar file storage? If I can store files where the smallest column is 64 MB, would it save any computation time over having, say, 1 GB files?

Aim for around 1 GB per file (Spark partition) (1). Ideally, you would use snappy compression (the default), because snappy-compressed parquet files are splittable (2). Using snappy instead of gzip will significantly increase the file size, so if storage space is an
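A minimal PySpark sketch of the advice above: repartition before writing so each output file lands near the 1 GB target, and keep snappy compression. The partition count and paths are placeholders to be tuned for the real data volume:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.csv('incoming/*.csv', header=True, inferSchema=True)  # placeholder input

    # Choose the partition count so that total data size / num_files is roughly 1 GB per file.
    num_files = 200  # placeholder value
    (df.repartition(num_files)
       .write
       .mode('overwrite')
       .parquet('warehouse/events.parquet', compression='snappy'))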

How to Generate Parquet File Using Pure Java (Including Date & Decimal Types) And Upload to S3 [Windows] (No HDFS)

强颜欢笑 submitted on 2019-11-29 05:18:02
I recently had a requirement where I needed to generate Parquet files that could be read by Apache Spark using only Java (with no additional software installations such as Apache Drill, Hive, Spark, etc.). The files needed to be saved to S3, so I will be sharing details on how to do both. There were no simple-to-follow guides on how to do this. I'm also not a Java programmer, so the concepts of using Maven, Hadoop, etc. were all foreign to me, and it took me nearly two weeks to get this working. I'd like to share my personal guide below on how I achieved this. Disclaimer: The code samples below
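The guide itself is cut off here. Purely for comparison, and not the author's Java code, here is a short Python/pyarrow sketch of the same end result: writing date and decimal columns to a parquet file and uploading it to S3, with invented column names and bucket:

    from datetime import date
    from decimal import Decimal

    import boto3
    import pyarrow as pa
    import pyarrow.parquet as pq

    # A tiny table with date and decimal columns (invented example data).
    table = pa.table({
        'order_date': pa.array([date(2019, 1, 1), date(2019, 1, 2)], type=pa.date32()),
        'amount': pa.array([Decimal('19.99'), Decimal('5.00')], type=pa.decimal128(10, 2)),
    })
    pq.write_table(table, 'orders.parquet')

    # Upload to S3 (hypothetical bucket and key).
    boto3.client('s3').upload_file('orders.parquet', 'my-bucket', 'exports/orders.parquet')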

how to merge multiple parquet files to single parquet file using linux or hdfs command?

会有一股神秘感。 submitted on 2019-11-29 02:54:10
Question: I have multiple small parquet files generated as the output of a Hive QL job, and I would like to merge the output files into a single parquet file. What is the best way to do it using hdfs or linux commands? We used to merge text files using the cat command, but will this work for parquet as well? Can we do it with HiveQL itself when writing the output files, like we do with the repartition or coalesce methods in Spark?

Answer 1: According to this https://issues.apache.org/jira/browse/PARQUET-460 Now you
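The answer is truncated at the JIRA reference. As a rough illustration of the merge idea (not the approach the answer goes on to describe), a small pyarrow sketch that reads the small files and rewrites them as one, with placeholder paths:

    import glob

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Read every small file produced by the Hive job and concatenate the tables in memory.
    parts = [pq.read_table(p) for p in sorted(glob.glob('hive-output/*.parquet'))]
    merged = pa.concat_tables(parts)

    # Write a single combined file instead of many tiny ones.
    pq.write_table(merged, 'merged.parquet')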

How to view Apache Parquet file in Windows?

落花浮王杯 submitted on 2019-11-29 02:29:45
Question: I couldn't find any plain-English explanations of Apache Parquet files, such as: What are they? Do I need Hadoop or HDFS to view/create/store them? How can I create parquet files? How can I view parquet files? Any help regarding these questions is appreciated.

Answer 1: What is Apache Parquet? Apache Parquet is a binary file format that stores data in a columnar fashion. Data inside a Parquet file is similar to an RDBMS-style table where you have columns and rows. But instead of accessing
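To illustrate the point that no Hadoop or HDFS is required, here is a small sketch (not from the answer itself) that uses pandas with pyarrow to create a parquet file and read it back on an ordinary machine; the file name is arbitrary:

    import pandas as pd

    # Create a small parquet file on the local filesystem - no Hadoop/HDFS involved.
    df = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})
    df.to_parquet('example.parquet', engine='pyarrow')

    # View it by reading it back.
    print(pd.read_parquet('example.parquet', engine='pyarrow'))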

How to convert spark SchemaRDD into RDD of my case class?

你说的曾经没有我的故事 submitted on 2019-11-29 01:35:05
Question: In the spark docs it's clear how to create parquet files from an RDD of your own case classes (from the docs):

    val people: RDD[Person] = ??? // An RDD of case class objects, from the previous example.
    // The RDD is implicitly converted to a SchemaRDD by createSchemaRDD, allowing it to be stored using Parquet.
    people.saveAsParquetFile("people.parquet")

But it's not clear how to convert back; really we want a method readParquetFile where we can do:

    val people: RDD[Person] = sc.readParquetFile[Person]

Spark lists all leaf node even in partitioned data

人盡茶涼 submitted on 2019-11-29 01:33:48
I have parquet data partitioned by date & hour; folder structure:

    events_v3
      -- event_date=2015-01-01
        -- event_hour=2015-01-1
          -- part10000.parquet.gz
      -- event_date=2015-01-02
        -- event_hour=5
          -- part10000.parquet.gz

I have created a table raw_events via spark, but when I try to query it, it scans all the directories for footers and that slows down the initial query, even if I am querying only one day's worth of data. Query:

    select * from raw_events where event_date='2016-01-01'

A similar problem: http://mail-archives.apache.org/mod_mbox/spark-user/201508.mbox/%3CCAAswR-7Qbd2tdLSsO76zyw9tvs
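The excerpt is cut off before any answer. One commonly used workaround (a sketch, not necessarily what the thread settled on) is to point Spark directly at the partition that is needed while keeping partition-column discovery through the basePath option; the paths are assumed from the layout above:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read only one day's partition; basePath keeps event_date/event_hour as columns.
    day = (spark.read
           .option('basePath', 'events_v3/')
           .parquet('events_v3/event_date=2015-01-01/'))

    day.createOrReplaceTempView('raw_events_day')
    spark.sql("select count(*) from raw_events_day").show()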

Exclusive | 10 Common Programming Mistakes Data Scientists Make (With Solutions)

不羁岁月 submitted on 2019-11-29 00:25:16
Introduction: this article provides solutions for 10 mistakes that data scientists commonly make, as seen by a senior data scientist.

A data scientist is "someone who is better at statistics than a software engineer and better at software engineering than a statistician." Many data scientists have a statistics background but little experience with software engineering. I am a senior data scientist, ranked in the top 1% on Stack Overflow for Python programming, and I work with many (junior) data scientists. Below are the 10 common mistakes I see most often, together with their solutions:

1. Not sharing the data referenced in the code
2. Hard-coding paths that others cannot access
3. Mixing code and data
4. Committing data to Git together with the source code
5. Writing functions instead of DAGs
6. Writing for loops
7. Not writing unit tests
8. Not documenting code
9. Saving data as csv or pickle files
10. Using jupyter notebooks

1. Not sharing the data referenced in the code

Data science needs both code and data, so for others to reproduce your results they need access to the data. It sounds simple, but many people forget to share the data their code refers to.

    import pandas as pd
    df1 = pd.read_csv('file-i-dont-have.csv')  # fails
    do_stuff(df)

Solution: use d6tpipe (https://github.com/d6t/d6tpipe) to share the data files referenced in your code, upload them to S3/the web/Google Drive etc., or save them to a database so that others can retrieve the files (but do not add them to git; the reason is explained below).
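The article recommends d6tpipe. As a simpler alternative sketch (not from the article), the data can also live at a shared location such as a public URL or an S3 bucket, so that anyone running the script fetches the same file; the URL below is a placeholder:

    import pandas as pd

    # Placeholder URL - point this at wherever the shared copy of the data lives.
    DATA_URL = 'https://example.com/shared/data.csv'

    df1 = pd.read_csv(DATA_URL)  # anyone running the script can fetch the same file
    print(df1.head())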