parquet

Passing a Paramiko connection SFTPFile as input to a dask.dataframe.read_parquet

Submitted by 徘徊边缘 on 2019-12-08 08:10:38
Question: I tried to pass a paramiko.sftp_file.SFTPFile object instead of a file URL to pandas.read_parquet and it worked fine. But when I tried the same with Dask, it threw an error. Below is the code I tried to run and the error I get. How can I make this work?

import dask.dataframe as dd
import paramiko

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
sftp_client = ssh.open_sftp()
source_file = sftp_client.open(str(parquet_file), 'rb')
full_df = dd.read_parquet(source_file …
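One hedged way to make this work (not from the original thread): Dask resolves remote paths through fsspec, which has an SFTP backend built on Paramiko, so the file can be addressed by URL instead of passing an open SFTPFile handle. The host, path, and credentials below are placeholders.

import dask.dataframe as dd

# Assumes fsspec's SFTP implementation (paramiko-based) is installed.
# Host, remote path, and credentials are hypothetical.
full_df = dd.read_parquet(
    "sftp://example-host/remote/data/file.parquet",
    storage_options={"username": "user", "password": "secret"},
)
print(full_df.head())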

How to handle TIMESTAMP_MICROS parquet fields in Presto/Athena

Submitted by 帅比萌擦擦* on 2019-12-08 03:56:17
Question: Presently, we have a DMS task that takes the contents of a MySQL DB and dumps files to S3 in Parquet format. The timestamps in the Parquet files end up as TIMESTAMP_MICROS. This is a problem because Presto (the underlying implementation of Athena) does not support microsecond-precision timestamps and assumes that all timestamps are in millisecond precision. This does not cause any errors directly, but it makes the times display as some extreme future date, as it is …
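One hedged workaround (not from the original question) is to rewrite the files with millisecond timestamps before Athena reads them, for example with pyarrow; the file names here are placeholders.

import pyarrow.parquet as pq

table = pq.read_table("dms_output.parquet")  # hypothetical input file
# coerce_timestamps downcasts microsecond timestamps to milliseconds;
# allow_truncated_timestamps suppresses the error that is otherwise raised
# when sub-millisecond precision is lost in the conversion.
pq.write_table(
    table,
    "dms_output_ms.parquet",
    coerce_timestamps="ms",
    allow_truncated_timestamps=True,
)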

Streaming parquet file python and only downsampling

Submitted by 回眸只為那壹抹淺笑 on 2019-12-07 16:07:25
Question: I have data in Parquet format that is too big to fit into memory (6 GB). I am looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, down-sample it, and save the result to a dataframe? Ultimately, I would like to have the data in dataframe format to work with. Am I wrong to attempt this without using a Spark framework? I have tried pyarrow and fastparquet, but I get memory errors when trying to read the entire file in. Any tips or suggestions would be …
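A hedged sketch of one approach (not from the original thread, and assuming a pyarrow version that provides ParquetFile.iter_batches): read the file in record batches and keep only a sample of each batch, so the full 6 GB never has to be in memory at once. The path and sampling fraction are placeholders.

import pandas as pd
import pyarrow.parquet as pq

pf = pq.ParquetFile("big_file.parquet")  # hypothetical path
samples = []
for batch in pf.iter_batches(batch_size=100_000):
    chunk = batch.to_pandas()
    # keep roughly 1% of each batch; only the sample is retained in memory
    samples.append(chunk.sample(frac=0.01))
df = pd.concat(samples, ignore_index=True)
print(len(df))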

What controls the number of partitions when reading Parquet files?

Submitted by 筅森魡賤 on 2019-12-07 13:51:43
Question: My setup: two Spark clusters, one on EC2 and one on Amazon EMR, both with Spark 1.3.1. The EMR cluster was installed with emr-bootstrap-actions; the EC2 cluster was installed with Spark's default EC2 scripts. The code reads a folder containing 12 Parquet files and counts the number of partitions:

val logs = sqlContext.parquetFile("s3n://mylogs/")
logs.rdd.partitions.length

Observations: on EC2 this code gives me 12 partitions (one per file, which makes sense). On EMR this code gives me 138 (!) …
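A hedged illustration (in PySpark rather than the Scala of the question): the partition count of a Parquet read is driven by the Hadoop input splits, which depend on the block size the underlying filesystem reports, so the same 12 files can be split differently on EMR than on EC2. The sketch below shows one way to inspect and rein in the count; the path is the one from the question, and the split size is illustrative.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="partition-check")
sqlContext = SQLContext(sc)

# Raise the minimum split size so the files are not chopped into many tiny
# partitions (64 MB here is illustrative; the exact property name honored
# can vary by Hadoop version).
sc._jsc.hadoopConfiguration().set("mapred.min.split.size", str(64 * 1024 * 1024))

logs = sqlContext.parquetFile("s3n://mylogs/")  # Spark 1.3-era API, as in the question
print(logs.rdd.getNumPartitions())

# Alternatively, collapse partitions after the read without a shuffle.
logs12 = logs.coalesce(12)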

Design of Spark + Parquet “database”

Submitted by 谁说胖子不能爱 on 2019-12-07 09:38:52
Question: I've got 100 GB of text files coming in daily, and I wish to create an efficient "database" accessible from Spark. By "database" I mean the ability to execute fast queries on the data (going back about a year) and to incrementally add data each day, preferably without read locks. Assuming I want to use Spark SQL and Parquet, what's the best way to achieve this?

- give up on concurrent reads/writes and append new data to the existing Parquet file
- create a new Parquet file for each day of data, and use …
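A hedged sketch of the second option (one Parquet dataset partitioned by day, appended to daily), not from the original thread: paths, the input format, and column names are illustrative, and SparkSession assumes Spark 2.x.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-ingest").getOrCreate()

# Hypothetical raw daily drop: JSON lines with a "ts" timestamp field.
raw = spark.read.json("s3a://my-bucket/incoming/2019-12-07/")

# Append today's data as a new partition of a single Parquet dataset; queries
# that filter on event_date then prune partitions instead of scanning a year.
(raw.withColumn("event_date", F.to_date("ts"))
    .write.mode("append")
    .partitionBy("event_date")
    .parquet("s3a://my-bucket/warehouse/events"))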

Spark's int96 time type

Submitted by 纵饮孤独 on 2019-12-07 05:17:05
Question: When you create a timestamp column in Spark and save it to Parquet, you get a 12-byte integer column type (int96); I gather the data is split into 8 bytes for nanoseconds within the day and 4 bytes for the Julian day. This does not conform to any Parquet logical type, so the schema in the Parquet file gives no indication that the column is anything but an integer. My question is: how does Spark know to load such a column as a timestamp rather than a big integer? Answer 1: Semantics is …
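A hedged way to see this from outside Spark (not part of the original answer): the footer of a Spark-written Parquet file carries both the INT96 physical type and a Spark-specific key/value entry holding the original schema, which is what lets the column be mapped back to a timestamp on read. The path below is a placeholder.

import pyarrow.parquet as pq

md = pq.ParquetFile("spark_output/part-00000.parquet").metadata  # hypothetical path
print(md.schema)  # the timestamp column shows physical type INT96

# Spark embeds its own schema in the footer's key/value metadata; this entry
# is what identifies the INT96 column as a timestamp rather than an integer.
kv = md.metadata or {}
print(kv.get(b"org.apache.spark.sql.parquet.row.metadata"))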

Why is parquet slower for me against text file format in hive?

Submitted by 我们两清 on 2019-12-07 05:12:20
Question: OK! So I decided to use Parquet as the storage format for Hive tables, and before actually implementing it in my cluster I ran some tests. Surprisingly, Parquet was slower in my tests, against the general notion that it is faster than plain text files. Please note that I am using Hive 0.13 on MapR. The flow of my operations:

Table A: format text, size 2.5 GB
Table B: format Parquet, size 1.9 GB [create table B stored as parquet as select * from A]
…
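Not from the original thread, but one hedged way to see why this can happen: Parquet's advantage comes largely from reading only the columns (and row groups) a query needs, so a full-row scan such as select * gains little and can even lose to plain text because of decoding overhead. The snippet below illustrates the column-pruning effect with pyarrow; the file name and column name are placeholders.

import pyarrow.parquet as pq

# Full read: every column is decoded, similar in spirit to "select * from B".
full = pq.read_table("table_b.parquet")

# Pruned read: only one column is decoded; this is where Parquet tends to win.
one_col = pq.read_table("table_b.parquet", columns=["some_column"])

print(full.num_columns, one_col.num_columns)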

Apache Spark Parquet: Cannot build an empty group

Submitted by 五迷三道 on 2019-12-07 03:59:26
Question: I use Apache Spark 2.1.1 (I used 2.1.0 and it was the same; I switched today). I have a dataset:

root
|-- muons: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- reco::Candidate: struct (nullable = true)
|    |    |-- qx3_: integer (nullable = true)
|    |    |-- pt_: float (nullable = true)
|    |    |-- eta_: float (nullable = true)
|    |    |-- phi_: float (nullable = true)
|    |    |-- mass_: float (nullable = true)
|    |    |-- vertex_: struct (nullable = true)
|    |    |    |-- fCoordinates: struct …

Spark on embedded mode - user/hive/warehouse not found

Submitted by 允我心安 on 2019-12-07 01:59:28
Question: I'm using Apache Spark in embedded local mode. I have all the dependencies included in my pom.xml, all in the same version (spark-core_2.10, spark-sql_2.10, and spark-hive_2.10). I just want to run a HiveQL query to create a table (stored as Parquet). Running the following (rather simple) code:

public class App {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        SparkConf sparkConf = new SparkConf()
            .setAppName("JavaSparkSQL")
            .setMaster("local[2]")
            .set("spark …
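A hedged sketch of the usual fix (shown in PySpark rather than the Java of the question, and assuming Spark 2.x's SparkSession): point the warehouse at a local directory so embedded local mode does not go looking for /user/hive/warehouse on a filesystem that does not have it. Paths and the table definition are illustrative.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[2]")
         .appName("JavaSparkSQL")
         # On Spark 1.x the analogous setting is hive.metastore.warehouse.dir.
         .config("spark.sql.warehouse.dir", "file:///tmp/spark-warehouse")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE TABLE IF NOT EXISTS events (id INT) STORED AS PARQUET")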

Build failure - Apache Parquet-MR source (mvn install failure)

Submitted by 那年仲夏 on 2019-12-07 01:06:16
Question: I am getting the following error while trying to execute "mvn clean install" to build the parquet-mr source obtained from https://github.com/apache/parquet-mr:

[INFO] Storing buildScmBranch: UNKNOWN
[INFO]
[INFO] --- maven-remote-resources-plugin:1.5:process (default) @ parquet-generator ---
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Apache Parquet MR ................................. SUCCESS [1.494s]
[INFO] Apache Parquet …