partitioning

Spark lists all leaf nodes even in partitioned data

人盡茶涼 posted on 2019-11-29 01:33:48
I have parquet data partitioned by date & hour, folder structure:

events_v3
  -- event_date=2015-01-01
    -- event_hour=2015-01-1
      -- part10000.parquet.gz
  -- event_date=2015-01-02
    -- event_hour=5
      -- part10000.parquet.gz

I have created a table raw_events via Spark, but when I try to query it, it scans all the directories for footers, and that slows down the initial query, even when I am querying only one day's worth of data.

Query:

select * from raw_events where event_date='2016-01-01'

Similar problem: http://mail-archives.apache.org/mod_mbox/spark-user/201508.mbox/%3CCAAswR-7Qbd2tdLSsO76zyw9tvs
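One mitigation often suggested for this kind of listing cost (a sketch under assumptions, not the asker's confirmed fix) is to read the wanted partition directory directly while supplying basePath, so Spark lists only that day's files yet still exposes the partition columns. The root path and the SparkSession name below are invented for illustration:

// Hypothetical sketch: limit directory listing to a single day's partition.
// Assumes a Spark 2.x SparkSession named `spark` and an invented root path.
val oneDay = spark.read
  .option("basePath", "hdfs:///data/events_v3")            // keeps event_date/event_hour as columns
  .parquet("hdfs:///data/events_v3/event_date=2016-01-01") // only this subtree is listed
oneDay.createOrReplaceTempView("raw_events_one_day")
spark.sql("select * from raw_events_one_day").show()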

Table partitioning using 2 columns

你离开我真会死。 posted on 2019-11-28 23:17:48
Is it possible to partition a table using 2 columns instead of only 1 for the partition function? Consider a table with 3 columns: ID (int, primary key), Date (datetime), Num (int). I want to partition this table by 2 columns: Date and Num. This is what I do to partition a table using 1 column (Date):

create PARTITION FUNCTION PFN_MonthRange (datetime)
AS RANGE LEFT FOR VALUES
    ('2009-11-30 23:59:59:997',
     '2009-12-31 23:59:59:997',
     '2010-01-31 23:59:59:997',
     '2010-02-28 23:59:59:997',
     '2010-03-31 23:59:59:997')
go

Bad News: The partition function has to be defined on a single column. Good News:

Partitioning a data set in R based on multiple classes of observations

白昼怎懂夜的黑 posted on 2019-11-28 22:54:55
Question: I'm trying to partition a data set that I have in R: 2/3 for training and 1/3 for testing. I have one classification variable and seven numerical variables. Each observation is classified as either A, B, C, or D. For simplicity's sake, let's say that the classification variable, cl, is A for the first 100 observations, B for observations 101 to 200, C up to 300, and D up to 400. I'm trying to get a partition that has 2/3 of the observations for each of A, B, C, and D (as opposed to simply
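The thread is about R, but the underlying idea, taking 2/3 within each class rather than 2/3 overall, is language-independent; here is an illustrative Scala sketch with invented names (Obs and stratifiedSplit are not from the question):

// Illustrative only: a stratified 2/3 train / 1/3 test split by class label.
import scala.util.Random

case class Obs(cl: String, features: Vector[Double])

def stratifiedSplit(data: Seq[Obs], trainFrac: Double = 2.0 / 3.0,
                    seed: Long = 42L): (Seq[Obs], Seq[Obs]) = {
  val rng = new Random(seed)
  // Split each class separately so every class keeps the 2/3 : 1/3 ratio.
  val perClass = data.groupBy(_.cl).values.map { group =>
    val shuffled = rng.shuffle(group)
    shuffled.splitAt((shuffled.size * trainFrac).round.toInt)
  }
  (perClass.flatMap(_._1).toSeq, perClass.flatMap(_._2).toSeq)
}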

Spark SQL saveAsTable is not compatible with Hive when partition is specified

若如初见. posted on 2019-11-28 20:50:33
Kind of an edge case: when saving a parquet table in Spark SQL with a partition,

// schema definition
final StructType schema = DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("time", DataTypes.StringType, true),
    DataTypes.createStructField("accountId", DataTypes.StringType, true),
    ...

DataFrame df = hiveContext.read().schema(schema).json(stringJavaRDD);
df.coalesce(1)
  .write()
  .mode(SaveMode.Append)
  .format("parquet")
  .partitionBy("year")
  .saveAsTable("tblclick8partitioned");

Spark warns:

Persisting partitioned data source relation into Hive metastore in Spark SQL specific
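A workaround often used for this warning (a sketch under assumptions, not the thread's confirmed answer) is to write the partitioned files to a path directly and declare an external Hive table over that location, so the table definition lives in Hive's own format. The path and the trimmed column list below are invented:

// Hypothetical sketch in Scala: sidestep saveAsTable's Spark-specific
// metastore entry by writing files, then registering an external table.
import org.apache.spark.sql.SaveMode

df.write
  .mode(SaveMode.Append)
  .format("parquet")
  .partitionBy("year")
  .save("hdfs:///warehouse/tblclick8partitioned")

hiveContext.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS tblclick8partitioned
    |(time STRING, accountId STRING)
    |PARTITIONED BY (year STRING)
    |STORED AS PARQUET
    |LOCATION 'hdfs:///warehouse/tblclick8partitioned'""".stripMargin)

// Make Hive pick up the partition directories that were just written.
hiveContext.sql("MSCK REPAIR TABLE tblclick8partitioned")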

Spark: save DataFrame partitioned by “virtual” column

会有一股神秘感。 posted on 2019-11-28 19:19:17
Question: I'm using PySpark to do a classic ETL job (load dataset, process it, save it) and want to save my DataFrame as files in a directory partitioned by a "virtual" column. What I mean by "virtual" is that I have a column Timestamp, a string containing an ISO 8601 encoded date, and I'd like to partition by Year / Month / Day; but I don't actually have a Year, Month, or Day column in the DataFrame. I can derive these columns from the Timestamp, but I don't want my
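The question is PySpark, but the usual approach reads almost identically in Scala: derive the columns, hand them to partitionBy, and they end up encoded in the directory layout rather than inside the files. A sketch, assuming a Spark 2.x DataFrame named df and an invented output path:

// Illustrative sketch: materialize year/month/day from the ISO 8601 string,
// then let partitionBy fold them into the path (year=2015/month=1/day=1/...).
import org.apache.spark.sql.functions.{col, to_timestamp, year, month, dayofmonth}

val withParts = df
  .withColumn("year",  year(to_timestamp(col("Timestamp"))))
  .withColumn("month", month(to_timestamp(col("Timestamp"))))
  .withColumn("day",   dayofmonth(to_timestamp(col("Timestamp"))))

withParts.write.partitionBy("year", "month", "day").parquet("/out/events")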

Apache Spark: Get number of records per partition

旧城冷巷雨未停 posted on 2019-11-28 18:26:38
I want to check how we can get information about each partition, such as the total number of records in each partition, on the driver side when a Spark job is submitted with deploy mode yarn-cluster, in order to log or print it on the console.

You can get the number of records per partition like this:

df
  .rdd
  .mapPartitionsWithIndex { case (i, rows) => Iterator((i, rows.size)) }
  .toDF("partition_number", "number_of_records")
  .show

But this will also launch a Spark job by itself (because the file must be read by Spark to get the number of records). Spark could also read Hive table statistics, but I don't know
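An alternative that stays in the DataFrame API (my own sketch, not from the thread) is the built-in spark_partition_id function; it still triggers a job for the same reason:

// Hypothetical alternative: group rows by the partition that holds them.
import org.apache.spark.sql.functions.spark_partition_id

df.groupBy(spark_partition_id().as("partition_number"))
  .count()
  .show()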

Database - Designing an “Events” Table

烂漫一生 posted on 2019-11-28 16:24:21
After reading the tips from this great Nettuts+ article, I've come up with a table schema that would separate highly volatile data from other tables subjected to heavy reads, and at the same time lower the number of tables needed in the whole database schema. However, I'm not sure whether this is a good idea, since it doesn't follow the rules of normalization, and I would like to hear your advice. Here is the general idea: I have four types of users modeled in a Class Table Inheritance structure; in the main "user" table I store data common to all the users (id, username, password, several flags, ...

How to partition MySQL across MULTIPLE SERVERS?

帅比萌擦擦* posted on 2019-11-28 16:07:22
I know that with horizontal partitioning you can create many tables. How can you do this with multiple servers? That would allow MySQL to scale. Create X tables on X servers? Does anyone care to explain, or have a good beginner's tutorial (step-by-step) that teaches you how to partition across multiple servers?

With MySQL, people generally do what is called application-based sharding. In a nutshell, you will have the same database structure on multiple database servers, but they won't contain the same data. So, for example:

Users 1 - 10000: server A
Users 10001 - 20000: server B

Sharding (of course)
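To make the routing side of application-based sharding concrete, here is a minimal sketch in Scala (shard names, JDBC URLs, and ranges are invented, mirroring the answer's example):

// Illustrative only: the application layer owns the user-id -> server mapping.
case class Shard(name: String, jdbcUrl: String, userIds: Range)

val shards = Seq(
  Shard("A", "jdbc:mysql://server-a/app", 1 to 10000),
  Shard("B", "jdbc:mysql://server-b/app", 10001 to 20000)
)

def shardFor(userId: Int): Shard =
  shards.find(_.userIds.contains(userId))
    .getOrElse(sys.error(s"no shard owns user $userId"))

// Every query for a given user is then sent to shardFor(userId).jdbcUrl.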

Is Zookeeper a must for Kafka?

和自甴很熟 posted on 2019-11-28 15:51:16
In Kafka, I would like to use only a single broker, a single topic, and a single partition, with one producer and multiple consumers (each consumer getting its own copy of the data from the broker). Given this, I do not want the overhead of using Zookeeper; can I not just use the broker only? Why is Zookeeper a must?

John Petrone: Yes, Zookeeper is required for running Kafka. From the Kafka Getting Started documentation:

Step 2: Start the server
Kafka uses zookeeper so you need to first start a zookeeper server if you don't already have one. You can use the convenience script packaged with kafka to

What is a good way to horizontally shard in PostgreSQL?

徘徊边缘 posted on 2019-11-28 15:34:41
Question: What is a good way to horizontally shard in PostgreSQL?
1. pgpool 2
2. GridSQL
Which is the better way to do sharding? Also, is it possible to partition without changing client code? It would be great if someone could share a simple tutorial or cookbook example of how to set up and use sharding.

Answer 1: PostgreSQL allows partitioning in two different ways. One is by range and the other is by list. Both use table inheritance to do partitioning. Partitioning by range, usually a date range, is the most common,