partitioning

In Apache Spark, why does RDD.union not preserve the partitioner?

我是研究僧i submitted on 2019-11-27 13:11:05
As everyone knows, partitioners in Spark have a huge performance impact on any "wide" operation, so they are usually customized for such operations. I was experimenting with the following code:

val rdd1 = sc.parallelize(1 to 50).keyBy(_ % 10)
  .partitionBy(new HashPartitioner(10))
val rdd2 = sc.parallelize(200 to 230).keyBy(_ % 13)
val cogrouped = rdd1.cogroup(rdd2)
println("cogrouped: " + cogrouped.partitioner)
val unioned = rdd1.union(rdd2)
println("union: " + unioned.partitioner)

I see that by default cogroup() always yields an RDD with the customized partitioner, but union() doesn't; it will always…
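For context (not part of the original post): in reasonably recent Spark versions, union preserves a partitioner only when every input RDD already has the same partitioner; otherwise it falls back to a plain concatenation with no partitioner at all. A minimal Scala sketch of the preserving case, reusing the question's sc:

import org.apache.spark.HashPartitioner

// Give both RDDs the very same partitioner, so union can keep it.
val p = new HashPartitioner(10)
val a = sc.parallelize(1 to 50).keyBy(_ % 10).partitionBy(p)
val b = sc.parallelize(200 to 230).keyBy(_ % 13).partitionBy(p)

// Prints Some(org.apache.spark.HashPartitioner@...): the partitioner survives.
// With different (or missing) partitioners, as in the question, this prints None.
println(a.union(b).partitioner)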

Database - Designing an “Events” Table

▼魔方 西西 submitted on 2019-11-27 09:52:23
Question: After reading the tips from this great Nettuts+ article, I've come up with a table schema that separates highly volatile data from other tables subject to heavy reads, while also lowering the number of tables needed in the whole database schema. However, I'm not sure this is a good idea, since it doesn't follow the rules of normalization, and I would like to hear your advice. Here is the general idea: I have four types of users modeled in a Class Table Inheritance structure, in the…

Why does partition elimination not happen for this query?

两盒软妹~` submitted on 2019-11-27 08:53:21
Question: I have a Hive table which is partitioned by year, month, day and hour. I need to run a query against it to fetch the last 7 days of data. This is on Hive 0.14.0.2.2.4.2-2. My query currently looks like this:

SELECT COUNT(column_name) FROM table_name
WHERE year >= year(date_sub(from_unixtime(unix_timestamp()), 7))
  AND month >= month(date_sub(from_unixtime(unix_timestamp()), 7))
  AND day >= day(date_sub(from_unixtime(unix_timestamp()), 7));

This takes a very long time. When I substitute the…
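A likely culprit (my reading, not stated in the excerpt) is that unix_timestamp() is non-deterministic, so Hive cannot fold the predicate into constants and therefore cannot prune partitions. A sketch of the usual workaround: compute the cutoff on the client and issue only literal comparisons. It is written as Spark/Scala against a hypothetical SparkSession named spark (the question runs plain Hive); the table and column names come from the question:

import java.time.LocalDate

// Compute the boundary date outside the query so the partition predicate
// contains only literals, which lets Hive prune partitions instead of scanning all.
val cutoff = LocalDate.now().minusDays(7)
val query =
  s"""SELECT COUNT(column_name) FROM table_name
     |WHERE year  >= ${cutoff.getYear}
     |  AND month >= ${cutoff.getMonthValue}
     |  AND day   >= ${cutoff.getDayOfMonth}""".stripMargin

// Note: the AND-of-ranges logic mirrors the question's query and still breaks
// across month/year boundaries; it is kept only to illustrate the pruning point.
spark.sql(query).show()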

Determining optimal number of Spark partitions based on workers, cores and DataFrame size

让人想犯罪 __ submitted on 2019-11-27 06:50:31
There are several similar-yet-different concepts in Spark-land surrounding how work gets farmed out to different nodes and executed concurrently. Specifically, there are:

- the Spark driver node (sparkDriverCount)
- the number of worker nodes available to a Spark cluster (numWorkerNodes)
- the number of Spark executors (numExecutors)
- the DataFrame being operated on by all workers/executors, concurrently (dataFrame)
- the number of rows in the dataFrame (numDFRows)
- the number of partitions of the dataFrame (numPartitions)
- and finally, the number of CPU cores available on each worker node (…
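As a rough illustration (a common rule of thumb, not a formula from the question): aim for a small multiple of the total core count, so every core has work and stragglers are amortized. All concrete numbers below are assumptions, and dataFrame stands for the question's DataFrame:

// Roughly 2-4 tasks per available core is the usual guideline.
val numWorkerNodes    = 4      // assumed cluster size, for illustration only
val numCoresPerWorker = 8
val tasksPerCore      = 3      // somewhere in the typical 2-4 range

val targetPartitions = numWorkerNodes * numCoresPerWorker * tasksPerCore   // 96

val repartitioned = dataFrame.repartition(targetPartitions)
println(repartitioned.rdd.getNumPartitions)   // confirm the new partition count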

How to understand the dynamic programming solution in linear partitioning?

烂漫一生 submitted on 2019-11-27 06:12:50
I'm struggling to understand the dynamic programming solution to the linear partitioning problem. I am reading The Algorithm Design Manual and the problem is described in section 8.5. I've read the section countless times but I'm just not getting it. I think it's a poor explanation (what I've read up to now has been much better), but I've not been able to understand the problem well enough to look for an alternative explanation. Links to better explanations are welcome! I've found a page with text similar to the book (maybe from the first edition of the book): The Partition Problem. First…
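For reference, a compact Scala sketch of the usual DP formulation (my own variable names, not the book's): split s into k contiguous ranges so that the largest range sum is as small as possible, where m(i)(j) is the best achievable bound using the first i elements and j ranges.

def linearPartition(s: Array[Int], k: Int): Long = {
  val n = s.length
  val prefix = s.scanLeft(0L)(_ + _)              // prefix(i) = sum of the first i elements
  // m(i)(j) = minimal possible largest range sum for the first i elements in j ranges
  val m = Array.fill(n + 1, k + 1)(Long.MaxValue)
  for (i <- 1 to n) m(i)(1) = prefix(i)           // one range: the whole prefix
  for (j <- 2 to k; i <- j to n; x <- (j - 1) until i) {
    // last range covers elements x+1..i; the rest is the already-solved subproblem
    val candidate = math.max(m(x)(j - 1), prefix(i) - prefix(x))
    if (candidate < m(i)(j)) m(i)(j) = candidate
  }
  m(n)(k)
}

// Nine 1s split into 3 ranges: each range sums to 3, so the answer is 3.
println(linearPartition(Array(1, 1, 1, 1, 1, 1, 1, 1, 1), 3))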

Table with 80 million records: adding an index takes more than 18 hours (or forever)! Now what?

可紊 submitted on 2019-11-27 05:10:40
Question: A short recap of what happened. I am working with 71 million records (not much compared to the billions of records processed by others). On a different thread, someone suggested that the current setup of my cluster is not suitable for my needs. My table structure is:

CREATE TABLE `IPAddresses` (
  `id` int(11) unsigned NOT NULL auto_increment,
  `ipaddress` bigint(20) unsigned default NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM;

And I added the 71 million records and then did a: ALTER TABLE IPAddresses…

Write Spark dataframe as CSV with partitions

♀尐吖头ヾ submitted on 2019-11-27 04:49:59
Question: I'm trying to write a dataframe in Spark to an HDFS location, and I expect that if I add the partitionBy notation, Spark will create partition folders (similar to writing in Parquet format) of the form partition_column_name=partition_value (i.e. partition_date=2016-05-03). To do so, I ran the following command:

(df.write
   .partitionBy('partition_date')
   .mode('overwrite')
   .format("com.databricks.spark.csv")
   .save('/tmp/af_organic'))

but partition folders had not been created. Any idea what…
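For what it's worth, the external com.databricks.spark.csv package (Spark 1.x) did not, as far as I know, support partitionBy; the CSV writer built into Spark 2.x and later does. A minimal Scala sketch of the equivalent write under that assumption (the original snippet is PySpark):

// Assuming Spark 2.x+, the built-in CSV data source honors partitionBy and
// produces partition_date=<value> sub-folders under the target path.
df.write
  .partitionBy("partition_date")
  .mode("overwrite")
  .csv("/tmp/af_organic")     // built-in writer, no external spark-csv package needed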

Efficient way to divide a list into lists of n size

荒凉一梦 submitted on 2019-11-27 03:55:03
I have an array which I want to divide into smaller arrays of size n, and perform an operation on each. My current method of doing this is implemented with ArrayLists in Java (any pseudocode will do):

for (int i = 1; i <= Math.floor(A.size() / n); i++) {
    ArrayList temp = subArray(A, (i * n) - n, (i * n) - 1);
    // do stuff with temp
}

private ArrayList<Comparable> subArray(ArrayList A, int start, int end) {
    ArrayList toReturn = new ArrayList();
    for (int i = start; i <= end; i++) {
        toReturn.add(A.get(i));
    }
    return toReturn;
}

where A is the list and n is the size of the desired lists. I believe…
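Purely for comparison (the question's code is Java, and this is not the asker's code): in Scala the same chunking is built in. grouped(n) yields size-n slices, and the final slice is simply shorter when the length is not a multiple of n.

val a = (1 to 10).toList     // stand-in for the question's list A
val n = 3
a.grouped(n).foreach { temp =>
  println(temp)              // List(1, 2, 3), List(4, 5, 6), List(7, 8, 9), List(10)
}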

How can I divide/split up a matrix by rows between two other matrices?

|▌冷眼眸甩不掉的悲伤 submitted on 2019-11-27 03:35:31
Question: I have a matrix and a vector, each with 3000 rows:

fe = [-0.1850 -0.4485;
      -0.2150  2.6302;
      -0.2081  1.5883;
      -0.6416 -1.1924;
      -0.1188  1.3429;
      -0.2326 -2.2737;
      -0.0799  1.4821;
      ... ];   %# lots more rows

tar = [1; 1; 2; 1; 2; 1; 1; ... ];   %# lots more rows

I would like to divide up the rows of fe and tar such that 2/3 of them are placed into one set of variables and the remaining 1/3 are placed into a second set of variables. This is for…
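Outside MATLAB, the same idea is just a shuffled index split. A minimal Scala sketch with placeholder data standing in for fe and tar (the 2/3 vs 1/3 proportions come from the question; everything else is assumed):

import scala.util.Random

val fe  = Array.fill(3000, 2)(Random.nextGaussian())   // placeholder feature rows
val tar = Array.fill(3000)(1 + Random.nextInt(2))       // placeholder labels (1 or 2)

val idx = Random.shuffle((0 until fe.length).toList)    // random permutation of row indices
val cut = (fe.length * 2) / 3                           // first 2/3 of the shuffled indices
val (firstIdx, secondIdx) = idx.splitAt(cut)

// Select the same rows from both fe and tar so they stay aligned.
val feFirst   = firstIdx.map(i => fe(i)).toArray
val tarFirst  = firstIdx.map(i => tar(i)).toArray
val feSecond  = secondIdx.map(i => fe(i)).toArray
val tarSecond = secondIdx.map(i => tar(i)).toArray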

Spark lists all leaf nodes even in partitioned data

若如初见. submitted on 2019-11-27 02:39:47
Question: I have parquet data partitioned by date and hour, with this folder structure:

events_v3
  -- event_date=2015-01-01
    -- event_hour=2015-01-1
      -- part10000.parquet.gz
  -- event_date=2015-01-02
    -- event_hour=5
      -- part10000.parquet.gz

I have created a table raw_events via Spark, but when I try to query it, it scans all the directories for footers, and that slows down the initial query, even if I am querying only one day's worth of data.

Query: select * from raw_events where event_date='2016-01-01'

Similar problem: http…
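One common workaround (an assumption on my part, not necessarily what the asker ended up doing) is to read only the partition directories you need and pass basePath so the partition columns are still recovered; Spark then lists files for just that day instead of every leaf directory. A sketch, assuming a SparkSession named spark and a made-up HDFS root:

// Read a single day's partition directory; basePath tells Spark where the
// partitioned layout starts, so event_date/event_hour still come back as columns.
val oneDay = spark.read
  .option("basePath", "hdfs:///data/events_v3")
  .parquet("hdfs:///data/events_v3/event_date=2016-01-01")

oneDay.createOrReplaceTempView("raw_events_one_day")
spark.sql("select * from raw_events_one_day").show()

// For tables backed by the Hive metastore, enabling metastore partition pruning
// can also help push the event_date filter down before any file listing:
// spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")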