partitioning

In Apache Spark, why does RDD.union not preserve the partitioner?

我是研究僧i submitted on 2019-11-27 13:11:05
As everyone knows, partitioners in Spark have a huge performance impact on any "wide" operation, so they are usually customized for such operations. I was experimenting with the following code:

val rdd1 = sc.parallelize(1 to 50).keyBy(_ % 10)
  .partitionBy(new HashPartitioner(10))
val rdd2 = sc.parallelize(200 to 230).keyBy(_ % 13)
val cogrouped = rdd1.cogroup(rdd2)
println("cogrouped: " + cogrouped.partitioner)
val unioned = rdd1.union(rdd2)
println("union: " + unioned.partitioner)

I see that by default cogroup() always yields an RDD with the customized partitioner, but union() doesn't; it will always…
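For context (not part of the original post): in reasonably recent Spark versions, union preserves a partitioner only when every input RDD already has the same partitioner; otherwise it falls back to a plain concatenation with no partitioner at all. A minimal Scala sketch of the preserving case, reusing the question's sc:

import org.apache.spark.HashPartitioner

// Give both RDDs the very same partitioner, so union can keep it.
val p = new HashPartitioner(10)
val a = sc.parallelize(1 to 50).keyBy(_ % 10).partitionBy(p)
val b = sc.parallelize(200 to 230).keyBy(_ % 13).partitionBy(p)

// Prints Some(org.apache.spark.HashPartitioner@...): the partitioner survives.
// With different (or missing) partitioners, as in the question, this prints None.
println(a.union(b).partitioner)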

Database - Designing an “Events” Table

▼魔方 西西 submitted on 2019-11-27 09:52:23
Question: After reading the tips from this great Nettuts+ article, I've come up with a table schema that separates highly volatile data from other tables subject to heavy reads, while also lowering the number of tables needed in the whole database schema. However, I'm not sure this is a good idea, since it doesn't follow the rules of normalization, and I would like to hear your advice. Here is the general idea: I have four types of users modeled in a Class Table Inheritance structure, in the…

Why does partition elimination not happen for this query?

两盒软妹~` submitted on 2019-11-27 08:53:21
Question: I have a Hive table which is partitioned by year, month, day and hour. I need to run a query against it to fetch the last 7 days of data. This is on Hive 0.14.0.2.2.4.2-2. My query currently looks like this:

SELECT COUNT(column_name) FROM table_name
WHERE year >= year(date_sub(from_unixtime(unix_timestamp()), 7))
  AND month >= month(date_sub(from_unixtime(unix_timestamp()), 7))
  AND day >= day(date_sub(from_unixtime(unix_timestamp()), 7));

This takes a very long time. When I substitute the…
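A likely culprit (my reading, not stated in the excerpt) is that unix_timestamp() is non-deterministic, so Hive cannot fold the predicate into constants and therefore cannot prune partitions. A sketch of the usual workaround: compute the cutoff on the client and issue only literal comparisons. It is written as Spark/Scala against a hypothetical SparkSession named spark (the question runs plain Hive); the table and column names come from the question:

import java.time.LocalDate

// Compute the boundary date outside the query so the partition predicate
// contains only literals, which lets Hive prune partitions instead of scanning all.
val cutoff = LocalDate.now().minusDays(7)
val query =
  s"""SELECT COUNT(column_name) FROM table_name
     |WHERE year  >= ${cutoff.getYear}
     |  AND month >= ${cutoff.getMonthValue}
     |  AND day   >= ${cutoff.getDayOfMonth}""".stripMargin

// Note: the AND-of-ranges logic mirrors the question's query and still breaks
// across month/year boundaries; it is kept only to illustrate the pruning point.
spark.sql(query).show()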

Determining optimal number of Spark partitions based on workers, cores and DataFrame size

让人想犯罪 __ submitted on 2019-11-27 06:50:31
There are several similar-yet-different concepts in Spark-land surrounding how work gets farmed out to different nodes and executed concurrently. Specifically, there are:

- the Spark driver node (sparkDriverCount)
- the number of worker nodes available to a Spark cluster (numWorkerNodes)
- the number of Spark executors (numExecutors)
- the DataFrame being operated on by all workers/executors, concurrently (dataFrame)
- the number of rows in the dataFrame (numDFRows)
- the number of partitions of the dataFrame (numPartitions)
- and finally, the number of CPU cores available on each worker node (…
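As a rough illustration (a common rule of thumb, not a formula from the question): aim for a small multiple of the total core count, so every core has work and stragglers are amortized. All concrete numbers below are assumptions, and dataFrame stands for the question's DataFrame:

// Roughly 2-4 tasks per available core is the usual guideline.
val numWorkerNodes    = 4      // assumed cluster size, for illustration only
val numCoresPerWorker = 8
val tasksPerCore      = 3      // somewhere in the typical 2-4 range

val targetPartitions = numWorkerNodes * numCoresPerWorker * tasksPerCore   // 96

val repartitioned = dataFrame.repartition(targetPartitions)
println(repartitioned.rdd.getNumPartitions)   // confirm the new partition count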

How to understand the dynamic programming solution in linear partitioning?

烂漫一生 submitted on 2019-11-27 06:12:50
I'm struggling to understand the dynamic programming solution to the linear partitioning problem. I am reading The Algorithm Design Manual and the problem is described in section 8.5. I've read the section countless times but I'm just not getting it. I think it's a poor explanation (what I've read up to now has been much better), but I've not been able to understand the problem well enough to look for an alternative explanation. Links to better explanations are welcome! I've found a page with text similar to the book (maybe from the first edition of the book): The Partition Problem. First…
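For reference, a compact Scala sketch of the usual DP formulation (my own variable names, not the book's): split s into k contiguous ranges so that the largest range sum is as small as possible, where m(i)(j) is the best achievable bound using the first i elements and j ranges.

def linearPartition(s: Array[Int], k: Int): Long = {
  val n = s.length
  val prefix = s.scanLeft(0L)(_ + _)              // prefix(i) = sum of the first i elements
  // m(i)(j) = minimal possible largest range sum for the first i elements in j ranges
  val m = Array.fill(n + 1, k + 1)(Long.MaxValue)
  for (i <- 1 to n) m(i)(1) = prefix(i)           // one range: the whole prefix
  for (j <- 2 to k; i <- j to n; x <- (j - 1) until i) {
    // last range covers elements x+1..i; the rest is the already-solved subproblem
    val candidate = math.max(m(x)(j - 1), prefix(i) - prefix(x))
    if (candidate < m(i)(j)) m(i)(j) = candidate
  }
  m(n)(k)
}

// Nine 1s split into 3 ranges: each range sums to 3, so the answer is 3.
println(linearPartition(Array(1, 1, 1, 1, 1, 1, 1, 1, 1), 3))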

Table with 80 million records: adding an index takes more than 18 hours (or forever)! Now what?

可紊 submitted on 2019-11-27 05:10:40
Question: A short recap of what happened. I am working with 71 million records (not much compared to the billions of records processed by others). On a different thread, someone suggested that the current setup of my cluster is not suitable for my needs. My table structure is:

CREATE TABLE `IPAddresses` (
  `id` int(11) unsigned NOT NULL auto_increment,
  `ipaddress` bigint(20) unsigned default NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM;

And I added the 71 million records and then did a: ALTER TABLE IPAddresses…

Write Spark dataframe as CSV with partitions

♀尐吖头ヾ submitted on 2019-11-27 04:49:59
Question: I'm trying to write a dataframe in Spark to an HDFS location, and I expect that if I add the partitionBy notation, Spark will create partition folders (similar to writing in Parquet format) of the form partition_column_name=partition_value (i.e. partition_date=2016-05-03). To do so, I ran the following command:

(df.write
   .partitionBy('partition_date')
   .mode('overwrite')
   .format("com.databricks.spark.csv")
   .save('/tmp/af_organic'))

but partition folders had not been created. Any idea what…
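For what it's worth, the external com.databricks.spark.csv package (Spark 1.x) did not, as far as I know, support partitionBy; the CSV writer built into Spark 2.x and later does. A minimal Scala sketch of the equivalent write under that assumption (the original snippet is PySpark):

// Assuming Spark 2.x+, the built-in CSV data source honors partitionBy and
// produces partition_date=<value> sub-folders under the target path.
df.write
  .partitionBy("partition_date")
  .mode("overwrite")
  .csv("/tmp/af_organic")     // built-in writer, no external spark-csv package needed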

Efficient way to divide a list into lists of n size

荒凉一梦 submitted on 2019-11-27 03:55:03
I have an array which I want to divide into smaller arrays of size n, and perform an operation on each. My current method of doing this is implemented with ArrayLists in Java (any pseudocode will do):

for (int i = 1; i <= Math.floor(A.size() / n); i++) {
    ArrayList temp = subArray(A, (i * n) - n, (i * n) - 1);
    // do stuff with temp
}

private ArrayList<Comparable> subArray(ArrayList A, int start, int end) {
    ArrayList toReturn = new ArrayList();
    for (int i = start; i <= end; i++) {
        toReturn.add(A.get(i));
    }
    return toReturn;
}

where A is the list and n is the size of the desired lists. I believe…
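Purely for comparison (the question's code is Java, and this is not the asker's code): in Scala the same chunking is built in. grouped(n) yields size-n slices, and the final slice is simply shorter when the length is not a multiple of n.

val a = (1 to 10).toList     // stand-in for the question's list A
val n = 3
a.grouped(n).foreach { temp =>
  println(temp)              // List(1, 2, 3), List(4, 5, 6), List(7, 8, 9), List(10)
}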

How can I divide/split up a matrix by rows between two other matrices?

|▌冷眼眸甩不掉的悲伤 submitted on 2019-11-27 03:35:31
Question: I have a matrix and a vector, each with 3000 rows:

fe = [-0.1850 -0.4485;
      -0.2150  2.6302;
      -0.2081  1.5883;
      -0.6416 -1.1924;
      -0.1188  1.3429;
      -0.2326 -2.2737;
      -0.0799  1.4821;
      ... ];   %# lots more rows

tar = [1; 1; 2; 1; 2; 1; 1; ... ];   %# lots more rows

I would like to divide up the rows of fe and tar such that 2/3 of them are placed into one set of variables and the remaining 1/3 are placed into a second set of variables. This is for…
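Outside MATLAB, the same idea is just a shuffled index split. A minimal Scala sketch with placeholder data standing in for fe and tar (the 2/3 vs 1/3 proportions come from the question; everything else is assumed):

import scala.util.Random

val fe  = Array.fill(3000, 2)(Random.nextGaussian())   // placeholder feature rows
val tar = Array.fill(3000)(1 + Random.nextInt(2))       // placeholder labels (1 or 2)

val idx = Random.shuffle((0 until fe.length).toList)    // random permutation of row indices
val cut = (fe.length * 2) / 3                           // first 2/3 of the shuffled indices
val (firstIdx, secondIdx) = idx.splitAt(cut)

// Select the same rows from both fe and tar so they stay aligned.
val feFirst   = firstIdx.map(i => fe(i)).toArray
val tarFirst  = firstIdx.map(i => tar(i)).toArray
val feSecond  = secondIdx.map(i => fe(i)).toArray
val tarSecond = secondIdx.map(i => tar(i)).toArray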

Spark lists all leaf nodes even in partitioned data

若如初见. submitted on 2019-11-27 02:39:47
Question: I have parquet data partitioned by date and hour, with this folder structure:

events_v3
  -- event_date=2015-01-01
    -- event_hour=2015-01-1
      -- part10000.parquet.gz
  -- event_date=2015-01-02
    -- event_hour=5
      -- part10000.parquet.gz

I have created a table raw_events via Spark, but when I try to query it, it scans all the directories for footers, and that slows down the initial query, even if I am querying only one day's worth of data.

Query: select * from raw_events where event_date='2016-01-01'

Similar problem: http…
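One common workaround (an assumption on my part, not necessarily what the asker ended up doing) is to read only the partition directories you need and pass basePath so the partition columns are still recovered; Spark then lists files for just that day instead of every leaf directory. A sketch, assuming a SparkSession named spark and a made-up HDFS root:

// Read a single day's partition directory; basePath tells Spark where the
// partitioned layout starts, so event_date/event_hour still come back as columns.
val oneDay = spark.read
  .option("basePath", "hdfs:///data/events_v3")
  .parquet("hdfs:///data/events_v3/event_date=2016-01-01")

oneDay.createOrReplaceTempView("raw_events_one_day")
spark.sql("select * from raw_events_one_day").show()

// For tables backed by the Hive metastore, enabling metastore partition pruning
// can also help push the event_date filter down before any file listing:
// spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")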