partitioning

How to control partition size in Spark SQL

Submitted by 旧巷老猫 on 2019-11-26 11:22:52
Question: I have a requirement to load data from a Hive table using Spark SQL HiveContext and write it to HDFS. By default, the DataFrame produced by the SQL query has 2 partitions. To get more parallelism I need more partitions out of the SQL. There is no overloaded method in HiveContext that takes a number-of-partitions parameter, and repartitioning the RDD causes a shuffle and adds processing time.

    val result = sqlContext.sql("select * from bt_st_ent")

has the log output:

    Starting task 0.0 in stage 131.0 (TID 297, aster1.com, partition 0, NODE_LOCAL, 2203 bytes)
    Starting task 1.0 in stage 131.0
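A minimal sketch of the two usual workarounds, assuming a Spark 1.x HiveContext as in the excerpt above (the partition count and output path are illustrative):

    // Option 1: raise spark.sql.shuffle.partitions before running the query.
    // This only helps when the SQL itself shuffles (joins, aggregations);
    // a plain "select *" scan keeps the input split count.
    sqlContext.setConf("spark.sql.shuffle.partitions", "200")
    val result = sqlContext.sql("select * from bt_st_ent")

    // Option 2: repartition the DataFrame explicitly. This does trigger one
    // shuffle, but gives direct control over downstream parallelism.
    val repartitioned = result.repartition(200)
    repartitioned.write.save("/user/hive/output/bt_st_ent")   // illustrative output path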

Partitioning in spark while reading from RDBMS via JDBC

Submitted by 青春壹個敷衍的年華 on 2019-11-26 10:56:55
Question: I am running Spark in cluster mode and reading data from an RDBMS via JDBC. As per the Spark docs, these partitioning parameters describe how to partition the table when reading in parallel from multiple workers: partitionColumn, lowerBound, upperBound, numPartitions. These are optional parameters. What happens if I don't specify them: does only one worker read the whole data? If it still reads in parallel, how does it partition the data? Answer 1: If you don't specify either { partitionColumn, lowerBound, upperBound, numPartitions } or { predicates }, Spark will use a single executor and create a single non-empty
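A sketch of the partitioned-read setup the answer refers to; the connection details, table, and column names are hypothetical, while the option names are those documented for the Spark JDBC data source:

    val df = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/mydb")
      .option("dbtable", "public.orders")
      .option("user", "spark")
      .option("password", "***")
      .option("partitionColumn", "order_id")  // numeric column used to split the scan
      .option("lowerBound", "1")              // minimum value of the split column
      .option("upperBound", "1000000")        // maximum value of the split column
      .option("numPartitions", "8")           // number of parallel queries / partitions
      .load()

    // With none of these options set, Spark runs a single query and the whole
    // table lands in one partition, i.e. one task reads all the data.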

Efficient way to divide a list into lists of n size

Submitted by 偶尔善良 on 2019-11-26 09:34:33
Question: I have an ArrayList which I want to divide into smaller lists of size n, performing an operation on each. My current way of doing this is implemented with ArrayLists in Java (any pseudocode will do):

    for (int i = 1; i <= Math.floor(A.size() / n); i++) {
        ArrayList temp = subArray(A, (i * n) - n, (i * n) - 1);
        // do stuff with temp
    }

    private ArrayList<Comparable> subArray(ArrayList A, int start, int end) {
        ArrayList toReturn = new ArrayList();
        for (int i = start; i <= end; i++) {
            toReturn.add(A.get(i));   // copy the elements in [start, end] into the chunk
        }
        return toReturn;
    }
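For comparison with the rest of this page, the same chunking is built into Scala's collections (illustrative only; the question itself is about Java):

    // grouped(n) splits a collection into consecutive chunks of size n;
    // the last chunk may be shorter if the size is not a multiple of n.
    val chunks: Iterator[List[Int]] = (1 to 10).toList.grouped(3)
    chunks.foreach(println)   // List(1, 2, 3), List(4, 5, 6), List(7, 8, 9), List(10)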

Default Partitioning Scheme in Spark

Submitted by 南楼画角 on 2019-11-26 07:47:20
Question: When I execute the commands below:

    scala> val rdd = sc.parallelize(List((1,2),(3,4),(3,6)),4).partitionBy(new HashPartitioner(10)).persist()
    rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[10] at partitionBy at <console>:22

    scala> rdd.partitions.size
    res9: Int = 10

    scala> rdd.partitioner.isDefined
    res10: Boolean = true

    scala> rdd.partitioner.get
    res11: org.apache.spark.Partitioner = org.apache.spark.HashPartitioner@a

it says that there are 10 partitions and that partitioning is done using
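A simplified sketch of the rule HashPartitioner applies to each key (it mirrors, but is not, the library source; null keys, which always go to partition 0, are ignored here, and the helper name is made up):

    // A key goes to partition nonNegativeMod(key.hashCode, numPartitions).
    def partitionOf(key: Any, numPartitions: Int): Int = {
      val rawMod = key.hashCode % numPartitions
      rawMod + (if (rawMod < 0) numPartitions else 0)
    }

    partitionOf(1, 10)   // 1
    partitionOf(3, 10)   // 3 -- so (3,4) and (3,6) from the example land in the same partition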

How to find all partitions of a set

Submitted by 为君一笑 on 2019-11-26 04:22:18
Question: I have a set of distinct values. I am looking for a way to generate all partitions of this set, i.e. all possible ways of dividing the set into subsets. For instance, the set {1, 2, 3} has the following partitions: { {1}, {2}, {3} }, { {1, 2}, {3} }, { {1, 3}, {2} }, { {1}, {2, 3} }, and { {1, 2, 3} }. As these are sets in the mathematical sense, order is irrelevant: { {1, 2}, {3} } is the same as { {3}, {2, 1} } and should not appear as a separate result. A thorough definition of set partitions
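A recursive sketch in Scala (the function name is made up): every partition of x :: rest is obtained by taking a partition of rest and either adding x to one of its existing blocks or putting x in a new block of its own.

    // Each partition is represented as a list of blocks (List[List[A]]).
    def partitions[A](xs: List[A]): List[List[List[A]]] = xs match {
      case Nil => List(Nil)   // the empty set has exactly one partition: no blocks
      case x :: rest =>
        partitions(rest).flatMap { p =>
          val withNewBlock  = List(x) :: p                                   // x alone in a new block
          val intoExisting  = p.indices.map(i => p.updated(i, x :: p(i))).toList  // x added to block i
          withNewBlock :: intoExisting
        }
    }

    partitions(List(1, 2, 3)).size   // 5, the Bell number B(3)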

How to optimize partitioning when migrating data from JDBC source?

Submitted by 耗尽温柔 on 2019-11-26 02:38:17
Question: I am trying to move data from a table in PostgreSQL to a Hive table on HDFS. To do that, I came up with the following code:

    val conf = new SparkConf().setAppName("Spark-JDBC")
      .set("spark.executor.heartbeatInterval", "120s")
      .set("spark.network.timeout", "12000s")
      .set("spark.sql.inMemoryColumnarStorage.compressed", "true")
      .set("spark.sql.orc.filterPushdown", "true")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer
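A sketch of the two knobs that usually matter for this kind of migration, the read-side JDBC split and the write-side layout. It assumes a Spark 2.x SparkSession rather than the SparkConf style above; database, table, and column names are placeholders:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("Spark-JDBC").enableHiveSupport().getOrCreate()

    // Read side: split the PostgreSQL scan into parallel range queries.
    val source = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://pg-host:5432/sourcedb")
      .option("dbtable", "public.big_table")
      .option("partitionColumn", "id")
      .option("lowerBound", "1")
      .option("upperBound", "50000000")
      .option("numPartitions", "24")
      .load()

    // Write side: repartition on the Hive partition column so each partition
    // directory gets a manageable number of ORC files.
    source.repartition(col("load_date"))
      .write
      .mode("overwrite")
      .format("orc")
      .partitionBy("load_date")
      .saveAsTable("hive_db.big_table")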

How to define partitioning of DataFrame?

Submitted by 时光怂恿深爱的人放手 on 2019-11-26 01:48:34
Question: I've started using Spark SQL and DataFrames in Spark 1.4.0. I want to define a custom partitioner on DataFrames, in Scala, but am not seeing how to do this. One of the data tables I'm working with contains a list of transactions by account, similar to the following example:

    Account  Date        Type      Amount
    1001     2014-04-01  Purchase   100.00
    1001     2014-04-01  Purchase    50.00
    1001     2014-04-05  Purchase    70.00
    1001     2014-04-01  Payment   -150.00
    1002     2014-04-01  Purchase    80.00
    1002     2014-04-02  Purchase    22.00
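DataFrames do not expose a custom Partitioner directly; below is a sketch of the two usual substitutes, assuming a DataFrame named transactions shaped like the table above (the name, the Int type of Account, and the partition count are assumptions, and repartition by column expression arrived in Spark releases later than the 1.4.0 mentioned in the question):

    import org.apache.spark.sql.functions.col
    import org.apache.spark.HashPartitioner

    // Co-locate each account's rows in one partition at the DataFrame level.
    val byAccount = transactions.repartition(col("Account"))

    // Or drop to the key/value RDD level when a real custom Partitioner is needed.
    val byAccountRdd = transactions.rdd
      .map(row => (row.getAs[Int]("Account"), row))   // assumes Account is an Int column
      .partitionBy(new HashPartitioner(100))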

How does HashPartitioner work?

Submitted by 故事扮演 on 2019-11-25 23:49:11
Question: I read the documentation of HashPartitioner, but unfortunately not much is explained beyond the API calls. My understanding is that HashPartitioner partitions the distributed data set based on the hash of the keys. For example, if my data is:

    (1,1), (1,2), (1,3), (2,1), (2,2), (2,3)

the partitioner would put these into different partitions, with identical keys falling in the same partition. However, I do not understand the significance of the constructor argument new HashPartitioner
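A small illustration of what the constructor argument does (it is simply the number of partitions to hash into); the example data is from the question, the partition count and inspection code are illustrative:

    import org.apache.spark.HashPartitioner

    val data = sc.parallelize(Seq((1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3)))

    // Each key is routed to partition nonNegativeMod(key.hashCode, numPartitions),
    // so all records with the same key end up in the same one of the 4 partitions.
    val partitioned = data.partitionBy(new HashPartitioner(4))

    // Show which partition each record landed in (fine for tiny data sets).
    partitioned.mapPartitionsWithIndex { (idx, iter) =>
      iter.map { case (k, v) => s"partition $idx -> ($k,$v)" }
    }.collect().foreach(println)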