partitioning

Does Spark maintain parquet partitioning on read?

Submitted by 时光怂恿深爱的人放手 on 2019-12-18 15:28:41

Question: I am having a lot of trouble finding the answer to this question. Let's say I write a DataFrame to parquet, and I use repartition combined with partitionBy to get a nicely partitioned parquet file:

df.repartition(col("DATE")).write.partitionBy("DATE").parquet("/path/to/parquet/file")

Later on I would like to read the parquet file back, so I do something like this:

val df = spark.read.parquet("/path/to/parquet/file")

Is the DataFrame partitioned by "DATE"? In other words, if a parquet…
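The excerpt is cut off, but here is a minimal sketch of one way to probe what survives the round trip; the explain() checks, the reread name, and the DATE filter value are illustrative additions, not from the original post:

// Hedged sketch: what does Spark know about the data after reading it back?
import org.apache.spark.sql.functions.col

df.repartition(col("DATE")).write.partitionBy("DATE").parquet("/path/to/parquet/file")

val reread = spark.read.parquet("/path/to/parquet/file")

// A filter on the partition column can be pruned down to the matching DATE=... directories
// (the date value here is made up).
reread.filter(col("DATE") === "2018-01-01").explain()

// If an Exchange still shows up in this plan, the read-back DataFrame does not
// carry the "already partitioned by DATE" information from the write.
reread.groupBy("DATE").count().explain()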

How to create a PostgreSQL partitioned sequence?

Submitted by 試著忘記壹切 on 2019-12-18 06:31:07

Question: Is there a simple (i.e. non-hacky) and race-condition-free way to create a partitioned sequence in PostgreSQL?

Example, using a normal sequence in Issue:

| Project_ID | Issue |
| 1          | 1     |
| 1          | 2     |
| 2          | 3     |
| 2          | 4     |

Using a partitioned sequence in Issue:

| Project_ID | Issue |
| 1          | 1     |
| 1          | 2     |
| 2          | 1     |
| 2          | 2     |

Answer 1: I do not believe there is a simple way that is as easy as regular sequences, because: a sequence stores only one number stream (next value, etc.); you want one for each…

SQL Error: ORA-14006: invalid partition name

Submitted by 允我心安 on 2019-12-18 03:47:16

Question: I am trying to partition an existing table in Oracle 12c R1 using the SQL statement below:

ALTER TABLE TABLE_NAME MODIFY
  PARTITION BY RANGE (DATE_COLUMN_NAME)
  INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
  (
    PARTITION part_01 VALUES LESS THAN (TO_DATE('01-SEP-2017', 'DD-MON-RRRR'))
  ) ONLINE;

I am getting this error:

Error report -
SQL Error: ORA-14006: invalid partition name
14006. 00000 - "invalid partition name"
*Cause:  a partition name of the form <identifier> is expected but not present.
*Action: enter an…

Hive doesn't read partitioned parquet files generated by Spark

Submitted by £可爱£侵袭症+ on 2019-12-18 01:15:14

Question: I'm having a problem reading partitioned parquet files generated by Spark in Hive. I'm able to create the external table in Hive, but when I try to select a few rows, Hive returns only an "OK" message with no rows. I'm able to read the partitioned parquet files correctly in Spark, so I'm assuming they were generated correctly. I'm also able to read these files when I create an external table in Hive without partitioning. Does anyone have a suggestion? My environment is: Cluster EMR 4.1.0…
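One commonly suggested cause of this symptom (an assumption here, since the excerpt is truncated and shows no resolution) is that the partition directories written by Spark were never registered with the Hive metastore. A minimal sketch, assuming a Hive-enabled Spark SQL session and a hypothetical table name my_table:

// Hedged sketch: ask the metastore to discover the partition directories
// that Spark wrote under the table's location. The table name is hypothetical.
spark.sql("MSCK REPAIR TABLE my_table")

// The same can be done one partition at a time from the Hive CLI, e.g.
// (partition column and location are illustrative):
//   ALTER TABLE my_table ADD PARTITION (year='2015') LOCATION '/path/to/parquet/year=2015';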

How can I ensure that a partition has representative observations from each level of a factor?

Submitted by 放肆的年华 on 2019-12-17 19:26:00

Question: I wrote a small function to partition my dataset into training and testing sets. However, I am running into trouble when dealing with factor variables. In the model-validation phase of my code, I get an error if the model was built on a dataset that doesn't have representation from each level of a factor. How can I fix this partition() function so that it includes at least one observation from every level of a factor variable?

test.df <- data.frame(a = sample(c(0,1), 100, rep = T), b = factor(sample…

Does Spark know the partitioning key of a DataFrame?

Submitted by 匆匆过客 on 2019-12-17 17:46:10

Question: I want to know whether Spark knows the partitioning key of the parquet file and uses this information to avoid shuffles.

Context: I'm running Spark 2.0.1 with a local SparkSession. I have a csv dataset that I am saving as a parquet file on my disk like so:

val df0 = spark
  .read
  .format("csv")
  .option("header", true)
  .option("delimiter", ";")
  .option("inferSchema", false)
  .load("SomeFile.csv")

val df = df0.repartition(partitionExprs = col("numerocarte"), numPartitions = 42)

df.write
  .mode(SaveMode…
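A minimal sketch of one way to check within this session, reusing the df defined above; the getNumPartitions and explain calls are illustrative additions, and whether an Exchange appears in the printed plan is what answers the question:

// In-memory partition count right after the repartition (expected: 42).
println(df.rdd.getNumPartitions)

// If the physical plan for a grouping on the same key still contains an
// Exchange (i.e. a shuffle), the existing partitioning was not reused.
df.groupBy("numerocarte").count().explain()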

Is there an efficient algorithm for integer partitioning with restricted number of parts?

Submitted by 爷，独闯天下 on 2019-12-17 16:09:47

Question: I have to create a method that takes two integers, call them n and m, and returns how many ways there are to sum m positive numbers to get n. For example, a method call like partition(6, 2) should return 3 because there are 3 possible ways: 5 + 1, 4 + 2, and 3 + 3. By the way, 4 + 2 is the same as 2 + 4, so the method should not count them as two distinct variations. Does anybody know a solution to the problem? Update: n and m are not greater than 150.

Answer 1: recursive…
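The answer is cut off at "recursive…". For reference, here is a minimal memoized sketch of the standard recurrence p(n, m) = p(n - 1, m - 1) + p(n - m, m) for counting the ways to write n as a sum of exactly m positive integers with order ignored; this is one common recursive approach, not necessarily the one the truncated answer goes on to describe, and all names are illustrative:

// Memoized recursion; n and m are stated to be at most 150, so the table stays small.
val memo = scala.collection.mutable.Map.empty[(Int, Int), BigInt]

def partition(n: Int, m: Int): BigInt =
  if (n == 0 && m == 0) BigInt(1)                 // one way to write 0 as a sum of 0 parts
  else if (n <= 0 || m <= 0 || m > n) BigInt(0)   // otherwise impossible at the boundaries
  else memo.getOrElseUpdate((n, m),
    partition(n - 1, m - 1) +                     // case: the smallest part is exactly 1
    partition(n - m, m))                          // case: every part is at least 2, so subtract 1 from each

println(partition(6, 2))  // 3, matching 5+1, 4+2, 3+3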

Split a list of numbers into n chunks such that the chunks have (close to) equal sums and keep the original order

Submitted by ◇◆丶佛笑我妖孽 on 2019-12-17 10:54:12

Question: This is not the standard partitioning problem, as I need to maintain the order of the elements in the list. For example, if I have the list [1, 6, 2, 3, 4, 1, 7, 6, 4] and I want two chunks, then the split should give [[1, 6, 2, 3, 4, 1], [7, 6, 4]] for a sum of 17 on each side. For three chunks the result would be [[1, 6, 2, 3], [4, 1, 7], [6, 4]] for sums of 12, 12, and 10.

Edit, for additional explanation: I currently divide the sum by the number of chunks and use that as a target, then…
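A minimal sketch of the greedy idea described in the edit (divide the total by the number of chunks and cut each time the running sum reaches that target); the function and variable names are illustrative, and the sketch is written in Scala rather than in the original poster's language:

def splitPreservingOrder(xs: List[Int], nChunks: Int): List[List[Int]] = {
  val target = xs.sum.toDouble / nChunks
  val (closed, last) = xs.foldLeft((List.empty[List[Int]], List.empty[Int])) {
    case ((done, current), x) =>
      val cur = current :+ x
      // Close the current chunk once it reaches the per-chunk target,
      // unless it is the final chunk, which simply takes whatever remains.
      if (cur.sum >= target && done.size < nChunks - 1) (done :+ cur, Nil)
      else (done, cur)
  }
  if (last.nonEmpty) closed :+ last else closed
}

println(splitPreservingOrder(List(1, 6, 2, 3, 4, 1, 7, 6, 4), 2))
// List(List(1, 6, 2, 3, 4, 1), List(7, 6, 4))        sums: 17 and 17
println(splitPreservingOrder(List(1, 6, 2, 3, 4, 1, 7, 6, 4), 3))
// List(List(1, 6, 2, 3), List(4, 1, 7), List(6, 4))  sums: 12, 12, 10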

In Apache Spark, why does RDD.union not preserve the partitioner?

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-12-17 10:41:10

Question: As everyone knows, partitioners in Spark have a huge performance impact on any "wide" operations, so they are usually customized. I was experimenting with the following code:

val rdd1 = sc.parallelize(1 to 50).keyBy(_ % 10)
  .partitionBy(new HashPartitioner(10))
val rdd2 = sc.parallelize(200 to 230).keyBy(_ % 13)

val cogrouped = rdd1.cogroup(rdd2)
println("cogrouped: " + cogrouped.partitioner)

val unioned = rdd1.union(rdd2)
println("union: " + unioned.partitioner)

I see that by…
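For comparison, here is a small sketch (an illustrative addition, not from the original post) of the case where both inputs share the same partitioner; union is reported to keep a partitioner only when the partitioners of all inputs are defined and equal:

import org.apache.spark.HashPartitioner

// Give rdd2 the same HashPartitioner(10) that rdd1 already has before the union.
val rdd2Partitioned = sc.parallelize(200 to 230).keyBy(_ % 13)
  .partitionBy(new HashPartitioner(10))

val unionedSame = rdd1.union(rdd2Partitioned)
println("union (same partitioner): " + unionedSame.partitioner)
// Expected to be defined (Some(...)), unlike the None printed by the original experiment.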

Why does sortBy transformation trigger a Spark job?

Submitted by 不打扰是莪最后的温柔 on 2019-12-17 09:55:46

Question: As per the Spark documentation, only RDD actions can trigger a Spark job; transformations are lazily evaluated until an action is called on them. Yet I see that the sortBy transformation is applied immediately, and it shows up as a job in the Spark UI. Why?

Answer 1: sortBy is implemented using sortByKey, which depends on a RangePartitioner (JVM) or a partitioning function (Python). When you call sortBy / sortByKey, the partitioner (partitioning function) is initialized eagerly and samples the input RDD…
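A small illustration of where that eager work shows up, assuming a plain SparkContext named sc; the point is only that the job visible in the UI comes from the sampling done to build the RangePartitioner, not from the sort itself:

val rdd = sc.parallelize(1 to 1000000)

// Calling the transformation already runs a small sampling job (visible in the Spark UI),
// because the RangePartitioner samples the data to compute its range bounds.
val sorted = rdd.sortBy(identity)

// The sort itself still only runs when an action is called.
sorted.take(5)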