partitioning

Why does partition elimination not happen for this query?

我是研究僧i submitted on 2019-11-28 14:37:26
I have a Hive table which is partitioned by year, month, day and hour. I need to run a query against it to fetch the last 7 days of data. This is in Hive 0.14.0.2.2.4.2-2. My query currently looks like this:

SELECT COUNT(column_name) FROM table_name
WHERE year >= year(date_sub(from_unixtime(unix_timestamp()), 7))
  AND month >= month(date_sub(from_unixtime(unix_timestamp()), 7))
  AND day >= day(date_sub(from_unixtime(unix_timestamp()), 7));

This takes a very long time. When I substitute the actual numbers for the above, say something like: SELECT COUNT(column_name) from table_name where year >=
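The excerpt is cut off, but the direction it hints at can be sketched: unix_timestamp() is evaluated at runtime, so Hive 0.14 likely cannot fold these predicates into constants and ends up scanning every partition, whereas literal values can be pruned. A minimal Python sketch of building the query with literals on the client side, assuming integer partition columns named year/month/day (it also sidesteps the fact that separate year >= / month >= / day >= comparisons do not really express "last 7 days" across month boundaries):

```python
# Hypothetical helper: compute the last 7 dates in the client and emit literal
# (year, month, day) tuples so the Hive planner sees constants it can prune on.
# Table and column names are taken from the question; the rest is an assumption.
from datetime import date, timedelta

def last_7_days_query(table="table_name", column="column_name", today=None):
    today = today or date.today()
    days = [today - timedelta(days=i) for i in range(7)]
    preds = " OR ".join(
        f"(year = {d.year} AND month = {d.month} AND day = {d.day})" for d in days
    )
    return f"SELECT COUNT({column}) FROM {table} WHERE {preds};"

print(last_7_days_query())
```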

MySQL Partitioning / Sharding / Splitting - which way to go?

耗尽温柔 submitted on 2019-11-28 14:37:00
Question: We have an InnoDB database that is about 70 GB, and we expect it to grow to several hundred GB in the next 2 to 3 years. About 60% of the data belongs to a single table. Currently the database is working quite well, as we have a server with 64 GB of RAM, so almost the whole database fits into memory, but we're concerned about the future, when the amount of data will be considerably larger. Right now we're considering some way of splitting up the tables (especially the one that accounts for the
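Not part of the original excerpt: of the three options in the title, sharding means routing rows of the big table to one of several database servers by a sharding key. A minimal sketch of that routing idea, with hypothetical connection strings and key names:

```python
# Minimal sharding sketch (all names hypothetical): pick one of N MySQL
# databases by hashing the sharding key. Range- or directory-based routing
# are common alternatives when rebalancing matters.
import hashlib

SHARD_DSNS = [
    "mysql://db0.example.internal/app",
    "mysql://db1.example.internal/app",
    "mysql://db2.example.internal/app",
    "mysql://db3.example.internal/app",
]

def shard_for(key: str) -> str:
    """Stable shard choice as long as the shard count stays fixed."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARD_DSNS[int(digest, 16) % len(SHARD_DSNS)]

# Example: every row belonging to customer 42 lands in the same database.
print(shard_for("customer:42"))
```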

How can I ensure that a partition has representative observations from each level of a factor?

时光总嘲笑我的痴心妄想 submitted on 2019-11-28 10:28:39
I wrote a small function to partition my dataset into training and testing sets. However, I am running into trouble when dealing with factor variables. In the model validation phase of my code, I get an error if the model was built on a dataset that doesn't have representation from each level of a factor. How can I fix this partition() function to include at least one observation from every level of a factor variable?

test.df <- data.frame(a = sample(c(0,1), 100, rep = T),
                      b = factor(sample(letters, 100, rep = T)),
                      c = factor(sample(c("apple", "orange"), 100, rep = T)))
set.seed(123)
partition
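The question is about R, but the underlying idea is a stratified split: sample within each factor level so every level is guaranteed to appear in the training partition. A sketch of that idea in Python/pandas (not the asker's partition() function; all names here are illustrative):

```python
# Stratified train/test split sketch: take at least one row from every level
# of the chosen factor column into the training set, the rest goes to test.
import pandas as pd

def stratified_split(df: pd.DataFrame, factor_col: str,
                     train_frac: float = 0.8, seed: int = 123):
    train_parts = []
    for _, group in df.groupby(factor_col):
        # At least one row per level, otherwise train_frac of the level.
        n = max(1, int(round(len(group) * train_frac)))
        train_parts.append(group.sample(n=n, random_state=seed))
    train = pd.concat(train_parts)
    test = df.drop(train.index)
    return train, test
```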

which algorithm can do a stable in-place binary partition with only O(N) moves?

帅比萌擦擦* submitted on 2019-11-28 08:19:53
I'm trying to understand this paper: Stable minimum space partitioning in linear time. It seems that a critical part of the claim is that Algorithm B sorts a bit-array of size n stably in O(n log₂ n) time and constant extra space, but makes only O(n) moves. However, the paper doesn't describe the algorithm, but only references another paper which I don't have access to. I can find several ways to do the sort within the time bounds, but I'm having trouble finding one that guarantees O(N) moves without also requiring more than constant space. What is this Algorithm B? In other words, given
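For contrast only (this is not the paper's Algorithm B, just the trivial baseline the question is implicitly comparing against): with O(n) extra space, a stable binary partition with O(n) moves is easy, which is exactly why dropping the buffer while keeping the move bound is the hard part. A Python sketch:

```python
# Trivial baseline (NOT Algorithm B): stable binary partition with O(n) moves
# but O(n) extra space for the two buffers. The open question in the post is
# achieving the same move bound with only constant extra space.
def stable_partition_with_buffer(items, pred):
    left = [x for x in items if pred(x)]       # relative order preserved
    right = [x for x in items if not pred(x)]  # relative order preserved
    items[:] = left + right                    # each element moved O(1) times
    return len(left)                           # index of first element failing pred

data = [5, 2, 8, 1, 9, 4]
split = stable_partition_with_buffer(data, lambda x: x % 2 == 0)
print(data, split)  # evens first, both groups keep their original order
```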

Does Spark know the partitioning key of a DataFrame?

做~自己de王妃 submitted on 2019-11-28 06:33:33
I want to know if Spark knows the partitioning key of the parquet file and uses this information to avoid shuffles. Context: running Spark 2.0.1 with a local SparkSession. I have a csv dataset that I am saving as a parquet file on my disk like so:

val df0 = spark
  .read
  .format("csv")
  .option("header", true)
  .option("delimiter", ";")
  .option("inferSchema", false)
  .load("SomeFile.csv")

val df = df0.repartition(partitionExprs = col("numerocarte"), numPartitions = 42)

df.write
  .mode(SaveMode.Overwrite)
  .format("parquet")
  .option("inferSchema", false)
  .save("SomeFile.parquet")

I am creating 42
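Not from the original post: one way to see what Spark actually knows is to read the parquet back and inspect the physical plan of an aggregation on that column. If an Exchange node appears, Spark is shuffling, i.e. the earlier repartition() was not carried through the parquet files. A PySpark sketch reusing the file and column names from the question:

```python
# Sketch: check whether Spark inserts a shuffle ("Exchange") when grouping by
# the column that the data was repartitioned on before writing.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("partition-check").getOrCreate()

df = spark.read.parquet("SomeFile.parquet")
df.groupBy("numerocarte").count().explain()
# If the plan shows "Exchange hashpartitioning(numerocarte, ...)", Spark is
# shuffling: parquet does not record the DataFrame's partitioner, so the
# repartition() done before the write is not known after reading back.
```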

How to handle id generation on a hadoop cluster?

試著忘記壹切 submitted on 2019-11-28 05:59:09
Question: I am building a dictionary on a Hadoop cluster and need to generate a numeric id for each token. How should I do it? Answer 1: You have two problems. First you want to make sure that you assign exactly one id to each token. To do that you should sort and group the records by token and make the assignment in a reducer. Once you've made sure that the reducer method is called exactly once for each token, you can use the partition number from the context and a unique numeric id maintained by the reducer
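A sketch of the id scheme the answer describes, written as a plain Python function rather than actual Hadoop reducer code; partition_id and num_partitions are assumed to come from the reducer's context:

```python
# Composite-id sketch: each reducer hands out ids from its own arithmetic
# progression, so ids are globally unique without any coordination.
def assign_token_ids(sorted_tokens, partition_id, num_partitions):
    counter = 0
    for token in sorted_tokens:   # tokens arrive grouped/sorted, one per reduce call
        token_id = counter * num_partitions + partition_id
        counter += 1
        yield token, token_id

# With 4 reducers, partition 1 emits ids 1, 5, 9, ...; partition 2 emits 2, 6, 10, ...
for tok, tid in assign_token_ids(["apple", "banana", "cherry"],
                                 partition_id=1, num_partitions=4):
    print(tok, tid)
```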

Partition Hive table by existing field?

南笙酒味 submitted on 2019-11-28 03:55:23
Question: Can I partition a Hive table upon insert by an existing field? I have a 10 GB file with a date field and an hour-of-day field. Can I load this file into a table, then insert-overwrite into another partitioned table that uses those fields as a partition? Would something like the following work?

INSERT OVERWRITE TABLE tealeaf_event PARTITION(dt=evt.datestring, hour=evt.hour)
SELECT * FROM staging_event evt;

Thanks! Travis
Answer 1: I just ran across this trying to answer the same question and it was
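The answer is truncated above, so for reference only: Hive's dynamic-partition insert has a specific shape. The PARTITION clause names the partition columns without values, the SELECT supplies them as its last columns, and dynamic partitioning must be enabled. A sketch issued through a Hive-enabled PySpark session for concreteness (the same HiveQL runs in the Hive CLI); the non-partition column names are placeholders:

```python
# Sketch of a Hive dynamic-partition insert. Table and partition column names
# come from the question; col1/col2 are assumed placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

spark.sql("""
    INSERT OVERWRITE TABLE tealeaf_event PARTITION (dt, hour)
    SELECT col1, col2, evt.datestring AS dt, evt.hour AS hour
    FROM staging_event evt
""")
```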

Table with 80 million records and adding an index takes more than 18 hours (or forever)! Now what?

坚强是说给别人听的谎言 submitted on 2019-11-28 03:41:28
A short recap of what happened. I am working with 71 million records (not much compared to the billions of records processed by others). On a different thread, someone suggested that the current setup of my cluster is not suitable for my needs. My table structure is:

CREATE TABLE `IPAddresses` (
  `id` int(11) unsigned NOT NULL auto_increment,
  `ipaddress` bigint(20) unsigned default NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM;

I added the 71 million records and then did a:

ALTER TABLE IPAddresses ADD INDEX(ipaddress);

It's been 14 hours and the operation is still not completed. Upon Googling, I

Handling very large data with mysql

穿精又带淫゛_ submitted on 2019-11-28 03:09:25
Sorry for the long post! I have a database containing ~30 tables (InnoDB engine). Only two of these tables, namely "transaction" and "shift", are quite large (the first one has 1.5 million rows and shift has 23k rows). Everything works fine now and I don't have a problem with the current database size. However, we will have a similar database (same datatypes, design, ...) but much larger, e.g., the "transaction" table will have about 1 billion records (about 2.3 million transactions per day), and we are thinking about how we should deal with such a volume of data in MySQL. (It is both read and write

Write Spark dataframe as CSV with partitions

限于喜欢 submitted on 2019-11-28 01:55:16
I'm trying to write a dataframe in Spark to an HDFS location, and I expect that if I add the partitionBy notation, Spark will create partition folders (similar to writing in Parquet format) in the form partition_column_name=partition_value (i.e. partition_date=2016-05-03). To do so, I ran the following command:

(df.write
  .partitionBy('partition_date')
  .mode('overwrite')
  .format("com.databricks.spark.csv")
  .save('/tmp/af_organic'))

but the partition folders were not created. Any idea what I should do so that the Spark DataFrame automatically creates those folders? Thanks,
zero323: Spark 2.0.0+: Built
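The answer is cut off above; for what it's worth, "Spark 2.0.0+: built-in" most likely refers to the native csv data source, which, unlike the external com.databricks.spark.csv package used on 1.x, honors partitionBy and produces the hive-style directories. A hedged PySpark sketch with a tiny made-up DataFrame:

```python
# Sketch: on Spark 2.0+ the built-in csv writer lays files out under
# partition_date=<value>/ directories when partitionBy is used.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("2016-05-03", 1), ("2016-05-04", 2)],
    ["partition_date", "value"],
)

(df.write
    .partitionBy("partition_date")
    .mode("overwrite")
    .format("csv")          # built-in source, not com.databricks.spark.csv
    .option("header", True)
    .save("/tmp/af_organic"))
```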