partitioning

Why does partition elimination not happen for this query?

我是研究僧i submitted on 2019-11-28 14:37:26
I have a Hive table which is partitioned by year, month, day and hour. I need to run a query against it to fetch the last 7 days of data. This is in Hive 0.14.0.2.2.4.2-2. My query currently looks like this:

SELECT COUNT(column_name) FROM table_name
WHERE year >= year(date_sub(from_unixtime(unix_timestamp()), 7))
  AND month >= month(date_sub(from_unixtime(unix_timestamp()), 7))
  AND day >= day(date_sub(from_unixtime(unix_timestamp()), 7));

This takes a very long time. When I substitute the actual numbers for the above, say something like: SELECT COUNT(column_name) from table_name where year >=
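The excerpt is cut off, but the direction it hints at can be sketched: unix_timestamp() is evaluated at runtime, so Hive 0.14 likely cannot fold these predicates into constants and ends up scanning every partition, whereas literal values can be pruned. A minimal Python sketch of building the query with literals on the client side, assuming integer partition columns named year/month/day (it also sidesteps the fact that separate year >= / month >= / day >= comparisons do not really express "last 7 days" across month boundaries):

```python
# Hypothetical helper: compute the last 7 dates in the client and emit literal
# (year, month, day) tuples so the Hive planner sees constants it can prune on.
# Table and column names are taken from the question; the rest is an assumption.
from datetime import date, timedelta

def last_7_days_query(table="table_name", column="column_name", today=None):
    today = today or date.today()
    days = [today - timedelta(days=i) for i in range(7)]
    preds = " OR ".join(
        f"(year = {d.year} AND month = {d.month} AND day = {d.day})" for d in days
    )
    return f"SELECT COUNT({column}) FROM {table} WHERE {preds};"

print(last_7_days_query())
```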

MySQL Partitioning / Sharding / Splitting - which way to go?

耗尽温柔 submitted on 2019-11-28 14:37:00
Question: We have an InnoDB database that is about 70 GB, and we expect it to grow to several hundred GB in the next 2 to 3 years. About 60% of the data belongs to a single table. Currently the database is working quite well, as we have a server with 64 GB of RAM, so almost the whole database fits into memory, but we're concerned about the future, when the amount of data will be considerably larger. Right now we're considering some way of splitting up the tables (especially the one that accounts for the
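Not part of the original excerpt: of the three options in the title, sharding means routing rows of the big table to one of several database servers by a sharding key. A minimal sketch of that routing idea, with hypothetical connection strings and key names:

```python
# Minimal sharding sketch (all names hypothetical): pick one of N MySQL
# databases by hashing the sharding key. Range- or directory-based routing
# are common alternatives when rebalancing matters.
import hashlib

SHARD_DSNS = [
    "mysql://db0.example.internal/app",
    "mysql://db1.example.internal/app",
    "mysql://db2.example.internal/app",
    "mysql://db3.example.internal/app",
]

def shard_for(key: str) -> str:
    """Stable shard choice as long as the shard count stays fixed."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARD_DSNS[int(digest, 16) % len(SHARD_DSNS)]

# Example: every row belonging to customer 42 lands in the same database.
print(shard_for("customer:42"))
```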

How can I ensure that a partition has representative observations from each level of a factor?

时光总嘲笑我的痴心妄想 submitted on 2019-11-28 10:28:39
I wrote a small function to partition my dataset into training and testing sets. However, I am running into trouble when dealing with factor variables. In the model validation phase of my code, I get an error if the model was built on a dataset that doesn't have representation from each level of a factor. How can I fix this partition() function to include at least one observation from every level of a factor variable?

test.df <- data.frame(a = sample(c(0,1), 100, rep = T),
                      b = factor(sample(letters, 100, rep = T)),
                      c = factor(sample(c("apple", "orange"), 100, rep = T)))
set.seed(123)
partition
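The question is about R, but the underlying idea is a stratified split: sample within each factor level so every level is guaranteed to appear in the training partition. A sketch of that idea in Python/pandas (not the asker's partition() function; all names here are illustrative):

```python
# Stratified train/test split sketch: take at least one row from every level
# of the chosen factor column into the training set, the rest goes to test.
import pandas as pd

def stratified_split(df: pd.DataFrame, factor_col: str,
                     train_frac: float = 0.8, seed: int = 123):
    train_parts = []
    for _, group in df.groupby(factor_col):
        # At least one row per level, otherwise train_frac of the level.
        n = max(1, int(round(len(group) * train_frac)))
        train_parts.append(group.sample(n=n, random_state=seed))
    train = pd.concat(train_parts)
    test = df.drop(train.index)
    return train, test
```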

which algorithm can do a stable in-place binary partition with only O(N) moves?

帅比萌擦擦* submitted on 2019-11-28 08:19:53
I'm trying to understand this paper: Stable minimum space partitioning in linear time. It seems that a critical part of the claim is that Algorithm B sorts a bit-array of size n stably in O(n log₂ n) time and constant extra space, but makes only O(n) moves. However, the paper doesn't describe the algorithm, but only references another paper which I don't have access to. I can find several ways to do the sort within the time bounds, but I'm having trouble finding one that guarantees O(N) moves without also requiring more than constant space. What is this Algorithm B? In other words, given
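For contrast only (this is not the paper's Algorithm B, just the trivial baseline the question is implicitly comparing against): with O(n) extra space, a stable binary partition with O(n) moves is easy, which is exactly why dropping the buffer while keeping the move bound is the hard part. A Python sketch:

```python
# Trivial baseline (NOT Algorithm B): stable binary partition with O(n) moves
# but O(n) extra space for the two buffers. The open question in the post is
# achieving the same move bound with only constant extra space.
def stable_partition_with_buffer(items, pred):
    left = [x for x in items if pred(x)]       # relative order preserved
    right = [x for x in items if not pred(x)]  # relative order preserved
    items[:] = left + right                    # each element moved O(1) times
    return len(left)                           # index of first element failing pred

data = [5, 2, 8, 1, 9, 4]
split = stable_partition_with_buffer(data, lambda x: x % 2 == 0)
print(data, split)  # evens first, both groups keep their original order
```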

Does Spark know the partitioning key of a DataFrame?

做~自己de王妃 submitted on 2019-11-28 06:33:33
I want to know if Spark knows the partitioning key of the parquet file and uses this information to avoid shuffles. Context: running Spark 2.0.1 with a local SparkSession. I have a csv dataset that I am saving as a parquet file on my disk like so:

val df0 = spark
  .read
  .format("csv")
  .option("header", true)
  .option("delimiter", ";")
  .option("inferSchema", false)
  .load("SomeFile.csv")

val df = df0.repartition(partitionExprs = col("numerocarte"), numPartitions = 42)

df.write
  .mode(SaveMode.Overwrite)
  .format("parquet")
  .option("inferSchema", false)
  .save("SomeFile.parquet")

I am creating 42
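Not from the original post: one way to see what Spark actually knows is to read the parquet back and inspect the physical plan of an aggregation on that column. If an Exchange node appears, Spark is shuffling, i.e. the earlier repartition() was not carried through the parquet files. A PySpark sketch reusing the file and column names from the question:

```python
# Sketch: check whether Spark inserts a shuffle ("Exchange") when grouping by
# the column that the data was repartitioned on before writing.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("partition-check").getOrCreate()

df = spark.read.parquet("SomeFile.parquet")
df.groupBy("numerocarte").count().explain()
# If the plan shows "Exchange hashpartitioning(numerocarte, ...)", Spark is
# shuffling: parquet does not record the DataFrame's partitioner, so the
# repartition() done before the write is not known after reading back.
```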

How to handle id generation on a hadoop cluster?

試著忘記壹切 submitted on 2019-11-28 05:59:09
Question: I am building a dictionary on a Hadoop cluster and need to generate a numeric id for each token. How should I do it? Answer 1: You have two problems. First you want to make sure that you assign exactly one id to each token. To do that you should sort and group the records by token and make the assignment in a reducer. Once you've made sure that the reducer method is called exactly once for each token, you can use the partition number from the context and a unique numeric id maintained by the reducer
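A sketch of the id scheme the answer describes, written as a plain Python function rather than actual Hadoop reducer code; partition_id and num_partitions are assumed to come from the reducer's context:

```python
# Composite-id sketch: each reducer hands out ids from its own arithmetic
# progression, so ids are globally unique without any coordination.
def assign_token_ids(sorted_tokens, partition_id, num_partitions):
    counter = 0
    for token in sorted_tokens:   # tokens arrive grouped/sorted, one per reduce call
        token_id = counter * num_partitions + partition_id
        counter += 1
        yield token, token_id

# With 4 reducers, partition 1 emits ids 1, 5, 9, ...; partition 2 emits 2, 6, 10, ...
for tok, tid in assign_token_ids(["apple", "banana", "cherry"],
                                 partition_id=1, num_partitions=4):
    print(tok, tid)
```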

Partition Hive table by existing field?

南笙酒味 submitted on 2019-11-28 03:55:23
Question: Can I partition a Hive table upon insert by an existing field? I have a 10 GB file with a date field and an hour-of-day field. Can I load this file into a table, then insert-overwrite into another partitioned table that uses those fields as a partition? Would something like the following work?

INSERT OVERWRITE TABLE tealeaf_event PARTITION(dt=evt.datestring, hour=evt.hour)
SELECT * FROM staging_event evt;

Thanks! Travis
Answer 1: I just ran across this trying to answer the same question and it was
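The answer is truncated above, so for reference only: Hive's dynamic-partition insert has a specific shape. The PARTITION clause names the partition columns without values, the SELECT supplies them as its last columns, and dynamic partitioning must be enabled. A sketch issued through a Hive-enabled PySpark session for concreteness (the same HiveQL runs in the Hive CLI); the non-partition column names are placeholders:

```python
# Sketch of a Hive dynamic-partition insert. Table and partition column names
# come from the question; col1/col2 are assumed placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

spark.sql("""
    INSERT OVERWRITE TABLE tealeaf_event PARTITION (dt, hour)
    SELECT col1, col2, evt.datestring AS dt, evt.hour AS hour
    FROM staging_event evt
""")
```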

Table with 80 million records and adding an index takes more than 18 hours (or forever)! Now what?

坚强是说给别人听的谎言 submitted on 2019-11-28 03:41:28
A short recap of what happened. I am working with 71 million records (not much compared to the billions of records processed by others). On a different thread, someone suggested that the current setup of my cluster is not suitable for my needs. My table structure is:

CREATE TABLE `IPAddresses` (
  `id` int(11) unsigned NOT NULL auto_increment,
  `ipaddress` bigint(20) unsigned default NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM;

I added the 71 million records and then did a:

ALTER TABLE IPAddresses ADD INDEX(ipaddress);

It's been 14 hours and the operation is still not completed. Upon Googling, I

Handling very large data with mysql

穿精又带淫゛_ submitted on 2019-11-28 03:09:25
Sorry for the long post! I have a database containing ~30 tables (InnoDB engine). Only two of these tables, namely "transaction" and "shift", are quite large (the first one has 1.5 million rows and shift has 23k rows). Everything works fine now and I don't have a problem with the current database size. However, we will have a similar database (same datatypes, design, ...) but much larger, e.g., the "transaction" table will have about 1 billion records (about 2.3 million transactions per day), and we are thinking about how we should deal with such a volume of data in MySQL. (It is both read and write

Write Spark dataframe as CSV with partitions

限于喜欢 submitted on 2019-11-28 01:55:16
I'm trying to write a dataframe in Spark to an HDFS location, and I expect that if I add the partitionBy notation, Spark will create partition folders (similar to writing in Parquet format) in the form partition_column_name=partition_value (i.e. partition_date=2016-05-03). To do so, I ran the following command:

(df.write
  .partitionBy('partition_date')
  .mode('overwrite')
  .format("com.databricks.spark.csv")
  .save('/tmp/af_organic'))

but the partition folders were not created. Any idea what I should do so that the Spark DataFrame automatically creates those folders? Thanks,
zero323: Spark 2.0.0+: Built
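The answer is cut off above; for what it's worth, "Spark 2.0.0+: built-in" most likely refers to the native csv data source, which, unlike the external com.databricks.spark.csv package used on 1.x, honors partitionBy and produces the hive-style directories. A hedged PySpark sketch with a tiny made-up DataFrame:

```python
# Sketch: on Spark 2.0+ the built-in csv writer lays files out under
# partition_date=<value>/ directories when partitionBy is used.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("2016-05-03", 1), ("2016-05-04", 2)],
    ["partition_date", "value"],
)

(df.write
    .partitionBy("partition_date")
    .mode("overwrite")
    .format("csv")          # built-in source, not com.databricks.spark.csv
    .option("header", True)
    .save("/tmp/af_organic"))
```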