partitioning

Partitioning by Year vs. separate tables named Data_2011, Data_2010, etc

Submitted by 岁酱吖の on 2019-12-01 00:49:34
We are designing a high-volume SQL Server application that involves processing and reporting on data that is restricted to a specified year. Partitioning by year comes to mind. Another suggestion is to programmatically create separate physical tables whose name suffix is the year and, when reporting is needed across years, to provide a view that is the union of the physical tables. My gut tells me that this situation is what partitioning is designed to handle. Are there any advantages to the other approach? From an internals perspective, the methods are essentially the…

Spark: Order of column arguments in repartition vs partitionBy

Submitted by 你说的曾经没有我的故事 on 2019-11-30 22:24:10
Methods taken into consideration (Spark 2.2.1): DataFrame.repartition (the two overloads that take partitionExprs: Column* parameters) and DataFrameWriter.partitionBy. Note: this question does not ask about the difference between these methods. From the docs of partitionBy: "If specified, the output is laid out on the file system similar to Hive's partitioning scheme." As an example, when we partition a Dataset by year and then month, the directory layout would look like: year=2016/month=01/ year=2016/month=02/. From this, I infer that the order of the column arguments decides the directory layout;…
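That inference can be checked with a quick write, as sketched below. The data, output paths, and column names are hypothetical; the point is only to show how the order of the partitionBy arguments maps to directory nesting, and that repartition on its own only shapes the in-memory shuffle, not the on-disk layout.

```scala
import org.apache.spark.sql.SparkSession

object PartitionByOrderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partitionBy-order").getOrCreate()
    import spark.implicits._

    // Hypothetical dataset with year and month columns.
    val df = Seq((2016, 1, "a"), (2016, 2, "b"), (2017, 1, "c")).toDF("year", "month", "value")

    // partitionBy(year, month) nests month directories under year directories:
    //   /tmp/out1/year=2016/month=1/part-*.parquet
    df.write.partitionBy("year", "month").parquet("/tmp/out1")

    // Reversing the argument order reverses the nesting:
    //   /tmp/out2/month=1/year=2016/part-*.parquet
    df.write.partitionBy("month", "year").parquet("/tmp/out2")

    // repartition(col(...)) only controls how rows are shuffled into in-memory
    // partitions before the write; the directory layout still comes from partitionBy.
    df.repartition($"year", $"month").write.partitionBy("year", "month").parquet("/tmp/out3")
  }
}
```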

spark read partitioned data in S3 partly in glacier

Submitted by 早过忘川 on 2019-11-30 21:39:02
I have a dataset in Parquet on S3, partitioned by date (dt), with the oldest dates stored in AWS Glacier to save some money. For instance, we have... s3://my-bucket/my-dataset/dt=2017-07-01/ [in Glacier] ... s3://my-bucket/my-dataset/dt=2017-07-09/ [in Glacier] s3://my-bucket/my-dataset/dt=2017-07-10/ [not in Glacier] ... s3://my-bucket/my-dataset/dt=2017-07-24/ [not in Glacier] I want to read this dataset, but only the subset of dates that are not yet in Glacier, e.g.: val from = "2017-07-15" val to = "2017-08-24" val path = "s3://my-bucket/my-dataset/" val X = spark.read.parquet(path).where(col("dt"…
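One common workaround, sketched below for the layout described above, is to enumerate only the wanted partition directories and pass them to the reader, so Spark never lists or opens the Glacier-archived prefixes. The date range and the use of the basePath option here are illustrative assumptions, not a guaranteed fix for every Glacier setup.

```scala
import java.time.LocalDate
import org.apache.spark.sql.SparkSession

object ReadNonGlacierPartitions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("skip-glacier-partitions").getOrCreate()

    // Hypothetical bucket layout matching the question: dt=YYYY-MM-DD partitions.
    val basePath = "s3://my-bucket/my-dataset/"
    val from = LocalDate.parse("2017-07-15")
    val to   = LocalDate.parse("2017-07-24")

    // Enumerate only the partition directories we actually want, so Spark never
    // touches the objects that were archived to Glacier.
    val days  = Iterator.iterate(from)(_.plusDays(1)).takeWhile(!_.isAfter(to)).toSeq
    val paths = days.map(d => s"${basePath}dt=$d/")

    // "basePath" keeps dt available as a partition column even though leaf paths are passed.
    val df = spark.read
      .option("basePath", basePath)
      .parquet(paths: _*)

    df.show()
  }
}
```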

Grouping lists into groups of X items per group

Submitted by 老子叫甜甜 on 2019-11-30 20:27:47
I'm having trouble deciding the best way to write a method that groups a list of items into groups of (for example) no more than 3 items. I've created the method below, but without doing a ToList on the group before I return it, I have a problem if the list is enumerated multiple times. The first enumeration is correct, but any additional enumeration is thrown off because the two variables (i and groupKey) appear to be remembered between the iterations. So the questions are: Is there a better way to do what I'm trying to achieve? Is simply ToListing the resulting group before it…
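The question itself is about C#/LINQ; purely as a point of comparison, here is a minimal Scala sketch (Scala being the language used in the Spark excerpts on this page) of the same chunking idea, where each group is materialized eagerly so repeated enumeration cannot shift the results.

```scala
object GroupIntoChunks {
  def main(args: Array[String]): Unit = {
    val items = (1 to 10).toList

    // grouped(3) yields chunks of at most 3 items; calling toList materializes every
    // chunk eagerly, so re-enumerating the result cannot change it.
    val groups: List[List[Int]] = items.grouped(3).toList

    groups.foreach(g => println(g.mkString(", ")))
    // 1, 2, 3
    // 4, 5, 6
    // 7, 8, 9
    // 10
  }
}
```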

Partitioning by Year vs. separate tables named Data_2011, Data_2010, etc

Submitted by 淺唱寂寞╮ on 2019-11-30 18:49:39
Question: We are designing a high-volume SQL Server application that involves processing and reporting on data that is restricted to a specified year. Partitioning by year comes to mind. Another suggestion is to programmatically create separate physical tables whose name suffix is the year and, when reporting is needed across years, to provide a view that is the union of the physical tables. My gut tells me that this situation is what partitioning is designed to handle. Are there any…

How does range partitioner work in Spark?

Submitted by 雨燕双飞 on 2019-11-30 15:38:56
Question: I'm not clear on how the range partitioner works in Spark. It uses reservoir sampling to take samples, and I was confused by the way the boundaries of the input are computed. // This is the sample size we need to have roughly balanced output partitions, capped at 1M. val sampleSize = math.min(20.0 * partitions, 1e6) // Assume the input partitions are roughly balanced and over-sample a little bit. val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt Why…

How does range partitioner work in Spark?

Submitted by 余生颓废 on 2019-11-30 15:10:34
I'm not clear on how the range partitioner works in Spark. It uses reservoir sampling to take samples, and I was confused by the way the boundaries of the input are computed. // This is the sample size we need to have roughly balanced output partitions, capped at 1M. val sampleSize = math.min(20.0 * partitions, 1e6) // Assume the input partitions are roughly balanced and over-sample a little bit. val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt Why should the calculated sampleSize be multiplied by 3.0? And how are the boundaries obtained? Can someone show me some…
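As a rough illustration of what those two quoted lines feed into, below is a simplified, self-contained sketch of deriving range boundaries from a sample. It is not the actual RangePartitioner implementation (which also weights samples by partition size and re-samples skewed partitions); the 3x factor is an over-sampling margin against input partitions that hold more data than the "roughly balanced" assumption predicts.

```scala
// Simplified sketch, not Spark's real RangePartitioner.
object RangeBoundsSketch {
  def main(args: Array[String]): Unit = {
    val partitions         = 4   // desired output partitions
    val numInputPartitions = 8   // partitions of the input RDD

    // The two lines quoted from the Spark source:
    val sampleSize = math.min(20.0 * partitions, 1e6)
    // Over-sample by 3x so that unexpectedly large input partitions still
    // contribute enough sampled keys.
    val sampleSizePerPartition = math.ceil(3.0 * sampleSize / numInputPartitions).toInt

    // Pretend these integers are the keys sampled across all input partitions.
    val sampledKeys =
      scala.util.Random.shuffle((1 to 1000).toList).take(sampleSizePerPartition * numInputPartitions)

    // Boundaries: sort the sample and pick (partitions - 1) evenly spaced cut points.
    val sorted = sampledKeys.sorted
    val bounds = (1 until partitions).map(i => sorted((i * sorted.length) / partitions))

    println(s"sampleSize=$sampleSize perPartition=$sampleSizePerPartition bounds=$bounds")
  }
}
```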

How to perform one operation on each executor once in spark

Submitted by 戏子无情 on 2019-11-30 11:00:48
Question: I have a Weka model stored in S3 that is around 400 MB in size. Now I have a set of records on which I want to run the model and perform prediction. What I have tried for performing prediction: download and load the model on the driver as a static object, broadcast it to all executors, and perform a map operation on the prediction RDD. ----> Not working, because in Weka the model object needs to be modified to perform a prediction, and a broadcast only provides a read-only copy. Download and load the model on…
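A pattern often suggested for this kind of situation is to skip the broadcast and load the model lazily inside a singleton object: the lazy val is initialized at most once per executor JVM, and every task on that executor reuses the already-loaded, locally mutable instance. The sketch below uses a placeholder Model class and load function, not the real Weka API.

```scala
import org.apache.spark.sql.SparkSession

// Placeholder model type standing in for the real Weka classifier.
class Model(val path: String) extends Serializable {
  def predict(record: String): Double = record.length.toDouble // dummy prediction
}

object ModelHolder {
  // Initialized at most once per executor JVM, on first access.
  lazy val model: Model = loadModelFromS3("s3://my-bucket/model.bin")

  def loadModelFromS3(path: String): Model = {
    // ... download the ~400 MB model and deserialize it (omitted) ...
    new Model(path)
  }
}

object OncePerExecutor {
  def main(args: Array[String]): Unit = {
    val spark   = SparkSession.builder().appName("once-per-executor").getOrCreate()
    val records = spark.sparkContext.parallelize(Seq("a", "bb", "ccc"))

    // mapPartitions touches ModelHolder once per task; the lazy val guarantees the
    // expensive load happens only the first time on each executor.
    val predictions = records.mapPartitions { it =>
      val m = ModelHolder.model
      it.map(m.predict)
    }
    predictions.collect().foreach(println)
  }
}
```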

How to write an image containing multiple partitions to a USB flash drive on Windows using C++

Submitted by 无人久伴 on 2019-11-30 10:29:13
On Windows, you can only see the first partition on removable media. I want to write a C++ program that can write an image containing an MBR and two partitions of data to a USB flash drive. I don't need the second partition to be viewable in Windows; I just need to be able to write this raw image to the USB flash drive from Windows/C++ such that later, when the drive is used on Linux, the two partitions can be seen. I have read about installing a filter driver that would end up treating the removable media as fixed, which would be nice for reading, but I just want to write this image with as little interference…

MySQL Proxy Alternatives for Database Sharding

Submitted by 二次信任 on 2019-11-30 07:42:40
Are there any alternatives to MySQL Proxy? I don't want to use it since it's still in alpha. I will have 10 MySQL servers, with table_1, table_2, table_3, table_4, ..., table_10 spread across the 10 servers. Each table is identical in structure; they are just shards with different data sets. Is there an alternative to MySQL Proxy where I can have my client application connect to a single SQL server (a proxy), which looks at the query and fetches the data on its behalf? For example, if the client requests "SELECT * FROM table_5 WHERE user=123" from the proxy, it connects to the 5th SQL…
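If no off-the-shelf proxy fits, the routing can also live in the application layer. The sketch below assumes the table_N-on-server-N convention from the question and invented host names (mysql-shard-1 ... mysql-shard-10); it simply extracts the shard number from the query and opens a JDBC connection to the matching server.

```scala
import java.sql.{Connection, DriverManager}

// Application-side shard routing sketch; host names and credentials are placeholders.
object ShardRouter {
  private val TablePattern = """(?i).*\bfrom\s+table_(\d+)\b.*""".r

  def connectionFor(query: String): Connection = {
    val shard = query match {
      case TablePattern(n) => n.toInt
      case _               => throw new IllegalArgumentException(s"no shard table in: $query")
    }
    // Hypothetical host naming: mysql-shard-1 ... mysql-shard-10.
    DriverManager.getConnection(s"jdbc:mysql://mysql-shard-$shard:3306/mydb", "user", "password")
  }

  def main(args: Array[String]): Unit = {
    val query = "SELECT * FROM table_5 WHERE user=123"
    val conn  = connectionFor(query)              // routes to mysql-shard-5
    val rs    = conn.createStatement().executeQuery(query)
    while (rs.next()) println(rs.getInt("user"))
    conn.close()
  }
}
```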