partitioning

Algorithm to partition a string into substrings including null partitions

我的梦境 submitted on 2019-12-04 19:17:01
The problem: Let P(s) be the set of all possible ways of partitioning a string s into adjacent, possibly null substrings. I'm looking for an elegant algorithm to solve this problem. Background context: Given a tuple of strings (s, w), define P(s) and P(w) as above. There exist particular partitions R ∈ P(s) and T ∈ P(w) that yield the least number of substring Levenshtein (insertion, deletion and substitution) edits. An example: partition the string "foo" into 5 substrings, where ε is a null substring:

[ε, ε, f, o, o]
[ε, f, ε, o, o]
[ε, f, o, ε, o]
[ε, f, o, o, ε]
[f, ε, ε, o, o]
[f, ε, o, ε, o]
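A minimal sketch of such an enumeration in Python (the function name and the combinatorial approach are mine, not from the question): a partition of s into k adjacent, possibly null substrings corresponds to choosing k - 1 cut positions in 0..len(s), where repeated cut positions produce the null parts.

    from itertools import combinations_with_replacement

    def partitions(s, k):
        """Yield every split of s into k adjacent, possibly empty substrings."""
        # k - 1 cut positions in 0..len(s); repeated cuts yield empty parts
        for cuts in combinations_with_replacement(range(len(s) + 1), k - 1):
            bounds = (0,) + cuts + (len(s),)
            yield [s[bounds[i]:bounds[i + 1]] for i in range(k)]

For "foo" and k = 5 this yields C(len(s) + k - 1, k - 1) = C(7, 4) = 35 partitions, including the six listed above.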

Cassandra partition size and performance?

主宰稳场 submitted on 2019-12-04 17:16:12
I was playing around with the cassandra-stress tool on my own laptop (8 cores, 16GB) with Cassandra 2.2.3 installed out of the box in its stock configuration. I was doing exactly what was described here: http://www.datastax.com/dev/blog/improved-cassandra-2-1-stress-tool-benchmark-any-schema and measuring its insert performance. My observations were:

- Using the code from https://gist.github.com/tjake/fb166a659e8fe4c8d4a3 without any modifications, I had ~7000 inserts/sec.
- When modifying line 35 in the code above from "cluster: fixed(1000)" to "cluster: fixed(100)", i.e. configuring my test data

Partitioning a list of integers to minimize difference of their sums

ぃ、小莉子 submitted on 2019-12-04 14:50:57
Given a list of integers l, how can I partition it into two lists a and b such that d(a, b) = abs(sum(a) - sum(b)) is minimal? I know the problem is NP-complete, so I am looking for a pseudo-polynomial time algorithm, i.e. O(c*n) where c = sum(l map abs). I looked at Wikipedia, but the algorithm there partitions the list into exact halves, which is a special case of what I am looking for... EDIT: To clarify, I am looking for the exact partitions a and b, and not just the resulting minimum difference d(a, b). To generalize, what is a pseudo-polynomial time algorithm to partition a list of n numbers
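A sketch of one such pseudo-polynomial algorithm in Python (my own formulation, assuming non-negative integers; negative values would need an offset or a tracked sum range): a subset-sum DP over sums up to sum(l)/2 with parent pointers, so that the actual lists a and b can be reconstructed rather than just d(a, b).

    def min_diff_partition(nums):
        """Split nums into lists (a, b) minimising abs(sum(a) - sum(b)).
        O(n * sum(nums)) time; assumes non-negative integers."""
        half = sum(nums) // 2
        # parent[s] = (previous sum, item index) for one way to reach sum s
        parent = {0: None}
        for idx, x in enumerate(nums):
            for s in list(parent):          # snapshot, so each item is used once
                t = s + x
                if t <= half and t not in parent:
                    parent[t] = (s, idx)
        s = max(parent)                     # reachable sum closest to half
        chosen = set()
        while parent[s] is not None:
            s, idx = parent[s]
            chosen.add(idx)
        a = [x for i, x in enumerate(nums) if i in chosen]
        b = [x for i, x in enumerate(nums) if i not in chosen]
        return a, b

For example, min_diff_partition([1, 5, 9, 3]) returns a partition such as ([9], [1, 5, 3]) with d(a, b) = 0.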

Spark Streaming: How can I add more partitions to my DStream?

痞子三分冷 submitted on 2019-12-04 14:21:43
Question: I have a spark-streaming app which looks like this:

    val message = KafkaUtils.createStream(...).map(_._2)

    message.foreachRDD( rdd => {
      if (!rdd.isEmpty) {
        val kafkaDF = sqlContext.read.json(rdd)
        kafkaDF.foreachPartition( i => {
          createConnection()
          i.foreach( row => {
            connection.sendToTable()
          })
          closeConnection()
        })
      }
    })

And I run it on a YARN cluster using spark-submit --master yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory 2g --executor-cores 5.... When I try to log kafkaDF.rdd
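A minimal PySpark sketch of the usual fix (the host name, group, topic and partition count are placeholders, not from the question; the asker's code is Scala, but a Python sketch keeps the examples here in one language): repartition each micro-batch before the per-partition work, for instance to num-executors × executor-cores = 15 tasks.

    # assumes an existing StreamingContext `ssc`; Spark 1.x/2.x receiver-based Kafka API
    from pyspark.streaming.kafka import KafkaUtils

    stream = KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", {"my-topic": 1})
    messages = stream.map(lambda kv: kv[1])

    # Reshuffle every micro-batch RDD into more partitions, so the
    # per-partition work runs with more parallelism (at the cost of a shuffle):
    messages = messages.repartition(15)

Alternatively, several receiver streams can be created and unioned so the data arrives already spread across more partitions.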

How to force a certain partitioning in a PySpark DataFrame?

微笑、不失礼 submitted on 2019-12-04 13:17:13
Question: Suppose I have a DataFrame with a column partition_id:

    n_partitions = 2

    df = spark.sparkContext.parallelize([
        [1, 'A'], [1, 'B'], [2, 'A'], [2, 'C']
    ]).toDF(('partition_id', 'val'))

How can I repartition the DataFrame to guarantee that each value of partition_id goes to a separate partition, and that there are exactly as many actual partitions as there are distinct values of partition_id? If I do a hash partition, i.e. df.repartition(n_partitions, 'partition_id'), that guarantees the right
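One sketch of forcing this exactly (my own approach, not necessarily the accepted answer; it assumes the distinct partition_id values fit comfortably in driver memory): drop to the RDD level, map each distinct partition_id to an index, partition with a function that returns that index, and rebuild the DataFrame.

    # assumes `df` and `spark` as defined above
    ids = [r[0] for r in df.select('partition_id').distinct().collect()]
    index = {pid: i for i, pid in enumerate(ids)}   # partition_id -> 0..n-1

    exact = (df.rdd
               .keyBy(lambda row: row['partition_id'])
               .partitionBy(len(ids), partitionFunc=lambda pid: index[pid])
               .values())
    df_exact = spark.createDataFrame(exact, df.schema)

df_exact.rdd.glom().map(len).collect() can then be used to inspect how many rows landed in each of the len(ids) partitions.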

Will SQL Server Partitioning increase performance without changing filegroups

為{幸葍}努か submitted on 2019-12-04 13:04:06
Scenario: I have a 10-million-row table. I partition it into 10 partitions, which results in 1 million rows per partition, but I do not do anything else (like moving the partitions to different filegroups or spindles). Will I see a performance increase? Is this in effect like creating 10 smaller tables? If I have queries that perform key lookups or scans, will performance increase as if they were operating against a much smaller table? I'm trying to understand how partitioning is different from just having a well-indexed table, and where it can be used to improve performance. Would a better

Control data locality in Impala by partitioning

那年仲夏 submitted on 2019-12-04 12:53:58
I would like to avoid Impala nodes unnecessarily requesting data from other nodes over the network when the ideal data locality or layout is known at table-creation time. This would be helpful with 'non-additive' operations where all records from a partition are needed at the same place (node) anyway (e.g. percentiles). Is it possible to tell Impala that all data in a partition should always be co-located on a single node for any HDFS replica? In Impala SQL, I am not sure whether the PARTITIONED BY clause provides this feature. In my understanding, Impala chunks its partitions into

In PostgreSQL, are partitions or multiple databases more efficient?

只谈情不闲聊 submitted on 2019-12-04 12:21:30
Question: I have an application in which many companies post information. The data from each company is self-contained: there is no data overlap. Performance-wise, is it better to:

- keep the company ID on each row of each table and have each index use it?
- partition each table according to the company ID?
- partition, and create a user to access each company to ensure security?
- create multiple databases, one for each company?

It is a web-based application with persistent connections. My thoughts: new pg connections are

Optimizing a Partition Function

时光毁灭记忆、已成空白 submitted on 2019-12-04 11:53:52
Question: Here is the code, in Python:

    # function for pentagonal numbers
    def pent(n):
        return int((0.5*n)*((3*n)-1))

    # function for generalized pentagonal numbers
    def gen_pent(n):
        return pent(int(((-1)**(n+1))*(round((n+1)/2))))

    # array for storing partitions - first ten already stored
    partitions = [1, 1, 2, 3, 5, 7, 11, 15, 22, 30, 42]

    # function to generate partitions
    def partition(k):
        if (k < len(partitions)):
            return partitions[k]
        total, sign, i = 0, 1, 1
        while (k - gen_pent(i)) >= 0:
            sign = (-1)*
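The question's code is cut off above, so here is a hedged completion of the same recurrence (my reconstruction, not the asker's original): Euler's pentagonal number theorem gives p(k) = Σ s_i · p(k - g_i), where g_i = 1, 2, 5, 7, 12, 15, ... are the generalized pentagonal numbers and the signs s_i follow the pattern +, +, -, -, +, +, ...

    # reconstruction under the assumptions above; gen_pent is reworked to avoid
    # round(), whose half-to-even behaviour misorders the sequence in Python 3
    def gen_pent(i):
        # i = 1, 2, 3, 4, ... maps to m = 1, -1, 2, -2, ...
        m = (i + 1) // 2 if i % 2 else -(i // 2)
        return m * (3 * m - 1) // 2

    partitions = [1, 1, 2, 3, 5, 7, 11, 15, 22, 30, 42]

    def partition(k):
        # extend the memo table up to k, then answer from it
        while len(partitions) <= k:
            n = len(partitions)
            total, i = 0, 1
            while gen_pent(i) <= n:
                sign = -1 if ((i - 1) // 2) % 2 else 1   # +, +, -, -, ...
                total += sign * partitions[n - gen_pent(i)]
                i += 1
            partitions.append(total)
        return partitions[k]

As a check, partition(11) = 56 and partition(100) = 190569292, matching the standard partition-number sequence.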

SQL Server - is a GUID-based PK the best practice to support tenant-based horizontal partitioning?

谁说胖子不能爱 submitted on 2019-12-04 11:27:24
Question: I'm trying to figure out the best approach when designing a multi-tenant database schema that will need to be horizontally partitioned in the future. Some rough numbers on the database:

- The total number of tenants will be approx. 10,000.
- The amount of data stored per tenant varies between 500MB and 3GB.

The number of tenants will start off small and grow to 10,000 over a few years, so initially we can start with a single multi-tenant database, but in the longer term this will need to scale