partitioning

Algorithm to partition a string into substrings including null partitions

我的梦境 submitted on 2019-12-04 19:17:01
The problem: Let P(s) be the set of all possible ways of partitioning a string s into adjacent, possibly null substrings. I'm looking for an elegant algorithm to solve this problem. Background context: Given a tuple of strings (s, w), define P(s) and P(w) as above. There exist particular partitions R ∈ P(s) and T ∈ P(w) that yield the least number of substring Levenshtein (insertion, deletion and substitution) edits. An example: partition the string "foo" into 5 substrings, where ε is a null substring:

[ε, ε, f, o, o]
[ε, f, ε, o, o]
[ε, f, o, ε, o]
[ε, f, o, o, ε]
[f, ε, ε, o, o]
[f, ε, o, ε, o]
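A minimal sketch of such an enumeration in Python (the function name and the combinatorial approach are mine, not from the question): a partition of s into k adjacent, possibly null substrings corresponds to choosing k - 1 cut positions in 0..len(s), where repeated cut positions produce the null parts.

    from itertools import combinations_with_replacement

    def partitions(s, k):
        """Yield every split of s into k adjacent, possibly empty substrings."""
        # k - 1 cut positions in 0..len(s); repeated cuts yield empty parts
        for cuts in combinations_with_replacement(range(len(s) + 1), k - 1):
            bounds = (0,) + cuts + (len(s),)
            yield [s[bounds[i]:bounds[i + 1]] for i in range(k)]

For "foo" and k = 5 this yields C(len(s) + k - 1, k - 1) = C(7, 4) = 35 partitions, including the six listed above.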

Cassandra partition size and performance?

主宰稳场 submitted on 2019-12-04 17:16:12
I was playing around with the cassandra-stress tool on my own laptop (8 cores, 16GB) with Cassandra 2.2.3 installed out of the box in its stock configuration. I was doing exactly what was described here: http://www.datastax.com/dev/blog/improved-cassandra-2-1-stress-tool-benchmark-any-schema and measuring its insert performance. My observations were:

- Using the code from https://gist.github.com/tjake/fb166a659e8fe4c8d4a3 without any modifications, I had ~7000 inserts/sec.
- When modifying line 35 in the code above from "cluster: fixed(1000)" to "cluster: fixed(100)", i.e. configuring my test data

Partitioning a list of integers to minimize difference of their sums

ぃ、小莉子 submitted on 2019-12-04 14:50:57
Given a list of integers l, how can I partition it into two lists a and b such that d(a, b) = abs(sum(a) - sum(b)) is minimal? I know the problem is NP-complete, so I am looking for a pseudo-polynomial time algorithm, i.e. O(c*n) where c = sum(l map abs). I looked at Wikipedia, but the algorithm there partitions the list into exact halves, which is a special case of what I am looking for... EDIT: To clarify, I am looking for the exact partitions a and b, and not just the resulting minimum difference d(a, b). To generalize, what is a pseudo-polynomial time algorithm to partition a list of n numbers
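A sketch of one such pseudo-polynomial algorithm in Python (my own formulation, assuming non-negative integers; negative values would need an offset or a tracked sum range): a subset-sum DP over sums up to sum(l)/2 with parent pointers, so that the actual lists a and b can be reconstructed rather than just d(a, b).

    def min_diff_partition(nums):
        """Split nums into lists (a, b) minimising abs(sum(a) - sum(b)).
        O(n * sum(nums)) time; assumes non-negative integers."""
        half = sum(nums) // 2
        # parent[s] = (previous sum, item index) for one way to reach sum s
        parent = {0: None}
        for idx, x in enumerate(nums):
            for s in list(parent):          # snapshot, so each item is used once
                t = s + x
                if t <= half and t not in parent:
                    parent[t] = (s, idx)
        s = max(parent)                     # reachable sum closest to half
        chosen = set()
        while parent[s] is not None:
            s, idx = parent[s]
            chosen.add(idx)
        a = [x for i, x in enumerate(nums) if i in chosen]
        b = [x for i, x in enumerate(nums) if i not in chosen]
        return a, b

For example, min_diff_partition([1, 5, 9, 3]) returns a partition such as ([9], [1, 5, 3]) with d(a, b) = 0.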

Spark Streaming: How can I add more partitions to my DStream?

痞子三分冷 submitted on 2019-12-04 14:21:43
Question: I have a spark-streaming app which looks like this:

    val message = KafkaUtils.createStream(...).map(_._2)

    message.foreachRDD( rdd => {
      if (!rdd.isEmpty) {
        val kafkaDF = sqlContext.read.json(rdd)
        kafkaDF.foreachPartition( i => {
          createConnection()
          i.foreach( row => {
            connection.sendToTable()
          })
          closeConnection()
        })
      }
    })

And I run it on a YARN cluster using spark-submit --master yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory 2g --executor-cores 5.... When I try to log kafkaDF.rdd
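A minimal PySpark sketch of the usual fix (the host name, group, topic and partition count are placeholders, not from the question; the asker's code is Scala, but a Python sketch keeps the examples here in one language): repartition each micro-batch before the per-partition work, for instance to num-executors × executor-cores = 15 tasks.

    # assumes an existing StreamingContext `ssc`; Spark 1.x/2.x receiver-based Kafka API
    from pyspark.streaming.kafka import KafkaUtils

    stream = KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", {"my-topic": 1})
    messages = stream.map(lambda kv: kv[1])

    # Reshuffle every micro-batch RDD into more partitions, so the
    # per-partition work runs with more parallelism (at the cost of a shuffle):
    messages = messages.repartition(15)

Alternatively, several receiver streams can be created and unioned so the data arrives already spread across more partitions.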

How to force a certain partitioning in a PySpark DataFrame?

微笑、不失礼 submitted on 2019-12-04 13:17:13
Question: Suppose I have a DataFrame with a column partition_id:

    n_partitions = 2

    df = spark.sparkContext.parallelize([
        [1, 'A'], [1, 'B'], [2, 'A'], [2, 'C']
    ]).toDF(('partition_id', 'val'))

How can I repartition the DataFrame to guarantee that each value of partition_id goes to a separate partition, and that there are exactly as many actual partitions as there are distinct values of partition_id? If I do a hash partition, i.e. df.repartition(n_partitions, 'partition_id'), that guarantees the right
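One sketch of forcing this exactly (my own approach, not necessarily the accepted answer; it assumes the distinct partition_id values fit comfortably in driver memory): drop to the RDD level, map each distinct partition_id to an index, partition with a function that returns that index, and rebuild the DataFrame.

    # assumes `df` and `spark` as defined above
    ids = [r[0] for r in df.select('partition_id').distinct().collect()]
    index = {pid: i for i, pid in enumerate(ids)}   # partition_id -> 0..n-1

    exact = (df.rdd
               .keyBy(lambda row: row['partition_id'])
               .partitionBy(len(ids), partitionFunc=lambda pid: index[pid])
               .values())
    df_exact = spark.createDataFrame(exact, df.schema)

df_exact.rdd.glom().map(len).collect() can then be used to inspect how many rows landed in each of the len(ids) partitions.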

Will SQL Server Partitioning increase performance without changing filegroups

為{幸葍}努か submitted on 2019-12-04 13:04:06
Scenario: I have a 10-million-row table. I partition it into 10 partitions, which results in 1 million rows per partition, but I do not do anything else (like moving the partitions to different filegroups or spindles). Will I see a performance increase? Is this in effect like creating 10 smaller tables? If I have queries that perform key lookups or scans, will performance increase as if they were operating against a much smaller table? I'm trying to understand how partitioning is different from just having a well-indexed table, and where it can be used to improve performance. Would a better

Control data locality in Impala by partitioning

那年仲夏 submitted on 2019-12-04 12:53:58
I would like to avoid Impala nodes unnecessarily requesting data from other nodes over the network when the ideal data locality or layout is known at table-creation time. This would be helpful with 'non-additive' operations where all records from a partition are needed at the same place (node) anyway (e.g. percentiles). Is it possible to tell Impala that all data in a partition should always be co-located on a single node for any HDFS replica? In Impala SQL, I am not sure whether the PARTITIONED BY clause provides this feature. In my understanding, Impala chunks its partitions into

In PostgreSQL, are partitions or multiple databases more efficient?

只谈情不闲聊 submitted on 2019-12-04 12:21:30
Question: I have an application in which many companies post information. The data from each company is self-contained: there is no data overlap. Performance-wise, is it better to:

- keep the company ID on each row of each table and have each index use it?
- partition each table according to the company ID?
- partition, and create a user to access each company to ensure security?
- create multiple databases, one for each company?

It is a web-based application with persistent connections. My thoughts: new pg connections are

Optimizing a Partition Function

时光毁灭记忆、已成空白 submitted on 2019-12-04 11:53:52
Question: Here is the code, in Python:

    # function for pentagonal numbers
    def pent(n):
        return int((0.5*n)*((3*n)-1))

    # function for generalized pentagonal numbers
    def gen_pent(n):
        return pent(int(((-1)**(n+1))*(round((n+1)/2))))

    # array for storing partitions - first ten already stored
    partitions = [1, 1, 2, 3, 5, 7, 11, 15, 22, 30, 42]

    # function to generate partitions
    def partition(k):
        if (k < len(partitions)):
            return partitions[k]
        total, sign, i = 0, 1, 1
        while (k - gen_pent(i)) >= 0:
            sign = (-1)*
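The question's code is cut off above, so here is a hedged completion of the same recurrence (my reconstruction, not the asker's original): Euler's pentagonal number theorem gives p(k) = Σ s_i · p(k - g_i), where g_i = 1, 2, 5, 7, 12, 15, ... are the generalized pentagonal numbers and the signs s_i follow the pattern +, +, -, -, +, +, ...

    # reconstruction under the assumptions above; gen_pent is reworked to avoid
    # round(), whose half-to-even behaviour misorders the sequence in Python 3
    def gen_pent(i):
        # i = 1, 2, 3, 4, ... maps to m = 1, -1, 2, -2, ...
        m = (i + 1) // 2 if i % 2 else -(i // 2)
        return m * (3 * m - 1) // 2

    partitions = [1, 1, 2, 3, 5, 7, 11, 15, 22, 30, 42]

    def partition(k):
        # extend the memo table up to k, then answer from it
        while len(partitions) <= k:
            n = len(partitions)
            total, i = 0, 1
            while gen_pent(i) <= n:
                sign = -1 if ((i - 1) // 2) % 2 else 1   # +, +, -, -, ...
                total += sign * partitions[n - gen_pent(i)]
                i += 1
            partitions.append(total)
        return partitions[k]

As a check, partition(11) = 56 and partition(100) = 190569292, matching the standard partition-number sequence.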

SQL Server - is a GUID-based PK the best practice to support tenant-based horizontal partitioning?

谁说胖子不能爱 submitted on 2019-12-04 11:27:24
Question: I'm trying to figure out the best approach when designing a multi-tenant database schema that will need to be horizontally partitioned in the future. Some rough numbers on the database:

- The total number of tenants will be approx. 10,000.
- The amount of data stored per tenant varies between 500MB and 3GB.

The number of tenants will start off small and grow to 10,000 over a few years, so initially we can start with a single multi-tenant database, but in the longer term this will need to scale