I am looking to create a special type of combination in which no two sets have more than one intersecting element. Let me explain with an example:
Let us say we have 9 letter set that contains A, B, C, D, E, F, G, H, and I
If you create the standard non-repeating combinations of three letters you will have 9C3 sets. These will contain sets like ABC, ABD, BCD, etc. I am looking to create sets that have at the most only 1 common letter. So in this example, we will get following sets:
ABC, ADG, AEI, AFH, BEH, BFG, BDI, CFI, CDH, CEG, DEF, and GHI - note that if you take any two sets there are no more than 1 repeating letter.
What would be a good way to generate such sets? It should be scalable solution so that I can do it for a set of 1000 letters, with sub set size of 4.
Any help is highly appreciated.
Thanks
I had to add another answer as the other one was too long already.
If you have the following constraints:
1) You need groups of 4 every week.
2) Each group in a certain week is disjoint and each student is in exactly one group.
3) If two students are in the same group, they need cannot be in the same group in future.
If you construct a graph G as follows:
1) Each student is a node.
2) Two students are joined by an edge iff they haven't been together in a group before.
With students dropping/joining arbitrarily, this becomes a hard problem! Even though you start out with a complete graph intially, after some weeks, the graph could become quite unpredictable.
Your problem: You need to find a spanning subgraph of G, such that it is a union of copies of K_4 or in other words a partition into K_4s.
Unfortunately, look like this problem is NP-Hard: Exact cover by 4-sets (which is NP-Complete) can be reduced to your problem (just like Exact cover by 3-sets can be reduced to partition into triangles).
Perhaps this will help give some aproximation algorithms: http://www.siam.org/proceedings/soda/2010/SODA10_122_chany.pdf
(Your problem can be reduced to Hypergraph matching and so algorithms for that can be used for your problem).
Exact Cover: http://en.wikipedia.org/wiki/Exact_cover
Partition into triangles: https://noppa.tkk.fi/noppa/kurssi/t-79.5103/viikkoharjoitukset/T-79_5103_solutions_5.pdf
Exact Cover by 4 sets = Given a set S of size 4m and a collection C of 4-element subsets of S, does there exist a subset C' of C, such that each element of S appears precisely once in C'.
Unfortunately, seems like you might have to change some constraints.
Say you have n letters (or students, or whatever), and every week want to partition them into subsets of size k (for a total of n/k subsets every week). This method will generate almost n/k subsets every week - I show below how to extend it to generate exactly n/k subsets.
Generating the Subsets (no partitioning)
First pick p, the largest prime <= n/k.
Let's consider every ordered pair (a,b) such that
0 <= a < k
0 <= b < p
We can map each pairing to one of our letters; thus, we can map p*k <= n letters this way (again, I show below how to map exactly n letters)
(0,0) => 'A'
(0,1) => 'B'
...
(0,p-1) => 'F'
(1,0) => 'G'
(1,1) => 'H'
...
(k-1,p-1) => 's'
Now, given
0 <= w < p
0 <= i < p
We can create a set Sw(i) of our ordered pairs. Each pairing in Sw(i) will represent one letter (according to our mapping above), and the set Sw(i) itself represents one "grouping of letters" aka. one subset of size k.
The formula for Sw(i) is
Sw(i) = {(0,i mod p), (1,(w+i) mod p), (2,(2w+i) mod p),..., ((k-1),((k-1)*w+i) mod p)}
= { (x,y) | 0 <= x < k and y = w*x + i (mod p)}
If we vary w and i over all possible values, we get p2 total sets. When we take any two of these sets, they will have at most one intersecting element.
How it works
Say we have two sets Sw1(i1) and Sw2(i2). If Sw1(i1) and Sw2(i2) have more than one element in common, then there exists at least two x such that
w1*x + i1 = w2*x + i2 (mod p) (w1-w2)*x + (i1-i2) = 0 (mod p)
However, anyone who's taken modular arithmetic knows that if p is prime, either x has a unique solution or (w1 = w2 and i1 = i2); thus, there cannot be more than one x, and Sw1(i1) and Sw2(i2) can have at most one intersecting element.
Analysis
Since p < n/k, by Chebyshev's Theorem (which states there is a prime between x and 2x for x > 3)
n/2k < p <= n/k
Thus, this method generates at least (n/2k)2 subsets of letters, though in practice p will be nearer to n/k, so the number will be nearer to (n/k)2. Since a simple upper bound for the maximum possible such subsets is n(n-1)/(k(k-1)) (see BlueRaja's comment below), this means the algorithm is asymptotically optimal, and will generate near the optimal amount of sets (even in the worst case, it won't generate less than about 1/4th the optimal amount; see again the comment below)
Partitioning
You now want to group the letters into partitions each week: each week, all letters are included in exactly one group.
We do this by letting w be fixed to a certain value (representing the week) and letting i vary from 0 to p-1.
Proof
Consider the groups we created:
Sw(i) = { (x,y) | 0 <= x < k and y = w*x + i (mod p)}
Let's say w is fixed and i varies from 0 to p-1. Then we get p sets:
Sw(0), Sw(1), ..., Sw(p-1)
Now let's say Sw(i1) and Sw(i2) (with i1 =/= i2) intersect; then
w*x + i1 = w*x + i2 (mod p)
for some x, and hence i1 = i2. Thus, Sw(i1) and Sw(i2) don't intersect.
Since no two of our sets intersect, and there are exactly p of them (each with k elements), our sets form a partition of the k*p letters.
Generating n/k Subsets Each Week
The biggest disadvantage of this method is that it generates sets for p*k letters, rather than n letters. If the last letters can't be left out (as in your case, where the letters are students), there are two ways to generate exactly n/k subsets each week:
Find a set of prime numbers p1, p2, p3, ... which sums up to exactly n/k. Then we can treat each group of pik letters as an independent alphabet, so that rather than finding subsets of pk letters, we find one group of subsets for p1*k letters, another group of subsets for p2*k letters, another group...
This has the disadvantage that letters from one group will never be paired with letters from another group, reducing the total number of subsets generated. Luckily, if n is even, by Goldbach's conjecture† you will only need two groups at the most (if n is odd, you will only need three at most)
This method guarantees subsets of size k, but doesn't generate as many subsets.
† Though unproven, it is known to be true for every ridiculously large number you will likely encounter for this problemThe other option is to use the smallest prime p >= n/k. This will give you p*k >= n letters - after the subsets have been generated, simply throw out the extra letters. Thus, in the end this gives you some subsets with size < k. Assuming k divides n evenly (ie. n/k is an integer), you could take the smaller subsets and mix them up by hand to make subsets of size k, but you risk having some overlap with past/future subsets this way.
This method generates at least as many subsets as the original method, but some may have size < k
Example
Take n = 15, k = 3. i.e. there are 15 students and we are making groups of three.
To begin with, we pick largest prime p <= n/k. n/k is prime (lucky us!), so p = 5.
We map the 15 students into the ordered pairs (a,b) described above, giving us (each letter is a student):
(0,0) = A
(0,1) = B
(0,2) = C
(0,3) = D
(0,4) = E
(1,0) = F
(1,1) = G
(1,2) = H
(1,3) = I
(1,4) = J
(2,0) = K
(2,1) = L
(2,2) = M
(2,3) = N
(2,4) = O
The method generates 25 groups of three. Thus, since we need to schedule n/k = 5 groups each week, we can schedule 5 weeks of activities (5 groups a week * 5 weeks = 25 groups).
For week 0, we generate the partition as
S0(i), for i = 0 to 4.
S0(0) = { (0,0), (1,0), (2,0) } = AFK
S0(1) = { (0,1), (1,1), (2,1) } = BGL
S0(2) = { (0,2), (1,2), (2,2) } = CHM
S0(3) = { (0,3), (1,3), (2,3) } = DIN
S0(4) = { (0,4), (1,4), (2,4) } = EJO
For week 4 it will be
S4(i) for i = 0 to 4.
S4(0) = { (0,0), (1, (4*1 + 0) mod 5), (2, (2*4 + 0) mod 5) }
= { (0,0), (1,4), (2,3) }
= AJN
S4(1) = { (0,1), (1, (4*1 + 1) mod 5), (2, (4*2 + 1) mod 5) }
= { (0,1), (1,0), (2,4) }
= BFO
S4(2) = { (0,2), (1, (4*1 + 2) mod 5), (2, (4*2 + 2) mod 5) }
= { (0,2), (1,1), (2,0) }
= CGK
S4(3) = { (0,3), (1, (4*1 + 3) mod 5), (2, (4*2 + 3) mod 5) }
= { (0,3), (1,2), (2,1) }
= DHL
S4(4) = { (0,4), (1, (4*1 + 4) mod 5), (2, (4*2 + 4) mod 5) }
= { (0,4), (1,3), (2,2) }
= EIM
Here's the schedule for all 5 weeks:
Week: 0
S0(0) ={(0,0) (1,0) (2,0) } = AFK
S0(1) ={(0,1) (1,1) (2,1) } = BGL
S0(2) ={(0,2) (1,2) (2,2) } = CHM
S0(3) ={(0,3) (1,3) (2,3) } = DIN
S0(4) ={(0,4) (1,4) (2,4) } = EJO
Week: 1
S1(0) ={(0,0) (1,1) (2,2) } = AGM
S1(1) ={(0,1) (1,2) (2,3) } = BHN
S1(2) ={(0,2) (1,3) (2,4) } = CIO
S1(3) ={(0,3) (1,4) (2,0) } = DJK
S1(4) ={(0,4) (1,0) (2,1) } = EFL
Week: 2
S2(0) ={(0,0) (1,2) (2,4) } = AHO
S2(1) ={(0,1) (1,3) (2,0) } = BIK
S2(2) ={(0,2) (1,4) (2,1) } = CJL
S2(3) ={(0,3) (1,0) (2,2) } = DFM
S2(4) ={(0,4) (1,1) (2,3) } = EGN
Week: 3
S3(0) ={(0,0) (1,3) (2,1) } = AIL
S3(1) ={(0,1) (1,4) (2,2) } = BJM
S3(2) ={(0,2) (1,0) (2,3) } = CFN
S3(3) ={(0,3) (1,1) (2,4) } = DGO
S3(4) ={(0,4) (1,2) (2,0) } = EHK
Week: 4
S4(0) ={(0,0) (1,4) (2,3) } = AJN
S4(1) ={(0,1) (1,0) (2,4) } = BFO
S4(2) ={(0,2) (1,1) (2,0) } = CGK
S4(3) ={(0,3) (1,2) (2,1) } = DHL
S4(4) ={(0,4) (1,3) (2,2) } = EIM
More Practical Example
In your case, n = 1000 students and k = 4 in each group. Thus, we pick p as the largest prime <= (n/k = 1000/4 = 250), so p = 241. Without considering the alterations above under "Generating n/k Subsets Each Week", this method will generate a schedule for 961 students lasting 241 weeks.
(An upper-bound for the maximum number of subsets possible would be 1000*999/(4*3) = 83250, though the actual number is likely less than that. Even so, this method generates 58081 subsets, or about 70% of the theoretical maximum!)
If we use the first method above to generate a schedule for exactly 1000 students, we take p1 = 113, p2 = 137 (so that p1 + p2 = n/k). Thus, we can generate (113)^2 + (137)^2 = 31,538 subsets of students, enough to last 113 weeks.
If we use the second method above to generate a schedule for exactly 1000 students, we take p = 251. This will give us a schedule for 1004 students for 251 weeks; we remove the 4 phantom students from the schedule each week. Usually, this will result in four groups of 3 every week (though unlikely, it is also possible to have for example one group of 2 and two groups of 3). The groups with < 4 students will always have a multiple-of-4 total number of students, so you could manually place those students into groups of 4, at the risk of potentially having two of those students together again later in another group.
Final thoughts
One flaw of this algorithm is that it's not really flexible: if a student drops out, we are forever stuck with a phantom student. Also, there is no way to add new students to the schedule midway through the year (unless we allow for them by initially creating phantom students).
This problem falls under the category of Restricted Set Systems in combinatorics. See this paper for more information, especially Chapters 1 and 2. Since it is a postscript file, you will need gsview or something to view it.
Here is some outline of the algorithm.
- First find all the pairs: AB BC CD DE EF FG GH HI AC BD CE DF EG FH GI AD BE CF DG EH FI AE BF CG DH EI AF BG CH DI AG BH CI AH BI AI
These can be stored in an array of sixe n(n-1)
Now start attempting to combine consecutive pairs using the following rules: a. Two pairs can be combined only when there is a common character. b. The combination is possible when the pair formed by leaving the common character is also available. e.g. if we want to combine AB and AC then we need to check if BC is also available. c. When the above rules are satisfied we combine the two pairs into a triple (e.g. AB and AC merged to form ABC) and mark all the three pairs, such as AB, AC and BC as unavailable.
Continue doing this looking for available pairs in the array and merging them to form triples and marking the pairs unavailable until there are no available pairs or no triples can be formed anymore.
Example: 1. combine AB + AC --> ABC; Mark AB, AC and BC unavailable. 2. combine AD + AE --> AED; Mark AD, AE and DE unavailable. 3. combine AF + AG --> AFG; Mark AF, AG and FG unavailable. ..
Here's an approach which will satisfy the requirements you have stated. Whether it does what you want I don't know.
- On a large sheet of paper draw a regular grid with at least 250 squares.
- Label the sides of the squares with the letters in your alphabet (250 squares x 4 sides == 1000).
- Each square defines one of your subsets. Each shares one side (ie one letter) only with each of its (up to) 4 neighbours. No side (ie letter) is shared by more than 2 squares (subsets).
I'll leave it up to you to turn this into working code, but I don't think that it should be too difficult and it should scale well. Note that it should also work for any other size of subset, you can tile the plane with irregular n-gons for any n, though it might get difficult.
The other approach I thought of is:
- On a large sheet of paper draw N dots, where N is the size of your alphabet. Label each dot with one of the letters of the alphabet.
- Connect any n dots, where n is the size of the required subsets. That's your first subset.
- Choose one of the already-connected dots, connect it to (n-1) more unconnected dots. That's your next subset.
- Repeat step 3 until you are finished.
This requires a bit more book-keeping, and there are a number of corner cases to deal with, but again it shoudln't be too difficult to code up.
EDIT: It's easy to transform my first approach into a form which is more obviously an algorithm on a graph. Model each subset as a node, and each letter in the alphabet as an edge. Construct a graph where each node has degree n (number of elements in the subset) and each letter (edge) is used once.
@khuss
Same method can be generalized. But it is not a linear algorithm and may be exponential.
For example: When subset size is 4, we pick 2 or more pairs with 4 unique characters.
e.g. "AB and CD" or "AB, AC & AD" only if the following conditions are satisfied:
- All the pairs formed by the characters of the 4-tuple are available. e.g. if we want to form ABCD using AB, AC & AD then all the pairs fromed out A, B, C & D i.e. AB, AC, AD, BC, BD, CD all must be available.
- If condition 1 is satisfied then we form ABCD and mark the C(4,2) = 6 pairs as unavailable.
We continue as before until no more 4-tuples can be formed or no more pairs are available.
来源:https://stackoverflow.com/questions/2955318/creating-combinations-that-have-no-more-one-intersecting-element