Algorithm for equiprobable random square binary matrices with two non-adjacent non-zeros in each row and column

遇见更好的自我 2020-12-09 10:22

It would be great if someone could point me towards an algorithm that would allow me to:

  1. create a random square matrix, with entries 0 and 1, such that
  2. <
5 answers
  •  执笔经年
    2020-12-09 11:04

    Intro

    Here is a prototype approach that tackles the more general task of uniform combinatorial sampling. For our purposes this means: the approach works for anything we can formulate as a SAT problem.

    It does not exploit your problem directly and takes a heavy detour. This detour through SAT can help with theory (more powerful general theoretical results) and with efficiency (SAT solvers).

    That being said, it is not an approach for sampling within seconds or less (in my experiments), at least if you care about uniformity.

    Theory

    The approach, based on results from complexity-theory, follows this work:

    GOMES, Carla P.; SABHARWAL, Ashish; SELMAN, Bart. Near-uniform sampling of combinatorial spaces using XOR constraints. In: Advances in Neural Information Processing Systems. 2007. pp. 481-488.

    The basic idea:

    • formulate the problem as a SAT problem
    • add randomly generated XORs to the problem (acting on the decision variables only! that's important in practice)
      • this reduces the number of solutions (some solutions become infeasible)
      • do that in a loop (with tuned parameters) until exactly one solution is left!
        • the search for a solution is done by SAT solvers or #SAT solvers (= model counting)
        • if more than one solution is left: no further XORs are added; instead a complete restart is done, adding fresh random XORs to the original problem
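The loop above can be sketched on a toy problem where the solution set is small enough to enumerate, so no SAT solver is needed. This is only an illustration of the restart logic, not the XorSample scripts: `random_xor`, `satisfies`, `xor_sample` and the toy problem are hypothetical names and choices, and a random parity bit per XOR is assumed.

```python
import itertools
import random

def random_xor(n_vars, min_len, max_len, rng):
    """Pick a random parity constraint: a variable subset plus a target parity bit."""
    size = rng.randint(min_len, max_len)
    variables = rng.sample(range(n_vars), size)
    parity = rng.randint(0, 1)
    return variables, parity

def satisfies(assignment, xor):
    """True if the XOR of the selected variables equals the target parity."""
    variables, parity = xor
    return sum(assignment[v] for v in variables) % 2 == parity

def xor_sample(solutions, n_vars, n_xors, min_len, max_len, rng, max_tries=10000):
    """Restart with fresh XORs until exactly one solution survives."""
    for _ in range(max_tries):
        xors = [random_xor(n_vars, min_len, max_len, rng) for _ in range(n_xors)]
        survivors = [s for s in solutions if all(satisfies(s, x) for x in xors)]
        if len(survivors) == 1:
            return survivors[0]
        # zero or several survivors: throw the XORs away and start over
    return None

# Toy "problem": all 6-bit assignments with exactly two ones (15 solutions).
n_vars = 6
solutions = [bits for bits in itertools.product((0, 1), repeat=n_vars)
             if sum(bits) == 2]
rng = random.Random(0)
sample = xor_sample(solutions, n_vars, n_xors=4, min_len=3, max_len=3, rng=rng)
print(sample)
```

In the real setting the list comprehension over `solutions` is replaced by two solver calls: one to find a solution and one to check that no second solution exists.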

    The guarantees:

    • when tuning the parameters right, this approach achieves near-uniform sampling
      • this tuning can be costly, as it's based on approximating the number of possible solutions
      • empirically this can also be costly!
      • Ante's answer, mentioning the number sequence A001499 actually gives a nice upper bound on the solution-space (as it's just ignoring adjacency-constraints!)

    The drawbacks:

    • inefficient for large problems (in general; not necessarily compared to the alternatives like MCMC and co.)
      • need to change / reduce parameters to produce samples
      • those reduced parameters lose the theoretical guarantees
      • but empirically: good results are still possible!

    Parameters:

    In practice, the parameters are:

    • N: number of xors added
    • L: minimum number of variables part of one xor-constraint
    • U: maximum number of variables part of one xor-constraint

    N is important to reduce the number of possible solutions. Given N constant, the other variables of course also have some effect on that.
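A rough way to see why N matters, assuming each XOR carries an independent, uniformly random parity bit (as in the cited paper; that the scripts used here behave the same way is an assumption): any fixed solution satisfies one such XOR with probability 1/2. With M solutions before adding XORs, linearity of expectation gives:

```latex
\mathbb{E}[\#\text{surviving solutions}] = M \cdot 2^{-N}
\qquad\Longrightarrow\qquad
N \approx \log_2 M \ \text{ leaves about one survivor on average.}
```

This only describes the expectation. How concentrated the number of survivors is depends on the XOR lengths, which is where L and U come in.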

    Theory says (if I interpret it correctly) that we should use L = U = 0.5 * #dec-vars.

    This is impossible in practice here, as xor-constraints hurt SAT-solvers a lot!

    Here are some more scientific slides about the impact of L and U.

    They call XORs of size 8-20 short XORs, while we will need to use even shorter ones later!

    Implementation

    Final version

    Here is a pretty hacky implementation in Python, using the XorSample scripts from here.

    The underlying SAT-solver in use is Cryptominisat.

    The code basically boils down to:

    • Transform the problem to conjunctive normal-form
      • as DIMACS-CNF
    • Implement the sampling-approach:
      • Calls XorSample (pipe-based + file-based)
      • Call SAT-solver (file-based)
    • Add samples to some file for later analysis
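For readers who have not seen DIMACS-CNF: it is a plain-text format with a header line `p cnf <#vars> <#clauses>` followed by one clause per line, each a list of signed variable indices terminated by `0`. A minimal sketch of a writer (the helper `to_dimacs` is a hypothetical name, not part of the code below):

```python
def to_dimacs(clauses, n_vars):
    """Serialize a list of clauses (lists of signed ints) as DIMACS-CNF text."""
    lines = ['p cnf {} {}'.format(n_vars, len(clauses))]
    for clause in clauses:
        # a negative literal -3 means "variable 3 is false"; 0 ends the clause
        lines.append(' '.join(str(lit) for lit in clause) + ' 0')
    return '\n'.join(lines) + '\n'

# (not x1 or not x2) and (x1 or x2 or x3)
clauses = [[-1, -2], [1, 2, 3]]
text = to_dimacs(clauses, 3)
print(text)
```

The code below builds the same kind of text by string concatenation and keeps the clause count in a separate counter.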

    Code: (I hope I already warned you about the code quality)

    from itertools import count
    from time import time
    import subprocess
    import numpy as np
    import os
    import shelve
    import uuid
    import pickle
    from random import SystemRandom
    cryptogen = SystemRandom()
    
    """ Helper functions """
    # K-ARY CONSTRAINT GENERATION
    # ###########################
    # SINZ, Carsten. Towards an optimal CNF encoding of boolean cardinality constraints.
    # CP, 2005, 3709. Jg., S. 827-831.
    
    def next_var_index(start):
        next_var = start
        while(True):
            yield next_var
            next_var += 1
    
    class s_index():
        def __init__(self, start_index):
            self.firstEnvVar = start_index
    
        def next(self,i,j,k):
            return self.firstEnvVar + i*k +j
    
    def gen_seq_circuit(k, input_indices, next_var_index_gen):
        cnf_string = ''
        s_index_gen = s_index(next(next_var_index_gen))  # builtin next() works on Python 2.6+ and 3
    
        # write clauses of first partial sum (i.e. i=0)
        cnf_string += (str(-input_indices[0]) + ' ' + str(s_index_gen.next(0,0,k)) + ' 0\n')
        for i in range(1, k):
            cnf_string += (str(-s_index_gen.next(0, i, k)) + ' 0\n')
    
        # write clauses for general case (i.e. 0 < i < n-1)
        for i in range(1, len(input_indices)-1):
            cnf_string += (str(-input_indices[i]) + ' ' + str(s_index_gen.next(i, 0, k)) + ' 0\n')
            cnf_string += (str(-s_index_gen.next(i-1, 0, k)) + ' ' + str(s_index_gen.next(i, 0, k)) + ' 0\n')
            for u in range(1, k):
                cnf_string += (str(-input_indices[i]) + ' ' + str(-s_index_gen.next(i-1, u-1, k)) + ' ' + str(s_index_gen.next(i, u, k)) + ' 0\n')
                cnf_string += (str(-s_index_gen.next(i-1, u, k)) + ' ' + str(s_index_gen.next(i, u, k)) + ' 0\n')
            cnf_string += (str(-input_indices[i]) + ' ' + str(-s_index_gen.next(i-1, k-1, k)) + ' 0\n')
    
        # last clause for last variable
        cnf_string += (str(-input_indices[-1]) + ' ' + str(-s_index_gen.next(len(input_indices)-2, k-1, k)) + ' 0\n')
    
        return (cnf_string, (len(input_indices)-1)*k, 2*len(input_indices)*k + len(input_indices) - 3*k - 1)
    
    # K=2 clause GENERATION
    # #####################
    def gen_at_most_2_constraints(vars, start_var):
        constraint_string = ''
        used_clauses = 0
        used_vars = 0
        index_gen = next_var_index(start_var)
        circuit = gen_seq_circuit(2, vars, index_gen)
        constraint_string += circuit[0]
        used_clauses += circuit[2]
        used_vars += circuit[1]
        start_var += circuit[1]
    
        return [constraint_string, used_clauses, used_vars, start_var]
    
    def gen_at_least_2_constraints(vars, start_var):
        k = len(vars) - 2
        vars = [-var for var in vars]
    
        constraint_string = ''
        used_clauses = 0
        used_vars = 0
        index_gen = next_var_index(start_var)
        circuit = gen_seq_circuit(k, vars, index_gen)
        constraint_string += circuit[0]
        used_clauses += circuit[2]
        used_vars += circuit[1]
        start_var += circuit[1]
    
        return [constraint_string, used_clauses, used_vars, start_var]
    
    # Adjacency conflicts
    # ###################
    def get_all_adjacency_conflicts_4_neighborhood(N, X):
        conflicts = set()
        for x in range(N):
            for y in range(N):
                if x < (N-1):
                    conflicts.add(((x,y),(x+1,y)))
                if y < (N-1):
                    conflicts.add(((x,y),(x,y+1)))
    
        cnf = ''  # slow string appends
        for (var_a, var_b) in conflicts:
            var_a_ = X[var_a]
            var_b_ = X[var_b]
            cnf += '-' + var_a_ + ' ' + '-' + var_b_ + ' 0 \n'
    
        return cnf, len(conflicts)
    
    # Build SAT-CNF
    # #############
    def build_cnf(N, verbose=False):
        var_counter = count(1)
        N_CLAUSES = 0
        X = np.zeros((N, N), dtype=object)
        for a in range(N):
            for b in range(N):
                X[a,b] = str(next(var_counter))
    
        # Adjacency constraints
        CNF, N_CLAUSES = get_all_adjacency_conflicts_4_neighborhood(N, X)
    
        # k=2 constraints
        NEXT_VAR = N*N+1
    
        for row in range(N):
            constraint_string, used_clauses, used_vars, NEXT_VAR = gen_at_most_2_constraints(X[row, :].astype(int).tolist(), NEXT_VAR)
            N_CLAUSES += used_clauses
            CNF += constraint_string
    
            constraint_string, used_clauses, used_vars, NEXT_VAR = gen_at_least_2_constraints(X[row, :].astype(int).tolist(), NEXT_VAR)
            N_CLAUSES += used_clauses
            CNF += constraint_string
    
        for col in range(N):
            constraint_string, used_clauses, used_vars, NEXT_VAR = gen_at_most_2_constraints(X[:, col].astype(int).tolist(), NEXT_VAR)
            N_CLAUSES += used_clauses
            CNF += constraint_string
    
            constraint_string, used_clauses, used_vars, NEXT_VAR = gen_at_least_2_constraints(X[:, col].astype(int).tolist(), NEXT_VAR)
            N_CLAUSES += used_clauses
            CNF += constraint_string
    
        # build final cnf
        CNF = 'p cnf ' + str(NEXT_VAR-1) + ' ' + str(N_CLAUSES) + '\n' + CNF
    
        return X, CNF, NEXT_VAR-1
    
    
    # External tools
    # ##############
    def get_random_xor_problem(CNF_IN_fp, N_DEC_VARS, N_ALL_VARS, s, min_l, max_l):
        # .cnf not part of arg!
        p = subprocess.Popen(['./gen-wff', CNF_IN_fp,
                              str(N_DEC_VARS), str(N_ALL_VARS),
                              str(s), str(min_l), str(max_l), 'xored'],
                              stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        result = p.communicate()
    
        os.remove(CNF_IN_fp + '-str-xored.xor')  # file not needed
        return CNF_IN_fp + '-str-xored.cnf'
    
    def solve(CNF_IN_fp, N_DEC_VARS):
        seed = cryptogen.randint(0, 2147483647)  # actually no reason to do it; but can't hurt either
        p = subprocess.Popen(["./cryptominisat5", '-t', '4', '-r', str(seed), CNF_IN_fp], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        result = p.communicate()[0].decode()  # bytes -> str, so find()/split() also work on Python 3
    
        sat_line = result.find('s SATISFIABLE')
    
        if sat_line != -1:
            # solution found!
            vars = parse_solution(result)[:N_DEC_VARS]
    
            # forbid solution (DeMorgan)
            negated_vars = list(map(lambda x: x*(-1), vars))
            with open(CNF_IN_fp, 'a') as f:
                f.write( (str(negated_vars)[1:-1] + ' 0\n').replace(',', ''))
    
            # assume solve is treating last constraint despite not changing header!
            # solve again
    
            seed = cryptogen.randint(0, 2147483647)
            p = subprocess.Popen(["./cryptominisat5", '-t', '4', '-r', str(seed), CNF_IN_fp], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
            result = p.communicate()[0].decode()
            sat_line = result.find('s SATISFIABLE')
            if sat_line != -1:
                os.remove(CNF_IN_fp)  # not needed anymore
                return True, False, None
            else:
                return True, True, vars
        else:
            return False, False, None
    
    def parse_solution(output):
        # assumes there is one
        vars = []
        for line in output.split("\n"):
            if line:
                if line[0] == 'v':
                    line_vars = list(map(lambda x: int(x), line.split()[1:]))
                    vars.extend(line_vars)
        return vars
    
    # Core-algorithm
    # ##############
    def xorsample(X, CNF_IN_fp, N_DEC_VARS, N_VARS, s, min_l, max_l):
        start_time = time()
        while True:
            # add s random XOR constraints to F
            xored_cnf_fp = get_random_xor_problem(CNF_IN_fp, N_DEC_VARS, N_VARS, s, min_l, max_l)
            state_lvl1, state_lvl2, var_sol = solve(xored_cnf_fp, N_DEC_VARS)
    
            print('------------')
    
            if state_lvl1 and state_lvl2:
                print('FOUND')
    
                d = shelve.open('N_15_70_4_6_TO_PLOT')
                d[str(uuid.uuid4())] = (pickle.dumps(var_sol), time() - start_time)
                d.close()
    
                return True
    
            else:
                if state_lvl1:
                    print('sol not unique')
                else:
                    print('no sol found')
    
            print('------------')
    
    """ Run """
    N = 15
    N_DEC_VARS = N*N
    X, CNF, N_VARS = build_cnf(N)
    
    with open('my_problem.cnf', 'w') as f:
        f.write(CNF)
    
    counter = 0
    while True:
        print('sample: ', counter)
        xorsample(X, 'my_problem', N_DEC_VARS, N_VARS, 70, 4, 6)
        counter += 1
    

    Output will look like (removed some warnings):

    ------------
    no sol found
    ------------
    ------------
    no sol found
    ------------
    ------------
    no sol found
    ------------
    ------------
    sol not unique
    ------------
    ------------
    FOUND
    

    Core: CNF-formulation

    We introduce one variable for every cell of the matrix. N=20 means 400 binary-variables.

    Adjacency:

    Precalculate all symmetry-reduced conflicts and add conflict-clauses.

    Basic theory:

    a -> !b
    <->
    !a v !b (propositional logic)
    

    Row/Col-wise Cardinality:

    This is tough to express in CNF and naive approaches need an exponential number of constraints.

    We use a circuit-based encoding, the sequential counter from (SINZ, Carsten. Towards an optimal CNF encoding of boolean cardinality constraints), which introduces new auxiliary variables.

    Remark:

    sum(var_set) <= k
    <->
    sum(negated(var_set)) >= len(var_set) - k
    

    These SAT encodings can be fed into exact model counters (for small N; e.g. < 9). The number of solutions equals Ante's results, which is a strong indication of a correct transformation!

    There are also interesting approximate model counters (also heavily based on XOR constraints) like approxMC, which shows one more thing we can do with the SAT formulation. But in practice I have not been able to use these (approxMC = autoconf; no comment).
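For very small N the count can also be cross-checked without any SAT tooling by plain enumeration. The sketch below assumes the same rules as the encoding above: exactly two ones per row and per column, and no two ones adjacent in the 4-neighborhood. `valid_rows` and `count_matrices` are hypothetical helper names.

```python
from itertools import combinations, product

def valid_rows(n):
    """All rows with exactly two ones that are not horizontally adjacent,
    represented by the two column indices holding a one."""
    return [(a, b) for a, b in combinations(range(n), 2) if b - a > 1]

def count_matrices(n):
    """Count n x n matrices built from valid rows that also satisfy
    the column sums and have no vertically adjacent ones."""
    rows = valid_rows(n)
    total = 0
    for choice in product(rows, repeat=n):
        # vertical adjacency: consecutive rows must not share a column
        if any(set(choice[i]) & set(choice[i + 1]) for i in range(n - 1)):
            continue
        col_sums = [0] * n
        for a, b in choice:
            col_sums[a] += 1
            col_sums[b] += 1
        if all(s == 2 for s in col_sums):
            total += 1
    return total

for n in (3, 4, 5):
    print(n, count_matrices(n))
```

The enumeration grows like (number of valid rows)^N, so this is only useful as a sanity check for tiny N.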

    Other experiments

    I also built a version using pblib in order to use more powerful cardinality formulations for the SAT-CNF formulation. I did not try the C++-based API, only the reduced pbencoder, which automatically selects some best encoding. This was way worse than the encoding used above (which encoding is best is still a research problem; often even redundant constraints can help).

    Empirical analysis

    For the sake of obtaining some sample size (given my patience), I only computed samples for N=15. In this case we used:

    • N=70 xors
    • L,U = 4,6

    I also computed some samples for N=20 with (100,3,6), but this takes a few mins and we reduced the lower bound!

    Visualization

    Here is an animation (strengthening my love-hate relationship with matplotlib):

    Edit: And a (reduced) comparison to brute-force uniform-sampling with N=5 (NXOR,L,U = 4, 10, 30):

    (I have not yet decided on the addition of the plotting-code. It's as ugly as the above one and people might look too much into my statistical shambles; normalizations and co.)

    Theory

    Statistical analysis is probably hard to do, as the underlying problem is of such a combinatorial nature. It is not even entirely obvious what the final cell-PDF should look like. For odd N, it is probably non-uniform and looks like a chess-board (I brute-force checked N=5 to observe this).

    One thing we can be sure about (imho): symmetry!

    Given a cell-PDF matrix, we should expect, that the matrix is symmetric (A = A.T). This is checked in the visualization and the euclidean-norm of differences over time is plotted.
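A minimal sketch of that check, reading "euclidean-norm of differences" as the Frobenius norm of A - A.T (an assumption; the plotting code is not shown). `cell_frequencies` and `asymmetry` are hypothetical helper names, and the 3x3 matrix is just a toy input, not a valid sample of the problem.

```python
import math

def cell_frequencies(samples):
    """Average a list of equally sized 0/1 matrices into a cell-frequency matrix."""
    n = len(samples[0])
    freq = [[0.0] * n for _ in range(n)]
    for m in samples:
        for i in range(n):
            for j in range(n):
                freq[i][j] += m[i][j]
    return [[v / len(samples) for v in row] for row in freq]

def asymmetry(freq):
    """Frobenius norm of (A - A.T); 0 means perfectly symmetric."""
    n = len(freq)
    return math.sqrt(sum((freq[i][j] - freq[j][i]) ** 2
                         for i in range(n) for j in range(n)))

m = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]
m_t = [list(col) for col in zip(*m)]  # transpose of m

print(asymmetry(cell_frequencies([m])))       # one-sided sample set
print(asymmetry(cell_frequencies([m, m_t])))  # balanced sample set
```

Tracking this value while samples accumulate gives the kind of curve described above: it should drift towards 0 if the sampler has no bias between a matrix and its transpose.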

    We can do the same on some other observation: observed pairings.

    For N=3, we can observe the following pairs:

    • 0,1
    • 0,2
    • 1,2

    Now we can do this per-row and per-column and should expect symmetry too!
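One possible reading of this check, sketched below: count how often each pair of positions occurs within rows and within columns over all samples, then compare the two histograms. `pair_counts` is a hypothetical helper; the example matrix is one of the two valid 4x4 matrices under the 4-neighborhood rule.

```python
from collections import Counter

def pair_counts(samples):
    """Count the positions of the ones per row and per column over all samples."""
    row_pairs, col_pairs = Counter(), Counter()
    for m in samples:
        n = len(m)
        for i in range(n):
            row_pairs[tuple(j for j in range(n) if m[i][j])] += 1
            col_pairs[tuple(j for j in range(n) if m[j][i])] += 1
    return row_pairs, col_pairs

a = [[1, 0, 1, 0],
     [0, 1, 0, 1],
     [1, 0, 1, 0],
     [0, 1, 0, 1]]

row_pairs, col_pairs = pair_counts([a])
print(row_pairs)
print(col_pairs)
```

With many samples, the difference between the two counters (for example a norm over all pairs) plays the same role as the transpose check on the cell-PDF.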

    Sadly, it's probably not easy to say something about the variance and therefore the needed samples to speak about confidence!

    Observation

    According to my simplified perception, the current samples and the cell-PDF look good, although convergence has not been achieved yet (or we are far away from uniformity).

    The more important aspects are probably the two norms, nicely decreasing towards 0. (Yes, one could tune some algorithm for that by transposing with prob=0.5, but this is not done here as it would defeat its purpose.)

    Potential next steps

    • Tune parameters
    • Check out the approach using #SAT-solvers / Model-counters instead of SAT-solvers
    • Try different CNF-formulations, especially in regards to cardinality-encodings and xor-encodings
      • XorSample by default uses a Tseitin-like encoding to get around exponential growth
        • for smaller XORs (as used here) it might be a good idea to use the naive encoding (which propagates faster)
          • XorSample supports that in theory, but the scripts work differently in practice
          • Cryptominisat is known for dedicated XOR handling (as it was built for analyzing cryptography, which involves many XORs) and might gain something from the naive encoding (as inferring XORs from blown-up CNFs is much harder)
    • More statistical-analysis
    • Get rid of XorSample scripts (shell + perl...)
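To make the "naive encoding" concrete: an XOR over k variables can be written without auxiliary variables by forbidding every assignment with the wrong parity, which costs 2^(k-1) clauses of length k. That is acceptable for the short XORs used here and explodes for long ones. A sketch (the helper `xor_to_cnf` is a hypothetical name, not part of XorSample):

```python
from itertools import product

def xor_to_cnf(variables, parity):
    """Naive CNF for x1 ^ ... ^ xk == parity: one clause per forbidden assignment."""
    clauses = []
    for bits in product((0, 1), repeat=len(variables)):
        if sum(bits) % 2 != parity:
            # this clause is false exactly on the forbidden assignment `bits`
            clauses.append([-v if b else v for v, b in zip(variables, bits)])
    return clauses

def clause_holds(clause, assignment):
    """assignment maps variable index -> 0/1."""
    return any((assignment[abs(lit)] == 1) == (lit > 0) for lit in clause)

cnf = xor_to_cnf([1, 2, 3], 1)
print(len(cnf), cnf)
```

A Tseitin-like encoding instead chains small XORs through auxiliary variables, which keeps the clause count linear in k but makes the XOR structure harder for the solver to recover.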

    Summary

    • The approach is very general
    • This code produces feasible samples
    • It should not be hard to prove that every feasible solution can be sampled
    • Others have proven theoretical guarantees for uniformity for some params
      • does not hold for our params
    • Others have empirically / theoretically analyzed smaller parameters (in use here)
