Algorithm for equiprobable random square binary matrices with two non-adjacent non-zeros in each row and column

后端未结

关注

 5  1281

遇见更好的自我 2020-12-09 10:22

It would be great if someone could point me towards an algorithm that would allow me to :

create a random square matrix, with entries 0 and 1, such that

5条回答

执笔经年 (楼主)

2020-12-09 11:04

Intro

Here is some prototype-approach, trying to solve the more general task of uniform combinatorial sampling, which for our approach here means: we can use this approach for everything which we can formulate as SAT-problem.

It's not exploiting your problem directly and takes a heavy detour. This detour to the SAT-problem can help in regards to theory (more powerful general theoretical results) and efficiency (SAT-solvers).

That being said, it's not an approach if you want to sample within seconds or less (in my experiments), at least while being concerned about uniformity.

Theory

The approach, based on results from complexity-theory, follows this work:

GOMES, Carla P.; SABHARWAL, Ashish; SELMAN, Bart. Near-uniform sampling of combinatorial spaces using XOR constraints. In: Advances In Neural Information Processing Systems. 2007. S. 481-488.

The basic idea:

formulate the problem as SAT-problem
add randomly generated xors to the problem (acting on the decision-variables only! that's important in practice)
- this will reduce the number of solutions (some solutions will get impossible)
- do that in a loop (with tuned parameters) until only one solution is left!
  - search for some solution is being done by SAT-solvers or #SAT-solvers (=model-counting)
  - if there is more than one solution: no xors will be added but a complete restart will be done: add random-xors to the start-problem!

The guarantees:

when tuning the parameters right, this approach achieves near-uniform sampling
- this tuning can be costly, as it's based on approximating the number of possible solutions
- empirically this can also be costly!
- Ante's answer, mentioning the number sequence A001499 actually gives a nice upper bound on the solution-space (as it's just ignoring adjacency-constraints!)

The drawbacks:

inefficient for large problems (in general; not necessarily compared to the alternatives like MCMC and co.)
- need to change / reduce parameters to produce samples
- those reduced parameters lose the theoretical guarantees
- but empirically: good results are still possible!

Parameters:

In practice, the parameters are:

N: number of xors added
L: minimum number of variables part of one xor-constraint
U: maximum number of variables part of one xor-constraint

N is important to reduce the number of possible solutions. Given N constant, the other variables of course also have some effect on that.

Theory says (if i interpret correctly), that we should use L = R = 0.5 * #dec-vars.

This is impossible in practice here, as xor-constraints hurt SAT-solvers a lot!

Here some more scientific slides about the impact of L and U.

They call xors of size 8-20 short-XORS, while we will need to use even shorter ones later!

Implementation

Final version

Here is a pretty hacky implementation in python, using the XorSample scripts from here.

The underlying SAT-solver in use is Cryptominisat.

The code basically boils down to:

Transform the problem to conjunctive normal-form
- as DIMACS-CNF
Implement the sampling-approach:
- Calls XorSample (pipe-based + file-based)
- Call SAT-solver (file-based)
Add samples to some file for later analysis

Code: (i hope i did warn you already about the code-quality)

from itertools import count
from time import time
import subprocess
import numpy as np
import os
import shelve
import uuid
import pickle
from random import SystemRandom
cryptogen = SystemRandom()

""" Helper functions """
# K-ARY CONSTRAINT GENERATION
# ###########################
# SINZ, Carsten. Towards an optimal CNF encoding of boolean cardinality constraints.
# CP, 2005, 3709. Jg., S. 827-831.

def next_var_index(start):
    next_var = start
    while(True):
        yield next_var
        next_var += 1

class s_index():
    def __init__(self, start_index):
        self.firstEnvVar = start_index

    def next(self,i,j,k):
        return self.firstEnvVar + i*k +j

def gen_seq_circuit(k, input_indices, next_var_index_gen):
    cnf_string = ''
    s_index_gen = s_index(next_var_index_gen.next())

    # write clauses of first partial sum (i.e. i=0)
    cnf_string += (str(-input_indices[0]) + ' ' + str(s_index_gen.next(0,0,k)) + ' 0\n')
    for i in range(1, k):
        cnf_string += (str(-s_index_gen.next(0, i, k)) + ' 0\n')

    # write clauses for general case (i.e. 0 < i < n-1)
    for i in range(1, len(input_indices)-1):
        cnf_string += (str(-input_indices[i]) + ' ' + str(s_index_gen.next(i, 0, k)) + ' 0\n')
        cnf_string += (str(-s_index_gen.next(i-1, 0, k)) + ' ' + str(s_index_gen.next(i, 0, k)) + ' 0\n')
        for u in range(1, k):
            cnf_string += (str(-input_indices[i]) + ' ' + str(-s_index_gen.next(i-1, u-1, k)) + ' ' + str(s_index_gen.next(i, u, k)) + ' 0\n')
            cnf_string += (str(-s_index_gen.next(i-1, u, k)) + ' ' + str(s_index_gen.next(i, u, k)) + ' 0\n')
        cnf_string += (str(-input_indices[i]) + ' ' + str(-s_index_gen.next(i-1, k-1, k)) + ' 0\n')

    # last clause for last variable
    cnf_string += (str(-input_indices[-1]) + ' ' + str(-s_index_gen.next(len(input_indices)-2, k-1, k)) + ' 0\n')

    return (cnf_string, (len(input_indices)-1)*k, 2*len(input_indices)*k + len(input_indices) - 3*k - 1)

# K=2 clause GENERATION
# #####################
def gen_at_most_2_constraints(vars, start_var):
    constraint_string = ''
    used_clauses = 0
    used_vars = 0
    index_gen = next_var_index(start_var)
    circuit = gen_seq_circuit(2, vars, index_gen)
    constraint_string += circuit[0]
    used_clauses += circuit[2]
    used_vars += circuit[1]
    start_var += circuit[1]

    return [constraint_string, used_clauses, used_vars, start_var]

def gen_at_least_2_constraints(vars, start_var):
    k = len(vars) - 2
    vars = [-var for var in vars]

    constraint_string = ''
    used_clauses = 0
    used_vars = 0
    index_gen = next_var_index(start_var)
    circuit = gen_seq_circuit(k, vars, index_gen)
    constraint_string += circuit[0]
    used_clauses += circuit[2]
    used_vars += circuit[1]
    start_var += circuit[1]

    return [constraint_string, used_clauses, used_vars, start_var]

# Adjacency conflicts
# ###################
def get_all_adjacency_conflicts_4_neighborhood(N, X):
    conflicts = set()
    for x in range(N):
        for y in range(N):
            if x < (N-1):
                conflicts.add(((x,y),(x+1,y)))
            if y < (N-1):
                conflicts.add(((x,y),(x,y+1)))

    cnf = ''  # slow string appends
    for (var_a, var_b) in conflicts:
        var_a_ = X[var_a]
        var_b_ = X[var_b]
        cnf += '-' + var_a_ + ' ' + '-' + var_b_ + ' 0 \n'

    return cnf, len(conflicts)

# Build SAT-CNF
  #############
def build_cnf(N, verbose=False):
    var_counter = count(1)
    N_CLAUSES = 0
    X = np.zeros((N, N), dtype=object)
    for a in range(N):
        for b in range(N):
            X[a,b] = str(next(var_counter))

    # Adjacency constraints
    CNF, N_CLAUSES = get_all_adjacency_conflicts_4_neighborhood(N, X)

    # k=2 constraints
    NEXT_VAR = N*N+1

    for row in range(N):
        constraint_string, used_clauses, used_vars, NEXT_VAR = gen_at_most_2_constraints(X[row, :].astype(int).tolist(), NEXT_VAR)
        N_CLAUSES += used_clauses
        CNF += constraint_string

        constraint_string, used_clauses, used_vars, NEXT_VAR = gen_at_least_2_constraints(X[row, :].astype(int).tolist(), NEXT_VAR)
        N_CLAUSES += used_clauses
        CNF += constraint_string

    for col in range(N):
        constraint_string, used_clauses, used_vars, NEXT_VAR = gen_at_most_2_constraints(X[:, col].astype(int).tolist(), NEXT_VAR)
        N_CLAUSES += used_clauses
        CNF += constraint_string

        constraint_string, used_clauses, used_vars, NEXT_VAR = gen_at_least_2_constraints(X[:, col].astype(int).tolist(), NEXT_VAR)
        N_CLAUSES += used_clauses
        CNF += constraint_string

    # build final cnf
    CNF = 'p cnf ' + str(NEXT_VAR-1) + ' ' + str(N_CLAUSES) + '\n' + CNF

    return X, CNF, NEXT_VAR-1


# External tools
# ##############
def get_random_xor_problem(CNF_IN_fp, N_DEC_VARS, N_ALL_VARS, s, min_l, max_l):
    # .cnf not part of arg!
    p = subprocess.Popen(['./gen-wff', CNF_IN_fp,
                          str(N_DEC_VARS), str(N_ALL_VARS),
                          str(s), str(min_l), str(max_l), 'xored'],
                          stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    result = p.communicate()

    os.remove(CNF_IN_fp + '-str-xored.xor')  # file not needed
    return CNF_IN_fp + '-str-xored.cnf'

def solve(CNF_IN_fp, N_DEC_VARS):
    seed = cryptogen.randint(0, 2147483647)  # actually no reason to do it; but can't hurt either
    p = subprocess.Popen(["./cryptominisat5", '-t', '4', '-r', str(seed), CNF_IN_fp], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    result = p.communicate()[0]

    sat_line = result.find('s SATISFIABLE')

    if sat_line != -1:
        # solution found!
        vars = parse_solution(result)[:N_DEC_VARS]

        # forbid solution (DeMorgan)
        negated_vars = list(map(lambda x: x*(-1), vars))
        with open(CNF_IN_fp, 'a') as f:
            f.write( (str(negated_vars)[1:-1] + ' 0\n').replace(',', ''))

        # assume solve is treating last constraint despite not changing header!
        # solve again

        seed = cryptogen.randint(0, 2147483647)
        p = subprocess.Popen(["./cryptominisat5", '-t', '4', '-r', str(seed), CNF_IN_fp], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        result = p.communicate()[0]
        sat_line = result.find('s SATISFIABLE')
        if sat_line != -1:
            os.remove(CNF_IN_fp)  # not needed anymore
            return True, False, None
        else:
            return True, True, vars
    else:
        return False, False, None

def parse_solution(output):
    # assumes there is one
    vars = []
    for line in output.split("\n"):
        if line:
            if line[0] == 'v':
                line_vars = list(map(lambda x: int(x), line.split()[1:]))
                vars.extend(line_vars)
    return vars

# Core-algorithm
# ##############
def xorsample(X, CNF_IN_fp, N_DEC_VARS, N_VARS, s, min_l, max_l):
    start_time = time()
    while True:
        # add s random XOR constraints to F
        xored_cnf_fp = get_random_xor_problem(CNF_IN_fp, N_DEC_VARS, N_VARS, s, min_l, max_l)
        state_lvl1, state_lvl2, var_sol = solve(xored_cnf_fp, N_DEC_VARS)

        print('------------')

        if state_lvl1 and state_lvl2:
            print('FOUND')

            d = shelve.open('N_15_70_4_6_TO_PLOT')
            d[str(uuid.uuid4())] = (pickle.dumps(var_sol), time() - start_time)
            d.close()

            return True

        else:
            if state_lvl1:
                print('sol not unique')
            else:
                print('no sol found')

        print('------------')

""" Run """
N = 15
N_DEC_VARS = N*N
X, CNF, N_VARS = build_cnf(N)

with open('my_problem.cnf', 'w') as f:
    f.write(CNF)

counter = 0
while True:
    print('sample: ', counter)
    xorsample(X, 'my_problem', N_DEC_VARS, N_VARS, 70, 4, 6)
    counter += 1

Output will look like (removed some warnings):

------------
no sol found
------------
------------
no sol found
------------
------------
no sol found
------------
------------
sol not unique
------------
------------
FOUND

Core: CNF-formulation

We introduce one variable for every cell of the matrix. N=20 means 400 binary-variables.

Adjancency:

Precalculate all symmetry-reduced conflicts and add conflict-clauses.

Basic theory:

a -> !b
<->
!a v !b (propositional logic)

Row/Col-wise Cardinality:

This is tough to express in CNF and naive approaches need an exponential number of constraints.

We use some adder-circuit based encoding (SINZ, Carsten. Towards an optimal CNF encoding of boolean cardinality constraints) which introduces new auxiliary-variables.

Remark:

sum(var_set) <= k
<->
sum(negated(var_set)) >= len(var_set) - k

These SAT-encodings can be put into exact model-counters (for small N; e.g. < 9). The number of solutions equals Ante's results, which is a strong indication for a correct transformation!

There are also interesting approximate model-counters (also heavily based on xor-constraints) like approxMC which shows one more thing we can do with the SAT-formulation. But in practice i have not been able to use these (approxMC = autoconf; no comment).

Other experiments

I did also build a version using pblib, to use more powerful cardinality-formulations for the SAT-CNF formulation. I did not try to use the C++-based API, but only the reduced pbencoder, which automatically selects some best encoding, which was way worse than my encoding used above (which is best is still a research-problem; often even redundant-constraints can help).

Empirical analysis

For the sake of obtaining some sample-size (given my patience), i only computed samples for N=15. In this case we used:

N=70 xors
L,U = 4,6

I also computed some samples for N=20 with (100,3,6), but this takes a few mins and we reduced the lower bound!

Visualization

Here some animation (strengthening my love-hate relationship with matplotlib):

Edit: And a (reduced) comparison to brute-force uniform-sampling with N=5 (NXOR,L,U = 4, 10, 30):

(I have not yet decided on the addition of the plotting-code. It's as ugly as the above one and people might look too much into my statistical shambles; normalizations and co.)

Theory

Statistical analysis is probably hard to do as the underlying problem is of such combinatoric nature. It's even not entirely obvious how that final cell-PDF should look like. In the case of N=odd, it's probably non-uniform and looks like a chess-board (i did brute-force check N=5 to observe this).

One thing we can be sure about (imho): symmetry!

Given a cell-PDF matrix, we should expect, that the matrix is symmetric (A = A.T). This is checked in the visualization and the euclidean-norm of differences over time is plotted.

We can do the same on some other observation: observed pairings.

For N=3, we can observe the following pairs:

Now we can do this per-row and per-column and should expect symmetry too!

Sadly, it's probably not easy to say something about the variance and therefore the needed samples to speak about confidence!

Observation

According to my simplified perception, current-samples and the cell-PDF look good, although convergence is not achieved yet (or we are far away from uniformity).

The more important aspect are probably the two norms, nicely decreasing towards 0. (yes; one could tune some algorithm for that by transposing with prob=0.5; but this is not done here as it would defeat it's purpose).

Potential next steps

Tune parameters
Check out the approach using #SAT-solvers / Model-counters instead of SAT-solvers
Try different CNF-formulations, especially in regards to cardinality-encodings and xor-encodings
- XorSample is by default using tseitin-like encoding to get around exponentially grow
  - for smaller xors (as used) it might be a good idea to use naive encoding (which propagates faster)
    - XorSample supports that in theory; but the script's work differently in practice
    - Cryptominisat is known for dedicated XOR-handling (as it was build for analyzing cryptography including many xors) and might gain something by naive encoding (as inferring xors from blown-up CNFs is much harder)
More statistical-analysis
Get rid of XorSample scripts (shell + perl...)

Summary

The approach is very general
This code produces feasible samples
It should be not hard to prove, that every feasible solution can be sampled
Others have proven theoretical guarantees for uniformity for some params
- does not hold for our params
Others have empirically / theoretically analyzed smaller parameters (in use here)

0 讨论(0)

查看其它5个回答