Checking and preventing similar strings while insertion in MySQL

Posted by 三世轮回 on 2019-12-06 01:23:36

Question


Brief info

I have 3 tables:

Set:

id
name

SetItem:

set_id
item_id
position

TempSet:

id

I have a function that generates new random combinations from the Item table. After each successful generation, I create a new row in the Set table, get its id, and add all the item ids to the SetItem table.

Problem

Before generating each new combination, I truncate the TempSet table, fill it with the new item ids, and check the similarity percentage against the previous combinations in the SetItem table. If the new combination's similarity is greater than or equal to 30%, I need to reject it and generate another one.

Similarity means the presence of the same elements in a previously generated combination. So, the idea is:

if more than 3 elements of the newly generated set are repeated in some previously generated set, reject it and try to generate another combination.

Here is the function that generates new combinations:

  CREATE DEFINER = `root` @`localhost` FUNCTION `gen_uniq_perm_by_kw` (
    comboSize INT ( 5 ),
    tries INT ( 3 ) 
    ) RETURNS text CHARSET utf8 SQL SECURITY INVOKER BEGIN
    iterat :
    LOOP
        DELETE 
        FROM
            `TempSet`;
        INSERT INTO `TempSet` ( `id` ) (
            SELECT
                `i`.`id` 
            FROM
                `Item` AS `i`
            ORDER BY
                RAND( ) 
                LIMIT comboSize 
            );
        IF
            (
            SELECT
                1 
            FROM
                `SetItem` 
            GROUP BY
                `set_id` 
            HAVING
                sum(
                CASE

                        WHEN EXISTS (
                        SELECT
                            id 
                        FROM
                            `TempSet` 
                        WHERE
                            `id` = `item_id` 
                            LIMIT 1 
                            ) THEN
                            1 ELSE 0 
                        END 
                        ) / count( 1 ) * 100 >= 30 
                        LIMIT 1 
                        ) < 1 THEN
                        RETURN ( SELECT GROUP_CONCAT( id SEPARATOR '-' ) FROM `TempSet` );

                END IF;

                SET tries := tries - 1;
                IF
                    tries = 0 THEN
                        RETURN NULL;

                END IF;

            END LOOP iterat;

END

When I test it, it returns NULL even when none of the newly generated combination's elements exist in any previously generated combination.

My question is, what am I doing wrong?


Answer 1:


My question is, what am I doing wrong?

You don't have any data in your SetItem table.

Edit: You commented that this is wrong; you do have 300k rows in SetItem.


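I got an example working. The scalar subquery comparison is the problem: when no set reaches the threshold, the subquery returns no rows and evaluates to NULL, and NULL < 1 is NULL rather than true; when some set does reach it, the subquery returns 1, and 1 < 1 is false. Either way the IF branch is never taken, so the function runs out of tries and returns NULL. Using EXISTS instead works: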

DROP FUNCTION IF EXISTS gen_uniq_perm_by_kw;
DELIMITER ;;
CREATE DEFINER = `root` @`localhost` FUNCTION `gen_uniq_perm_by_kw` (comboSize INT, tries INT) RETURNS text CHARSET utf8 SQL SECURITY INVOKER
BEGIN
        iterat :
        LOOP
                DELETE FROM `TempSet`;

                INSERT INTO `TempSet` (`id`)
                SELECT `i`.`id` FROM `Item` AS `i` ORDER BY RAND() LIMIT comboSize;

                -- Accept the candidate only if NO previous set shares 30% or more of its items with it.
                IF NOT EXISTS(
                        SELECT set_id,
                                SUM(CASE WHEN EXISTS (SELECT id FROM `TempSet` WHERE `id` = `item_id` LIMIT 1) THEN 1 ELSE 0 END) AS group_sum,
                                COUNT(*) AS group_count
                        FROM `SetItem`
                        GROUP BY `set_id`
                        HAVING group_sum * 10 / group_count >= 3
                ) THEN
                        RETURN (SELECT GROUP_CONCAT(id SEPARATOR '-') FROM `TempSet`);
                END IF;

                SET tries = tries - 1;

                IF tries = 0 THEN
                        RETURN NULL;
                END IF;
        END LOOP iterat;
END;;

DELIMITER ;

I also got it working in a simpler way, without using the SUM and extra subquery:

DROP FUNCTION IF EXISTS gen_uniq_perm_by_kw;
DELIMITER ;;
CREATE DEFINER = `root` @`localhost` FUNCTION `gen_uniq_perm_by_kw` (comboSize INT, tries INT) RETURNS text CHARSET utf8 SQL SECURITY INVOKER
BEGIN
        iterat :
        LOOP
                DELETE FROM `TempSet`;

                INSERT INTO `TempSet` (`id`)
                SELECT `i`.`id` FROM `Item` AS `i` ORDER BY RAND() LIMIT comboSize;

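                -- Accept the candidate only if NO previous set shares 30% or more of its items with it.
                -- COUNT(t.id) counts only the SetItem rows that found a match in TempSet.
                IF NOT EXISTS(
                        SELECT s.set_id,
                                COUNT(t.id) AS group_matches,
                                COUNT(*) AS group_count
                        FROM SetItem AS s LEFT OUTER JOIN TempSet AS t ON t.id = s.item_id
                        GROUP BY s.set_id
                        HAVING group_matches * 10 / group_count >= 3
                ) THEN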
                        RETURN (SELECT GROUP_CONCAT(id SEPARATOR '-') FROM `TempSet`);
                END IF;

                SET tries = tries - 1;

                IF tries = 0 THEN
                        RETURN NULL;
                END IF;
        END LOOP iterat;
END;;

DELIMITER ;
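
To see why the original `IF ( SELECT 1 ... ) < 1` test never fires, the comparison can be reproduced in miniature. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in, on the assumption that SQLite treats an empty scalar subquery and a comparison with NULL the same way MySQL does in this case. The `WHERE 0` and `WHERE 1` filters are hypothetical stand-ins for "no set reaches 30%" and "some set reaches 30%".

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# No row survives the filter: the scalar subquery is NULL, and NULL < 1 is NULL.
no_match = conn.execute("SELECT (SELECT 1 WHERE 0) < 1").fetchone()[0]

# A row survives the filter: the scalar subquery is 1, and 1 < 1 is false (0).
match = conn.execute("SELECT (SELECT 1 WHERE 1) < 1").fetchone()[0]

print(no_match)  # None (SQL NULL), which an IF treats as not true
print(match)     # 0 (false)
conn.close()
```

Neither case yields a true condition, which matches the observed behaviour of always falling through to RETURN NULL.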



Answer 2:


If you are willing to get a bit loosey-goosey with the "same", consider this alternative:

  • Instead of hashes written as numbers, consider the hash to be a bit position in a bit string.
  • BIT_COUNT(a.bits & b.bits) gives the number of bits that are the same in both bitstrings.

In older versions of MySQL, you are limited to 64-bit BIGINT UNSIGNED strings unless you are willing to have a bunch of them and add up the counts. (I've written code for such.) With newer versions, BLOB can be used with this operation.

Regardless of the hashing (numbers, 0..63, blob), there is a chance of collisions. In many applications, this can be ignored as a minor 'noise' in the system. What is your situation?
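
The bit-string idea can be sketched outside the database like this. It is a rough illustration, not a drop-in implementation: the `id % 64` mapping is an assumed, deliberately simple hash onto 64 bit positions, and the helper names `to_bits` and `shared_bits` are hypothetical. `shared_bits` plays the role of `BIT_COUNT(a.bits & b.bits)`.

```python
def to_bits(item_ids, width=64):
    """Fold each item id onto one of `width` bit positions (id % width is an assumed hash)."""
    bits = 0
    for item_id in item_ids:
        bits |= 1 << (item_id % width)
    return bits


def shared_bits(a, b):
    """Count the bits set in both values, like MySQL's BIT_COUNT(a & b)."""
    return bin(a & b).count("1")


previous = to_bits([13446, 10860, 10885, 10853, 13048])
candidate = to_bits([13446, 10860, 10885, 19888, 6595])
overlap = shared_bits(previous, candidate)
print(overlap)

# Ids 1 and 65 land on the same bit position, so they are indistinguishable:
# this is the collision "noise" mentioned above.
collided = to_bits([1, 65])
print(collided == to_bits([1]))
```

With a 64-bit width, any two ids that differ by a multiple of 64 collide, so the overlap count is an upper bound on the true number of shared items.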

What I propose is a lot faster and possibly smaller (data-wise) than the design you outlined.




Answer 3:


Instead of checking against the MD5 checksum, you should check whether 3 or more of your 10 freshly generated item_ids occur within one set_id.

Don't check:

SELECT  count( * ) 
FROM    `Set` 
WHERE   `Set`.`hash` = @md5 
LIMIT 1 

Instead, check:

select  1
from    SetItem
where   item_id in ( a, b, c /* ... put your 10 freshly generated item ids here */ )
group by set_id
having  count(1) >= 3
limit   1

This query returns 1 when some set contains 3 or more of the item_ids in your IN list.

When the number of items varies (not always 10) then you could also count the items in a group to calculate the percentage:

select  1
from    SetItem
group by set_id
having  sum( 
          case when find_in_set(item_id , @list)
          then 1
          else 0
          end
        ) / count(1) * 100 >= 30
limit 1;

@list should be a comma-separated string; see FIND_IN_SET: https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_find-in-set
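
For reference, the percentage rule in that last query can be sketched in plain Python. This is an illustration of the logic only, not of how MySQL executes it; the function name `is_too_similar` and the sample data are hypothetical. As in the query, the percentage is taken relative to the size of each previous set.

```python
def is_too_similar(candidate, previous_sets, threshold_pct=30):
    """Return True if any previous set shares >= threshold_pct of its items with candidate."""
    wanted = set(candidate)
    for items in previous_sets.values():
        if not items:
            continue  # an empty set cannot overlap with anything
        # Same idea as SUM(CASE WHEN FIND_IN_SET(item_id, @list) THEN 1 ELSE 0 END)
        shared = sum(1 for item_id in items if item_id in wanted)
        if shared * 100 / len(items) >= threshold_pct:
            return True
    return False


# Hypothetical previous sets, keyed by set_id.
previous_sets = {
    1: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    2: [21, 22, 23, 24, 25],
}

print(is_too_similar([1, 2, 3, 40, 41], previous_sets))   # 3 of 10 items in set 1
print(is_too_similar([1, 2, 40, 41, 42], previous_sets))  # 2 of 10 items in set 1
```

A generator would call such a check in a loop and only store the candidate once it returns False.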




Answer 4:


If you are using MySQL 5.7 with the JSON functions available, you can do this:

set @v1 = CONCAT("[", (SELECT group_concat(r.id SEPARATOR ',') FROM (select o.id from test.item o order by rand() limit 10) r), "]");

Now @v1 holds a JSON array of 10 random item ids.

This query shows how much each set collides with that array:

select set_id, @v1 as serie, count(*) * 10 as CollisionPercentage from test.setitem 
where JSON_CONTAINS(@v1, JSON_ARRAY(item_id))
group by set_id, serie
order by CollisionPercentage desc;

The CollisionPercentage of the first row is the maximum collision.

EDITED

Try this. Pay attention to the database names used in the code.

DROP FUNCTION IF EXISTS gen_uniq_perm_by_kw;
DELIMITER ;;

CREATE DEFINER = `root` @`localhost` FUNCTION `gen_uniq_perm_by_kw` (comboSize INT, tries INT, collisions INT) RETURNS text CHARSET utf8 SQL SECURITY INVOKER
BEGIN
        iterat :
        LOOP
                set @v1 = CONCAT(
                    '[', 
                    (SELECT group_concat(r.id SEPARATOR ',') FROM (select o.id from test.item o order by rand() limit comboSize) r), 
                    ']'
                );


                -- Accept the candidate only if no previous set shares `collisions` or more items with it.
                -- Note that `collisions` is compared with a count of shared items, not a percentage.
                IF NOT EXISTS(
                select set_id from test.setitem 
                    where JSON_CONTAINS(@v1, JSON_ARRAY(item_id))
                    group by set_id
                    having count(*) >= collisions
                ) THEN
                        RETURN @v1;
                END IF;

                SET tries = tries - 1;

                IF tries = 0 THEN
                        RETURN NULL;
                END IF;
        END LOOP iterat;
END;;


DELIMITER ;

select gen_uniq_perm_by_kw(5,5,30);

With result

+--------------------------------------+
| test.gen_uniq_perm_by_kw(5,5,30)     |
+--------------------------------------+
| [30111,10916,13446,6617,10918]       |
+--------------------------------------+
1 row in set (0.00 sec)



Answer 5:


CREATE TABLE `combinations` (
    `id` INT(10) NOT NULL,
    `nb` INT(10) NOT NULL,
    `orderid` INT(10) NOT NULL,
    PRIMARY KEY (`id`, `orderid`),
    INDEX `On_nb` ( `nb`,`id`)
)
COLLATE='utf8_bin'
ENGINE=InnoDB
;


insert into `combinations` values
(1, 13446,1),
(1, 10860,2),
(1, 10885,3),
(1, 10853,4),
(1, 13048,5),
(1, 13044,6),
(1, 10918,7),
(1, 10916,8),
(1, 6519,9),
(1, 10860,10),
(2, 13527,1),
(2, 10933,2),
(2, 10928,3),
(2, 10922,4),
(2, 6595,5),
(2, 10944,6),
(2, 13446,7),
(2, 10860,8),
(2, 10885,9),
(2, 19888,10),
(3, 13364,1),
(3, 12949,2),
(3, 6732,3),
(3, 6763,4),
(3, 13542,5),
(3, 6617,6),
(3, 13125,7),
(3, 13058,8),
(3, 13059,9),
(3, 30111,10);


select c1.id, count(c1.nb)
from `combinations` as c1, `combinations` as c_ori
where c1.nb = c_ori.nb and c_ori.id = 2 and c1.id != c_ori.id
group by c1.id
having count(c1.nb) >= 3;

"id"    "count(c1.nb)"
"1" "4"

Because this last query returns a row, combination 2 already overlaps an existing combination (combination 1) by at least 30%. Note that in your first combination the number 10860 appears twice, which is why the count is 4 rather than 3; this query does not handle duplicated numbers properly. Do you need duplicated numbers in your combinations?
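
The effect of the duplicate can be checked outside the database with the sample rows for combinations 1 and 2. This is a small sketch of the counting only: `join_count` mimics what the self-join counts, and `distinct_count` shows what counting distinct numbers would give instead (both variable names are hypothetical).

```python
# nb values of combinations 1 and 2 from the sample data above.
combination_1 = [13446, 10860, 10885, 10853, 13048, 13044, 10918, 10916, 6519, 10860]
combination_2 = [13527, 10933, 10928, 10922, 6595, 10944, 13446, 10860, 10885, 19888]

# The self-join produces one row per matching pair, so a number that appears
# twice in combination 1 is counted twice.
join_count = sum(1 for a in combination_1 for b in combination_2 if a == b)

# Counting distinct shared numbers ignores the repeated 10860.
distinct_count = len(set(combination_1) & set(combination_2))

print(join_count, distinct_count)
```

If duplicates are not wanted, either prevent them when generating a combination or count distinct numbers in the query.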



Source: https://stackoverflow.com/questions/45057324/checking-and-preventing-similar-strings-while-insertion-in-mysql
