Question
Brief info
I have 3 tables:
Set: id, name
SetItem: set_id, item_id, position
TempSet: id
I have a function that generates new random combinations from the Item table. After each successful generation, I create a new row in the Set table, get its id, and add all the item ids to the SetItem table.
Problem
Every time, before generating a new combination, I empty the TempSet table, fill it with the new item ids, and check the similarity percentage by comparing against the previous combinations in the SetItem table. If the similarity of the new combination is greater than or equal to 30%, I need to reject it and generate a new combination.
Similarity means how many elements of the new combination already exist in a previously generated combination. So the idea is: if 30% or more of the elements (for example, 3 out of 10) of the newly generated set are repeated in some previously generated set, reject it and try to generate another combination.
Here is the function that generates new combinations:
CREATE DEFINER = `root`@`localhost` FUNCTION `gen_uniq_perm_by_kw` (
    comboSize INT(5),
    tries INT(3)
) RETURNS text CHARSET utf8 SQL SECURITY INVOKER
BEGIN
    iterat: LOOP
        DELETE FROM `TempSet`;
        INSERT INTO `TempSet` (`id`)
            SELECT `i`.`id` FROM `Item` AS `i` ORDER BY RAND() LIMIT comboSize;
        IF (
            SELECT 1
            FROM `SetItem`
            GROUP BY `set_id`
            HAVING SUM(
                CASE
                    WHEN EXISTS (SELECT id FROM `TempSet` WHERE `id` = `item_id` LIMIT 1) THEN 1
                    ELSE 0
                END
            ) / COUNT(1) * 100 >= 30
            LIMIT 1
        ) < 1 THEN
            RETURN (SELECT GROUP_CONCAT(id SEPARATOR '-') FROM `TempSet`);
        END IF;
        SET tries := tries - 1;
        IF tries = 0 THEN
            RETURN NULL;
        END IF;
    END LOOP iterat;
END
When I test it, it returns NULL even when none of the newly generated combination's elements exist in any previously generated combination.
My question is: what am I doing wrong?
Answer 1:
My question is, what am I doing wrong?
You don't have any data in your SetItem table.
Edit: You commented that this is wrong; you do have 300k rows in SetItem.
I got an example working. It appears that you can't use a scalar subquery like you're doing. I got it working this way:
DROP FUNCTION IF EXISTS gen_uniq_perm_by_kw;
DELIMITER ;;
CREATE DEFINER = `root`@`localhost` FUNCTION `gen_uniq_perm_by_kw` (comboSize INT, tries INT) RETURNS text CHARSET utf8 SQL SECURITY INVOKER
BEGIN
    iterat: LOOP
        DELETE FROM `TempSet`;
        INSERT INTO `TempSet` (`id`)
            SELECT `i`.`id` FROM `Item` AS `i` ORDER BY RAND() LIMIT comboSize;
        -- Accept the candidate only if no existing set reaches 30% similarity.
        IF NOT EXISTS(
            SELECT set_id,
                SUM(CASE WHEN EXISTS (SELECT id FROM `TempSet` WHERE `id` = `item_id` LIMIT 1) THEN 1 ELSE 0 END) AS group_sum,
                COUNT(*) AS group_count
            FROM `SetItem`
            GROUP BY `set_id`
            HAVING group_sum * 10 / group_count >= 3
        ) THEN
            RETURN (SELECT GROUP_CONCAT(id SEPARATOR '-') FROM `TempSet`);
        END IF;
        SET tries = tries - 1;
        IF tries = 0 THEN
            RETURN NULL;
        END IF;
    END LOOP iterat;
END;;
DELIMITER ;
I also got it working in a simpler way, without using the SUM and extra subquery:
DROP FUNCTION IF EXISTS gen_uniq_perm_by_kw;
DELIMITER ;;
CREATE DEFINER = `root`@`localhost` FUNCTION `gen_uniq_perm_by_kw` (comboSize INT, tries INT) RETURNS text CHARSET utf8 SQL SECURITY INVOKER
BEGIN
    iterat: LOOP
        DELETE FROM `TempSet`;
        INSERT INTO `TempSet` (`id`)
            SELECT `i`.`id` FROM `Item` AS `i` ORDER BY RAND() LIMIT comboSize;
        -- COUNT(t.id) counts only the SetItem rows that found a match in TempSet.
        -- Accept the candidate only if no existing set reaches 30% similarity.
        IF NOT EXISTS(
            SELECT s.set_id,
                COUNT(t.id) AS group_matches,
                COUNT(*) AS group_count
            FROM SetItem AS s LEFT OUTER JOIN TempSet AS t ON t.id = s.item_id
            GROUP BY s.set_id
            HAVING group_matches * 10 / group_count >= 3
        ) THEN
            RETURN (SELECT GROUP_CONCAT(id SEPARATOR '-') FROM `TempSet`);
        END IF;
        SET tries = tries - 1;
        IF tries = 0 THEN
            RETURN NULL;
        END IF;
    END LOOP iterat;
END;;
DELIMITER ;
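A likely explanation for why the IF in the question never fires, offered as a minimal sketch rather than a confirmed diagnosis: a scalar subquery that matches no rows evaluates to NULL, and NULL < 1 is NULL, which IF treats as not true. When a row does match, the subquery yields 1, and 1 < 1 is false. Either way the RETURN branch is skipped. The statements below use DUAL with constant conditions purely for illustration; they are not taken from the question's schema.

```sql
-- A scalar subquery with no matching rows yields NULL,
-- so the comparison result is NULL (not true).
SELECT (SELECT 1 FROM DUAL WHERE 1 = 0 LIMIT 1) < 1 AS compare_when_empty;

-- When a row matches, the subquery yields 1, and 1 < 1 is false (0).
SELECT (SELECT 1 FROM DUAL WHERE 1 = 1 LIMIT 1) < 1 AS compare_when_matched;

-- NOT EXISTS states "no such row" directly and is true (1) in the empty case.
SELECT NOT EXISTS (SELECT 1 FROM DUAL WHERE 1 = 0) AS no_row_found;
```

Testing for the absence of a similar set with NOT EXISTS avoids the NULL comparison entirely.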
Answer 2:
If you are willing to get a bit loosey-goosey with the "same", consider this alternative:
- Instead of hashes written as numbers, consider the hash to be a bit position in a bit string.
- BIT_COUNT(a.bits & b.bits) gives the number of bits that are set in both bit strings.
In older versions of MySQL, you are limited to 64-bit BIGINT UNSIGNED strings unless you are willing to have a bunch of them and add up the counts. (I've written code for such.) With newer versions, BLOB can be used with this operation.
Regardless of the hashing (numbers, 0..63, blob), there is a chance of collisions. In many applications, this can be ignored as minor 'noise' in the system. What is your situation?
What I propose is a lot faster and possibly smaller (data-wise) than the design you outlined.
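A minimal sketch of the bit-string idea, under stated assumptions: the hash `item_id % 64` is a hypothetical choice made only for illustration (two different ids can land on the same bit, so the count is approximate), ids are assumed non-negative, and the table names SetItem and TempSet are borrowed from the question. The per-set mask is computed on the fly here; storing it in a column of its own would be the faster design, but that column is not part of the original schema.

```sql
-- BIT_COUNT of the AND counts positions set in both masks:
-- 11 = b'1011', 6 = b'0110', 11 & 6 = b'0010', so the result is 1.
SELECT BIT_COUNT(11 & 6) AS shared_bits;

-- Hypothetical hashing: map each candidate item id to one of 64 bit positions
-- and OR the bits together into a single BIGINT UNSIGNED mask.
SET @new_bits = (SELECT BIT_OR(1 << (id % 64)) FROM TempSet);

-- List the existing sets whose mask shares 3 or more bits with the candidate.
SELECT s.set_id,
       BIT_COUNT(BIT_OR(1 << (s.item_id % 64)) & @new_bits) AS shared_bits
FROM SetItem AS s
GROUP BY s.set_id
HAVING shared_bits >= 3;
```

An empty result from the last query would mean that, up to hash collisions, no existing set shares 3 or more items with the candidate.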
Answer 3:
Instead of checking against the MD5 checksum, you should check whether 3 or more of your 10 freshly generated item_ids occur within a single set_id.
Do not check:
SELECT count( * )
FROM `Set`
WHERE `Set`.`hash` = @md5
LIMIT 1
Instead, check:
select 1
from setitems
where item_id in ( a,b,c, put here your 10 fresh generated item )
group by set_id
having count(1) >= 3
limit 1
This query returns 1 when there is a set containing 3 or more of the item_ids in your IN list.
When the number of items varies (not always 10), you can also count the items in each group to calculate the percentage:
select 1
from setitems
group by set_id
having sum(
case when find_in_set(item_id , @list)
then 1
else 0
end
) / count(1) * 100 >= 30
limit 1;
@list should be a comma-separated string; see https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_find-in-set
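As a rough sketch of how @list could be filled and the check wrapped into a yes/no answer: the statements below assume the candidate ids sit in the question's TempSet table and that previous sets are in SetItem (the queries above call it setitems). GROUP_CONCAT uses a comma with no spaces by default, which is the format FIND_IN_SET expects; its output is capped by group_concat_max_len, so very large candidate sets would need that limit raised.

```sql
-- Build the comma-separated list from the candidate ids.
SET @list = (SELECT GROUP_CONCAT(id) FROM TempSet);

-- candidate_ok is 1 when no existing set has 30% or more of its items
-- in the candidate list, and 0 otherwise.
SELECT NOT EXISTS (
    SELECT 1
    FROM SetItem
    GROUP BY set_id
    HAVING SUM(CASE WHEN FIND_IN_SET(item_id, @list) THEN 1 ELSE 0 END)
           / COUNT(1) * 100 >= 30
) AS candidate_ok;
```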
Answer 4:
If you are working with MySQL 5.7 and the JSON functions are available, you can do this:
set @v1 = CONCAT("[", (SELECT group_concat(r.id SEPARATOR ',') FROM (select o.id from test.item o order by rand() limit 10) r), "]");
Now @v1 holds a JSON array of 10 random item ids.
This query shows how much each existing set collides with that array:
select set_id, @v1 as serie, count(*) * 10 as CollisionPercentage from test.setitem
where JSON_CONTAINS(@v1, JSON_ARRAY(item_id))
group by set_id, serie
order by CollisionPercentage desc;
The first CollisionPercentage value in the result is the maximum collision.
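To make the membership test concrete, here is a small sketch with literal values chosen only for illustration. JSON_CONTAINS(target, candidate) returns 1 when every element of the candidate array is found in the target array. The last statement is a hedged generalization of the fixed factor of 10: it divides by JSON_LENGTH(@v1), so the percentage is relative to the size of the candidate array, whatever that size is.

```sql
-- The one-element array [2] is contained in [1,2,3], so this returns 1.
SELECT JSON_CONTAINS('[1,2,3]', JSON_ARRAY(2)) AS contained;

-- 4 is not an element of [1,2,3], so this returns 0.
SELECT JSON_CONTAINS('[1,2,3]', JSON_ARRAY(4)) AS contained;

-- Percentage of the candidate array @v1 found in each existing set,
-- without assuming the array has exactly 10 elements.
SELECT set_id, COUNT(*) * 100 / JSON_LENGTH(@v1) AS CollisionPercentage
FROM test.setitem
WHERE JSON_CONTAINS(@v1, JSON_ARRAY(item_id))
GROUP BY set_id
ORDER BY CollisionPercentage DESC;
```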
EDITED
Try this. Pay attention to the database name used in the queries (test).
DROP FUNCTION IF EXISTS gen_uniq_perm_by_kw;
DELIMITER ;;
CREATE DEFINER = `root`@`localhost` FUNCTION `gen_uniq_perm_by_kw` (comboSize INT, tries INT, collisions INT) RETURNS text CHARSET utf8 SQL SECURITY INVOKER
BEGIN
    iterat: LOOP
        -- Build a JSON array of comboSize random item ids.
        SET @v1 = CONCAT(
            '[',
            (SELECT GROUP_CONCAT(r.id SEPARATOR ',') FROM (SELECT o.id FROM test.item o ORDER BY RAND() LIMIT comboSize) r),
            ']'
        );
        -- Accept the candidate only if no existing set shares `collisions` or more items with it.
        IF NOT EXISTS(
            SELECT set_id FROM test.setitem
            WHERE JSON_CONTAINS(@v1, JSON_ARRAY(item_id))
            GROUP BY set_id
            HAVING COUNT(*) >= collisions
        ) THEN
            RETURN @v1;
        END IF;
        SET tries = tries - 1;
        IF tries = 0 THEN
            RETURN NULL;
        END IF;
    END LOOP iterat;
END;;
DELIMITER ;
select gen_uniq_perm_by_kw(5,5,30);
With result
+--------------------------------------+
| test.gen_uniq_perm_by_kw(5,5,30) |
+--------------------------------------+
| [30111,10916,13446,6617,10918] |
+--------------------------------------+
1 row in set (0.00 sec)
Answer 5:
CREATE TABLE `combinations` (
`id` INT(10) NOT NULL,
`nb` INT(10) NOT NULL,
`orderid` INT(10) NOT NULL,
PRIMARY KEY (`id`, `orderid`),
INDEX `On_nb` ( `nb`,`id`)
)
COLLATE='utf8_bin'
ENGINE=InnoDB
;
insert into `combinations` values
(1, 13446,1),
(1, 10860,2),
(1, 10885,3),
(1, 10853,4),
(1, 13048,5),
(1, 13044,6),
(1, 10918,7),
(1, 10916,8),
(1, 6519,9),
(1, 10860,10),
(2, 13527,1),
(2, 10933,2),
(2, 10928,3),
(2, 10922,4),
(2, 6595,5),
(2, 10944,6),
(2, 13446,7),
(2, 10860,8),
(2, 10885,9),
(2, 19888,10),
(3, 13364,1),
(3, 12949,2),
(3, 6732,3),
(3, 6763,4),
(3, 13542,5),
(3, 6617,6),
(3, 13125,7),
(3, 13058,8),
(3, 13059,9),
(3, 30111,10);
select c1.id, count(c1.nb) from `combinations` as c1, `combinations` as c_ori
where c1.nb = c_ori.nb and c_ori.id = 2 and c1.id != c_ori.id
group by c1.id having count(c1.nb) >= 3;
"id" "count(c1.nb)"
"1" "4"
As this last query returns a row, combination 2 overlaps an existing combination by at least 30%. Please note that in your first combination the number 10860 appears twice, which is why the count is 4 rather than 3: this algorithm does not handle duplicated numbers properly. Do you require duplicated numbers in your combinations?
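If duplicates inside a combination should count only once, one possible adjustment (a sketch, not part of the original query) is to count distinct numbers. The alias shared_numbers is introduced here for illustration only.

```sql
-- Count each shared number once per combination, even if it is stored twice.
SELECT c1.id, COUNT(DISTINCT c1.nb) AS shared_numbers
FROM `combinations` AS c1
JOIN `combinations` AS c_ori ON c1.nb = c_ori.nb
WHERE c_ori.id = 2 AND c1.id != c_ori.id
GROUP BY c1.id
HAVING COUNT(DISTINCT c1.nb) >= 3;
```

Against the sample rows above, combination 1 shares the numbers 13446, 10860 and 10885 with combination 2, so this variant reports 3 for id 1 instead of 4.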
Source: https://stackoverflow.com/questions/45057324/checking-and-preventing-similar-strings-while-insertion-in-mysql