Check existence of distinct values for each group

Question

EDITED:

Suppose I have the following table in MySQL:

CREATE TABLE `events` (
`pv_name` varchar(60) COLLATE utf8mb4_unicode_ci NOT NULL,
`time_stamp` bigint(20) UNSIGNED NOT NULL,
`value` text CHARACTER SET utf8mb4 COLLATE utf8mb4_bin,
PRIMARY KEY (`pv_name`, `time_stamp`)
) ENGINE=InnoDB;

I can find each pv_name that has more than one distinct value in this table using the following query:

SELECT events.pv_name
FROM events
GROUP BY events.pv_name
HAVING COUNT(DISTINCT events.value) > 1;

The issue is that this query is not efficient. It counts all of the distinct values instead of stopping after finding more than one.

One suggestion has been the following:

SELECT events.pv_name
FROM events
GROUP BY events.pv_name
HAVING MIN(events.value) < MAX(events.value);

This is efficient if the index includes value. However, value is a text column, so it cannot be indexed in full; MySQL only allows a prefix of a TEXT column in an index.

Is there another approach that would make this search more efficient? Some form of correlated subquery perhaps? I would like to stay with MySQL, but if there is a feature in another database server that would help this I might consider moving to it.

Answer 1:

To answer your question, it is probably best to avoid group by or distinct. First, though, I would suggest adding an auto-incremented event_id for the table. This makes it possible to determine whether or not two rows are the same.

So, I would suggest the following query:

select e.*
from events e
where e.time_stamp between $ts1 and $ts2 and
      exists (select 1
              from events e2
              where e2.pv_name = e.pv_name and
                    e2.time_stamp between $ts1 and $ts2 and
                    e2.event_id < e.event_id
             );

You also want indexes: events(time_stamp, pv_name, event_id) and events(pv_name, time_stamp, event_id).
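As a rough sketch, the extra column and the two indexes suggested above could be added in one statement. The key names (uq_event_id, idx_ts_pv_id, idx_pv_ts_id) are made up for illustration, and the unique key on event_id is an assumption added here because MySQL requires an AUTO_INCREMENT column to be indexed.

```sql
-- Sketch only: key names are hypothetical, not from the original answer.
ALTER TABLE events
  -- Surrogate id proposed in the answer; lets two rows be told apart.
  ADD COLUMN event_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  -- Assumption: an AUTO_INCREMENT column must be part of some key.
  ADD UNIQUE KEY uq_event_id (event_id),
  -- The two indexes named in the answer.
  ADD KEY idx_ts_pv_id (time_stamp, pv_name, event_id),
  ADD KEY idx_pv_ts_id (pv_name, time_stamp, event_id);
```

On a large table this ALTER rebuilds the table, so it is worth trying on a copy first.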

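This returns every event in the time range that has an earlier event (one with a smaller event_id) for the same pv_name. To get only the names, you can use select distinct pv_name. However, that incurs a bunch of extra processing to remove the duplicates.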
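The select distinct pv_name variant mentioned above might look roughly like the sketch below. Two things in it are assumptions rather than part of the answer: the user variables @ts1 and @ts2 (with made-up values) stand in for the $ts1 and $ts2 placeholders, and the extra comparison on value is added so that the query matches the question's goal of finding names with more than one distinct value.

```sql
-- Hypothetical time range; replace with real bounds.
SET @ts1 = 1445000000000, @ts2 = 1445100000000;

SELECT DISTINCT e.pv_name
FROM events e
WHERE e.time_stamp BETWEEN @ts1 AND @ts2
  AND EXISTS (SELECT 1
              FROM events e2
              WHERE e2.pv_name = e.pv_name
                AND e2.time_stamp BETWEEN @ts1 AND @ts2
                AND e2.event_id < e.event_id
                -- Assumption: only count pairs whose values differ.
                -- Rows with a NULL value never match, which is consistent
                -- with COUNT(DISTINCT value) ignoring NULLs.
                AND e2.value <> e.value);
```

This relies on the event_id column proposed earlier in the answer; it does not exist in the original table definition.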

Answer 2:

-- 'start_time' and 'end_time' are placeholders for the bounds of the time range.
SELECT * FROM events WHERE pv_name IN
(SELECT pv_name FROM events GROUP BY pv_name HAVING COUNT(*) > 1) AND
 time_stamp BETWEEN 'start_time' AND 'end_time';

SELECT pv_name FROM events GROUP BY pv_name HAVING MIN(time_stamp) < MAX(time_stamp);

This may work.

Answer 3:

I believe the following may work? Can it be improved upon?

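-- Chooses a single non-null `value` from the `events` table for each `pv_name`.
-- ENGINE=Memory is omitted: MEMORY tables cannot contain TEXT columns.
-- Selecting the non-aggregated `value` requires ONLY_FULL_GROUP_BY to be disabled
-- (on MySQL 5.7+, ANY_VALUE(events.value) AS value avoids that requirement).
CREATE TEMPORARY TABLE single_values ( PRIMARY KEY (pv_name) ) AS (
SELECT events.pv_name, events.value
FROM events
WHERE events.value IS NOT NULL
GROUP BY events.pv_name );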

-- Finds each `pv_name` that has a `value` different than the one for it in `single_values`.
-- This is a correlated subquery.
SELECT single_values.pv_name
FROM single_values
WHERE 1 = (
SELECT 1
FROM events
WHERE events.pv_name = single_values.pv_name
AND events.value <> single_values.value
AND events.value IS NOT NULL
LIMIT 1 );
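One possible refinement, offered as a sketch and not as a measured improvement, is to fold both steps into a single statement so that no temporary table is needed. The aliases n, e and e0 are illustrative. The reference value for each pv_name is taken to be its earliest non-null value, which is an arbitrary choice made here; any non-null value of the group would do.

```sql
SELECT n.pv_name
FROM (SELECT DISTINCT pv_name FROM events) AS n
WHERE EXISTS (
  SELECT 1
  FROM events e
  WHERE e.pv_name = n.pv_name
    -- Compare against one reference value for this pv_name.
    -- If the group has no non-null value, the scalar subquery yields NULL,
    -- the comparison is NULL, and the pv_name is not returned.
    AND e.value <> (SELECT e0.value
                    FROM events e0
                    WHERE e0.pv_name = n.pv_name
                      AND e0.value IS NOT NULL
                    ORDER BY e0.time_stamp
                    LIMIT 1)
);
```

Because EXISTS only needs one matching row, the inner search can stop at the first differing value for a given pv_name. Whether that beats the temporary-table version depends on the data and the optimizer, so it is worth checking with EXPLAIN.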

Source: https://stackoverflow.com/questions/33226821/check-existence-of-distinct-values-for-each-group

Tags

mysql

sql

group-by

query-optimization

distinct