percentile by COUNT(DISTINCT) with correlated WHERE only works with a view (or without DISTINCT)

I've got a weird one, and I don't know if it's my syntax (which seems straightforward) or a bug (or just unsupported).

Here's my query that works but is needlessly slow:

UPDATE table1 
    SET table1column1 = 
        (SELECT COUNT(DISTINCT table2column1) FROM table2view WHERE table2column1 <= (SELECT table2column1 FROM table2 WHERE table2.id = table1.id) )
        / 
        (SELECT COUNT(DISTINCT table2column1) FROM table2) 

       + (SELECT COUNT(DISTINCT table2column2) FROM table2view WHERE table2column2 <= (SELECT table2column2 FROM table2 WHERE table2.id = table1.id) ) 
        / 
        (SELECT COUNT(DISTINCT table2column2) FROM table2) 

       + (SELECT COUNT(DISTINCT table2column3) FROM table2view WHERE table2column3 <= (SELECT table2column3 FROM table2 WHERE table2.id = table1.id) ) 
        / (SELECT COUNT(DISTINCT table2column3) FROM table2);

It's just the sum of three percentiles (of table2column1, table2column2, and table2column3) with duplicates removed.

Here's where it gets weird. I have to use a view for this to work on the subquery with the WHERE or it will only UPDATE the first row of table1, and set the rest of the rows' table1column1 to 0. That table2view is an exact duplicate of table2. Yeah, weird.

If I don't use DISTINCT, I can do it without the view. Does that make sense? Note: I have to have DISTINCT because I have lots of duplicates.

I tried making it SELECT only from the view, but that slowed it down worse.

Does anyone know what the problem is and the best way to rework this query so it doesn't take so long? It's in a TRIGGER, and the updated data is pretty on demand.

Many thanks in advance!

Details

I'm testing the speed in phpMyAdmin's command line.

I'm pretty sure the degradation is coming from the view since the more of the view and the less of the actual table I use, the slower it gets.

When I do the one without DISTINCT, it's lightning fast.

Only works on views?

OK, so I just set up a copy of table2. I tried first to do the original query substituting the view with the copy. No go.

I tried to do the query below with the copy instead of the view. No go.

Hopefully the introduction of these constants will better show what I'm trying to do.

SET @table2column1_distinct_count = (SELECT COUNT(DISTINCT table2column1) FROM table2);
SET @table2column2_distinct_count = (SELECT COUNT(DISTINCT table2column2) FROM table2);
SET @table2column3_distinct_count = (SELECT COUNT(DISTINCT table2column3) FROM table2);
UPDATE table1, table2
    SET table1.table1column1 = (SELECT COUNT(DISTINCT table2column1) FROM  table2view WHERE table2column1 <= table2.table2column1) / @table2column1_distinct_count 
    + (SELECT COUNT(DISTINCT table2column2) FROM  table2view WHERE table2column2 <= table2.table2column2) / @table2column2_distinct_count 
    + (SELECT COUNT(DISTINCT table2column3) FROM  table2view WHERE table2column3 <= table2.table2column3) / @table2column3_distinct_count 
        WHERE table1.id = table2.id;

Again, when I use table2 instead of the table2view, it only updates the first row properly and sets all other rows' table1.table1column1 = 0.

Math

I'm trying to set table1.table1column1 = to the sum of the percentiles of table2column1, table2column2, and table2column3 by id.

I do a percentile by (counting the distinct values of a table2columnX <= to the current table2columnX ) / (the total count of distinct table2columnXs).

I use DISTINCT to get rid of the excessive duplicates.

View

Here's the SELECT for the view. Does this help?

CREATE VIEW myTable.table2view AS SELECT
    table2.table2column1 AS table2column1,
    table2.table2column2 AS table2column2,
    table2.table2column2 AS table2column3,
FROM table2
GROUP BY table2.id;

Is there something special about the GROUP BY in the view's SELECT that makes this work (that I'm not seeing)?

I would probably say that the query is slow because it is repeatedly accessing the table when the trigger fires.

I am no SQL expert but I have tried to put together a query using temporary tables. You can see if it helps speed up the query. I have used different but similar sounding column names in my code sample below.

EDIT : There was a calculation error in my earlier code. Updated now.

SELECT COUNT(id) INTO @no_of_attempts from tb2;

-- DROP TABLE IF EXISTS S1Percentiles;
-- DROP TABLE IF EXISTS S2Percentiles;
-- DROP TABLE IF EXISTS S3Percentiles;

CREATE TEMPORARY TABLE S1Percentiles (
    s1 FLOAT NOT NULL,
    percentile FLOAT NOT NULL DEFAULT 0.00
);

CREATE TEMPORARY TABLE S2Percentiles (
    s2 FLOAT NOT NULL,
    percentile FLOAT NOT NULL DEFAULT 0.00
);

CREATE TEMPORARY TABLE S3Percentiles (
    s3 FLOAT NOT NULL,
    percentile FLOAT NOT NULL DEFAULT 0.00
);



INSERT INTO S1Percentiles (s1, percentile)
    SELECT A.s1, ((COUNT(B.s1)/@no_of_attempts)*100)
    FROM (SELECT DISTINCT s1 from tb2) A
    INNER JOIN tb2 B
    ON B.s1 <= A.s1
    GROUP BY A.s1;

INSERT INTO S2Percentiles (s2, percentile)
    SELECT A.s2, ((COUNT(B.s2)/@no_of_attempts)*100)
    FROM (SELECT DISTINCT s2 from tb2) A
    INNER JOIN tb2 B
    ON B.s2 <= A.s2
    GROUP BY A.s2;

INSERT INTO S3Percentiles (s3, percentile)
    SELECT A.s3, ((COUNT(B.s3)/@no_of_attempts)*100)
    FROM (SELECT DISTINCT s3 from tb2) A
    INNER JOIN tb2 B
    ON B.s3 <= A.s3
    GROUP BY A.s3;

-- select * from S1Percentiles;
-- select * from S2Percentiles;
-- select * from S3Percentiles;

UPDATE tb1 A
    INNER JOIN
    (
    SELECT B.tb1_id AS id, (C.percentile + D.percentile + E.percentile) AS sum FROM tb2 B
        INNER JOIN S1Percentiles C
        ON B.s1 = C.s1
        INNER JOIN S2Percentiles D
        ON B.s2 = D.s2
        INNER JOIN S3Percentiles E
        ON B.s3 = E.s3
    ) F
    ON A.id = F.id

    SET A.sum = F.sum;

-- SELECT * FROM tb1;

DROP TABLE S1Percentiles;
DROP TABLE S2Percentiles;
DROP TABLE S3Percentiles;

What this does is that it records the percentile for each score group and then finally just updates the tb1 column with the requisite data instead of recalculating the percentile for each student row.

You should also index columns s1, s2 and s3 for optimizing the queries on these columns.

Note: Please update the column names according to your db schema. Also note that each percentile calculation has been multiplied by 100 as I believe that percentile is usually calculated that way.

来源：https://stackoverflow.com/questions/14097690/percentile-by-countdistinct-with-correlated-where-only-works-with-a-view-or-w

标签

mysql

view

count

distinct

where-clause