SQL Query with Cursor optimization

Question

I have a query that iterates through a table and, for each entry, iterates through another table to compute some results. I use a cursor for the outer iteration. The query takes ages to complete: always more than 3 minutes. If I do something similar in C#, where the tables are arrays or dictionaries, it takes less than a second. What am I doing wrong, and how can I improve the efficiency?

DELETE FROM [QueryScores]
GO

INSERT INTO [QueryScores] (Id)
SELECT Id FROM [Documents]

DECLARE @Id NVARCHAR(50)

DECLARE myCursor CURSOR LOCAL FAST_FORWARD FOR
SELECT [Id] FROM [QueryScores]

OPEN myCursor

FETCH NEXT FROM myCursor INTO @Id

WHILE @@FETCH_STATUS = 0
    BEGIN
        DECLARE @Score FLOAT = 0.0

        DECLARE @CounterMax INT = (SELECT COUNT(*) FROM [Query])
        DECLARE @Counter INT = 0

        PRINT 'Document: ' + CAST(@Id AS VARCHAR)
        PRINT 'Score: ' + CAST(@Score AS VARCHAR)

        WHILE @Counter < @CounterMax
            BEGIN

            DECLARE @StemId INT = (SELECT [Query].[StemId] FROM [Query] WHERE [Query].[Id] = @Counter)

            DECLARE @Weight FLOAT = (SELECT [tfidf].[Weight] FROM [TfidfWeights] AS [tfidf] WHERE [tfidf].[StemId] = @StemId AND [tfidf].[DocumentId] = @Id)

            PRINT 'WEIGHT: ' + CAST(@Weight AS VARCHAR)

            IF(@Weight > 0.0)
                BEGIN
                DECLARE @QWeight FLOAT = (SELECT [Query].[Weight] FROM [Query] WHERE [Query].[StemId] = @StemId)
                SET @Score = @Score + (@QWeight * @Weight)
                PRINT 'Score: ' + CAST(@Score AS VARCHAR)
                END

            SET @Counter = @Counter + 1
            END 

        UPDATE [QueryScores] SET Score = @Score WHERE Id = @Id 

        FETCH NEXT FROM myCursor INTO @Id
    END

CLOSE myCursor
DEALLOCATE myCursor

The logic is that I have a list of documents and a question/query. I iterate through every document, with a nested iteration through the query terms to check whether the document contains each term. If it does, I multiply the pre-calculated weights and add the product to the document's score.

Answer 1:

The problem is that you're trying to use a set-based language to iterate through things like a procedural language. SQL requires a different mindset. You should almost never be thinking in terms of loops in SQL.

From what I can gather from your code, this should do what you're trying to do in all of those loops, but it does it in a single statement in a set-based manner, which is what SQL is good at.

INSERT INTO QueryScores (id, score)
SELECT
    D.id,
    SUM(CASE WHEN W.[Weight] > 0 THEN W.[Weight] * Q.[Weight] ELSE NULL END)
FROM
    Documents D
CROSS JOIN Query Q
LEFT OUTER JOIN TfidfWeights W ON W.StemId = Q.StemId AND W.DocumentId = D.id
GROUP BY
    D.id

Of course, without a description of your requirements or sample data with expected output I don't know if this is actually what you're looking to get, but it's my best guess given your code.
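
To make the guess above concrete, here is a minimal sketch that runs the same join against a few made-up rows. The table variables, column types, and sample values are assumptions for illustration only; the names just mirror the question. It also wraps the SUM in COALESCE, because the question's loop leaves a document with no matching terms at 0.0, while a bare SUM over no matches yields NULL.

```sql
-- Hypothetical sample data: names mirror the question,
-- but the types and values are assumed for illustration.
DECLARE @Documents    TABLE (Id NVARCHAR(50) PRIMARY KEY);
DECLARE @Query        TABLE (Id INT PRIMARY KEY, StemId INT, [Weight] FLOAT);
DECLARE @TfidfWeights TABLE (StemId INT, DocumentId NVARCHAR(50), [Weight] FLOAT);

INSERT INTO @Documents (Id) VALUES (N'doc1'), (N'doc2'), (N'doc3');
INSERT INTO @Query (Id, StemId, [Weight]) VALUES (0, 10, 0.5), (1, 20, 2.0);
INSERT INTO @TfidfWeights (StemId, DocumentId, [Weight])
VALUES (10, N'doc1', 0.4), (20, N'doc1', 0.1), (20, N'doc2', 0.3);

-- Every document is paired with every query term; the LEFT JOIN keeps
-- pairs that have no tf-idf row, and those contribute NULL to the SUM.
SELECT
    D.Id,
    COALESCE(SUM(CASE WHEN W.[Weight] > 0 THEN W.[Weight] * Q.[Weight] END), 0.0) AS Score
FROM
    @Documents D
CROSS JOIN @Query Q
LEFT OUTER JOIN @TfidfWeights W ON W.StemId = Q.StemId AND W.DocumentId = D.Id
GROUP BY
    D.Id
ORDER BY
    D.Id;
```

With these values doc1 works out to 0.4 * 0.5 + 0.1 * 2.0, doc2 to 0.3 * 2.0, and doc3 (no matching terms) to 0.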

You should read: https://stackoverflow.com/help/how-to-ask

Answer 2:

The query I came up with is very similar to the one from Tom H.

There are a lot of unknowns about the problem the OP's code is trying to solve. Is there a particular reason the code only checks for rows in the Query table where the Id value is between 0 and one less than the number of rows in the table? Or is the intent really just to get all of the rows from Query?

Here's my version:

INSERT INTO QueryScores (Id, Score)
SELECT d.Id
     , SUM(CASE WHEN w.Weight > 0 THEN w.Weight * q.Weight ELSE NULL END) AS Score
  FROM [Documents] d
 CROSS
  JOIN [Query] q
  LEFT
  JOIN [TfidfWeights] w
    ON w.StemId = q.StemId
   AND w.DocumentId = d.Id
 GROUP BY d.Id
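
If the Id range check in the original loop turns out to be intentional, one way to keep it is to filter Query in a derived table before the cross join. This is only a sketch under that assumption; the table variables and sample rows are made up for illustration, and the row with Id 7 is there to show a term that the original loop would never visit.

```sql
-- Hypothetical sample data: names mirror the question,
-- but the types and values are assumed for illustration.
DECLARE @Documents    TABLE (Id NVARCHAR(50) PRIMARY KEY);
DECLARE @Query        TABLE (Id INT PRIMARY KEY, StemId INT, [Weight] FLOAT);
DECLARE @TfidfWeights TABLE (StemId INT, DocumentId NVARCHAR(50), [Weight] FLOAT);

INSERT INTO @Documents (Id) VALUES (N'doc1'), (N'doc2');
INSERT INTO @Query (Id, StemId, [Weight])
VALUES (0, 10, 0.5), (1, 20, 2.0), (7, 30, 1.0);   -- Id 7 is outside 0..2
INSERT INTO @TfidfWeights (StemId, DocumentId, [Weight])
VALUES (10, N'doc1', 0.4), (30, N'doc1', 0.9), (20, N'doc2', 0.3);

SELECT d.Id
     , SUM(CASE WHEN w.[Weight] > 0 THEN w.[Weight] * q.[Weight] END) AS Score
  FROM @Documents d
 CROSS
  JOIN ( -- Same rows the WHILE loop would visit: 0 <= Id < COUNT(*)
         SELECT StemId, [Weight]
           FROM @Query
          WHERE Id >= 0
            AND Id < (SELECT COUNT(*) FROM @Query)
       ) q
  LEFT
  JOIN @TfidfWeights w
    ON w.StemId = q.StemId
   AND w.DocumentId = d.Id
 GROUP BY d.Id;
```

Here the stem with Id 7 is ignored, so doc1 is scored from StemId 10 only. If the intent is simply "all rows from Query", the filter is unnecessary.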

Processing RBAR (row by agonizing row) is almost always going to be slower than processing as a set. SQL is designed to operate on sets of data. There is overhead for each individual SQL statement, and for each context switch between the procedure and the SQL engine. Sure, there might be room to improve performance of individual parts of the procedure, but the big gain is going to be doing an operation on the entire set, in a single SQL statement.
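
As a rough illustration of that single-statement approach applied to the question's own workflow (QueryScores is filled with document Ids first, then scored), the cursor and the inner WHILE loop could be replaced with one UPDATE. This is a sketch, not a tested drop-in: the table variables and sample values are assumptions, and COALESCE is used so that documents with no matching terms end up with 0.0, as in the original loop.

```sql
-- Hypothetical sample data: names mirror the question,
-- but the types and values are assumed for illustration.
DECLARE @QueryScores  TABLE (Id NVARCHAR(50) PRIMARY KEY, Score FLOAT NULL);
DECLARE @Query        TABLE (Id INT PRIMARY KEY, StemId INT, [Weight] FLOAT);
DECLARE @TfidfWeights TABLE (StemId INT, DocumentId NVARCHAR(50), [Weight] FLOAT);

INSERT INTO @QueryScores (Id) VALUES (N'doc1'), (N'doc2'), (N'doc3');
INSERT INTO @Query (Id, StemId, [Weight]) VALUES (0, 10, 0.5), (1, 20, 2.0);
INSERT INTO @TfidfWeights (StemId, DocumentId, [Weight])
VALUES (10, N'doc1', 0.4), (20, N'doc1', 0.1), (20, N'doc2', 0.3);

-- One statement scores every document: no cursor, no inner loop.
UPDATE s
   SET Score = COALESCE(t.Score, 0.0)
  FROM @QueryScores AS s
  LEFT JOIN (
         SELECT w.DocumentId, SUM(w.[Weight] * q.[Weight]) AS Score
           FROM @Query AS q
           JOIN @TfidfWeights AS w
             ON w.StemId = q.StemId
          WHERE w.[Weight] > 0
          GROUP BY w.DocumentId
       ) AS t
    ON t.DocumentId = s.Id;

SELECT Id, Score FROM @QueryScores ORDER BY Id;
```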

If there's some reason you need to process one document at a time using a cursor, then get rid of the inner loop, the individual SELECTs, and all those PRINT statements, and just use a single query to get the score for each document.

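-- Declarations as in the question
DECLARE @Id NVARCHAR(50)
DECLARE myCursor CURSOR LOCAL FAST_FORWARD FOR
SELECT [Id] FROM [QueryScores]

OPEN myCursor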
FETCH NEXT FROM myCursor INTO @Id
WHILE @@FETCH_STATUS = 0
  BEGIN
    UPDATE [QueryScores] 
       SET Score 
         = (  SELECT SUM( CASE WHEN w.Weight > 0 
                               THEN w.Weight * q.Weight 
                               ELSE NULL END
                     )
                FROM [Query] q
                JOIN [TfidfWeights] w
                  ON w.StemId = q.StemId
               WHERE w.DocumentId = @Id
           )
     WHERE Id = @Id

    FETCH NEXT FROM myCursor INTO @Id

  END
CLOSE myCursor
DEALLOCATE myCursor

Answer 3:

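You might not even need the Documents table, since TfidfWeights already carries the DocumentId (note that documents with no matching term get no row this way):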

INSERT INTO QueryScores (id, score)
SELECT W.DocumentId as [id]
     , SUM(W.[Weight] * Q.[Weight]) as [score]
  FROM Query Q
  JOIN TfidfWeights W 
         ON W.StemId = Q.StemId 
        AND W.[Weight] > 0 
 GROUP BY W.DocumentId

Source: https://stackoverflow.com/questions/35990020/sql-query-with-cursor-optimization

Tags

sql

sql-server

tsql