问题
I have a query where I iterate through a table -> for each entry I iterate through another table and then compute some results. I use a cursor for iterating through the table. This query takes ages to complete. Always more than 3 minutes. If I do something similar in C# where the tables are arrays or dictionaries it doesn't even take a second. What am I doing wrong and how can I improve the efficiency?
DELETE FROM [QueryScores]
GO
INSERT INTO [QueryScores] (Id)
SELECT Id FROM [Documents]
DECLARE @Id NVARCHAR(50)
DECLARE myCursor CURSOR LOCAL FAST_FORWARD FOR
SELECT [Id] FROM [QueryScores]
OPEN myCursor
FETCH NEXT FROM myCursor INTO @Id
WHILE @@FETCH_STATUS = 0
BEGIN
DECLARE @Score FLOAT = 0.0
DECLARE @CounterMax INT = (SELECT COUNT(*) FROM [Query])
DECLARE @Counter INT = 0
PRINT 'Document: ' + CAST(@Id AS VARCHAR)
PRINT 'Score: ' + CAST(@Score AS VARCHAR)
WHILE @Counter < @CounterMax
BEGIN
DECLARE @StemId INT = (SELECT [Query].[StemId] FROM [Query] WHERE [Query].[Id] = @Counter)
DECLARE @Weight FLOAT = (SELECT [tfidf].[Weight] FROM [TfidfWeights] AS [tfidf] WHERE [tfidf].[StemId] = @StemId AND [tfidf].[DocumentId] = @Id)
PRINT 'WEIGHT: ' + CAST(@Weight AS VARCHAR)
IF(@Weight > 0.0)
BEGIN
DECLARE @QWeight FLOAT = (SELECT [Query].[Weight] FROM [Query] WHERE [Query].[StemId] = @StemId)
SET @Score = @Score + (@QWeight * @Weight)
PRINT 'Score: ' + CAST(@Score AS VARCHAR)
END
SET @Counter = @Counter + 1
END
UPDATE [QueryScores] SET Score = @Score WHERE Id = @Id
FETCH NEXT FROM myCursor INTO @Id
END
CLOSE myCursor
DEALLOCATE myCursor
The logic is that i have a list of docs. And I have a question/query. I iterate through each and every doc and then have a nested iteration through the query terms/words to find if the doc contains these terms. If it does then I add/multiply pre-calculated scores.
回答1:
The problem is that you're trying to use a set-based language to iterate through things like a procedural language. SQL requires a different mindset. You should almost never be thinking in terms of loops in SQL.
From what I can gather from your code, this should do what you're trying to do in all of those loops, but it does it in a single statement in a set-based manner, which is what SQL is good at.
INSERT INTO QueryScores (id, score)
SELECT
D.id,
SUM(CASE WHEN W.[Weight] > 0 THEN W.[Weight] * Q.[Weight] ELSE NULL END)
FROM
Documents D
CROSS JOIN Query Q
LEFT OUTER JOIN TfidfWeights W ON W.StemId = Q.StemId AND W.DocumentId = D.id
GROUP BY
D.id
Of course, without a description of your requirements or sample data with expected output I don't know if this is actually what you're looking to get, but it's my best guess given your code.
You should read: https://stackoverflow.com/help/how-to-ask
回答2:
The query I came up with is very similar to the one from Tom H.
There's a lot of unknowns about the problem OP code is trying to solve. Is there a particular reason the code only checks for rows in the Query
table where the Id
value is between 0 and one less than the number of rows in the table? Or is the intent really just to get all of the rows from Query
?
Here's my version:
INSERT INTO QueryScores (Id, Score)
SELECT d.Id
, SUM(CASE WHEN w.Weight > 0 THEN w.Weight * q.Weight ELSE NULL END) AS Score
FROM [Documents] d
CROSS
JOIN [Query] q
LEFT
JOIN [TfidfWeights] w
ON w.StemId = q.StemId
AND w.DocumentId = d.Id
GROUP BY d.Id
Processing RBAR (row by agonizing row) is almost always going to be slower than processing as a set. SQL is designed to operate on sets of data. There is overhead for each individual SQL statement, and for each context switch between the procedure and the SQL engine. Sure, there might be room to improve performance of individual parts of the procedure, but the big gain is going to be doing an operation on the entire set, in a single SQL statement.
If there's some reason you need to process one document at a time, using a cursor, then get rid of the loops and individual selects and all those PRINT, and just use a single query to get the score for the document.
OPEN myCursor
FETCH NEXT FROM myCursor INTO @Id
WHILE @@FETCH_STATUS = 0
BEGIN
UPDATE [QueryScores]
SET Score
= ( SELECT SUM( CASE WHEN w.Weight > 0
THEN w.Weight * q.Weight
ELSE NULL END
)
FROM [Query] q
JOIN [TfidfWeights] w
ON w.StemId = q.StemId
WHERE w.DocumentId = @Id
)
WHERE Id = @Id
FETCH NEXT FROM myCursor INTO @Id
END
CLOSE myCursor
DEALLOCATE myCursor
回答3:
You might not even need documents
INSERT INTO QueryScores (id, score)
SELECT W.DocumentId as [id]
, SUM(W.[Weight] + Q.[Weight]) as [score]
FROM Query Q
JOIN TfidfWeights W
ON W.StemId = Q.StemId
AND W.[Weight] > 0
GROUP BY W.DocumentId
来源:https://stackoverflow.com/questions/35990020/sql-query-with-cursor-optimization