问题
I'm looking to calculate statistics for a large data set on Excel and encountering some issues due to data set size.
It seems VBA may be the way to go, as copying AVERAGEIF and STDDEV array functions across data this size is causing long calculation times. Appreciate possible solutions or code that could be used here.
Goals:
- To calculate statistics (avg, std dev, z-scores) conditional on 2 identifiers (e.g. average of all heights at 01/01/10)
- Able to handle large data sets (100k+ data points)
Sample Data:
Date | User ID | Indicator | Data Point
01/01/10| 1 | Height | 150
01/01/10| 1 | Weight | 123
01/01/10| 2 | Height | 146
01/01/10| 2 | Weight | 123
01/02/10| 1 | Height | 156
01/02/10| 1 | Weight | 160
01/02/10| 2 | Height | 103
01/02/10| 2 | Weight | 109
Edit:
Expected output would ideally be as z-scores for each data point in a new column. Example: first z-score would be normalized for all heights on 01/01/10 with:
(150 - avg) / stdev
回答1:
I don't know what is a z-score, since I'm getting same (+/-) value for all data points. But I'm confident that you will be able to modify the code to get what you want. Data is supposed to be located in sheet "Data", in which there is a command button named Go for the code to execute. Beware! Code is clearing all content from column E onwards.
Dim lLastRowDB As Long
Dim dU1 As Object, cU1 As Variant, iU1 As Long, lrU As Long
Dim dU2 As Object, cU2 As Variant, iU2 As Long
Dim MyArray() As Variant
Dim lAV As Double
Dim lSD As Double
Dim i As Integer
Dim j As Integer
Dim k As Integer
Private Sub Go_Click()
Worksheets("Data").Columns("E:EZ").Delete Shift:=xlToLeft 'Clear previous results
lLastRowDB = Worksheets("Data").Cells(2, 1).End(xlDown).Row 'Assuming your data starts in A2
'Indexes from Column1 (Dates)
Set dU1 = CreateObject("Scripting.Dictionary")
lrU = Cells(Rows.Count, 1).End(xlUp).Row
cU1 = Range("A2:A" & lrU)
For iU1 = 1 To UBound(cU1, 1)
dU1(cU1(iU1, 1)) = 1
Next iU1
'Indexes from Column3 (Indicators)
Set dU2 = CreateObject("Scripting.Dictionary")
cU2 = Range("C2:C" & lrU)
j = 0
For iU2 = 1 To UBound(cU2, 1)
dU2(cU2(iU2, 1)) = 1
Next iU2
'If want to see values in dictionaries, uncomment following six lines
'For i = 0 To dU1.Count - 1
' MsgBox "dU1 has " & dU1.Count & " elements and key#" & i & " is " & dU1.Keys()(i)
'Next
'For i = 0 To dU2.Count - 1
' MsgBox "dU2 has " & dU2.Count & " elements and key#" & i & " is " & dU2.Keys()(i)
'Next
'The following code will look in the complete set of data for each index
'This accounts for unsorted data, but is resourse-consuming
'If your data is ordered for shure, just loop the desired rows
For i = 0 To dU1.Count - 1 'for each Date
For j = 0 To dU2.Count - 1 'for each Indicator
ReDim MyArray(1 To 1) As Variant 'reset the array
For k = 2 To lLastRowDB 'Scan all rows
If (Worksheets("Data").Cells(k, 1).Value = dU1.keys()(i)) Then
If (Worksheets("Data").Cells(k, 3).Value = dU2.keys()(j)) Then
MyArray(UBound(MyArray)) = Worksheets("Data").Cells(k, 4).Value 'add found value to array
ReDim Preserve MyArray(1 To UBound(MyArray) + 1) As Variant 'now array is 1 element longer
End If
End If
Next
'Now MyArray contains desired data.
'Get average and SD
lAV = Application.WorksheetFunction.Average(MyArray)
lSD = Application.WorksheetFunction.StDev(MyArray)
'Titles
Worksheets("Data").Cells(1, 5) = "Average"
Worksheets("Data").Cells(1, 6) = "SD"
Worksheets("Data").Cells(1, 7) = "z-scores"
For k = 2 To lLastRowDB
If (Worksheets("Data").Cells(k, 1).Value = dU1.keys()(i)) Then
If (Worksheets("Data").Cells(k, 3).Value = dU2.keys()(j)) Then
Worksheets("Data").Cells(k, 5) = lAV
Worksheets("Data").Cells(k, 6) = lSD
If lSD = 0 Then
Worksheets("Data").Cells(k, 7) = "SD is zero. Unable to calculate z-scores"
Else
Worksheets("Data").Cells(k, 7) = (Worksheets("Data").Cells(k, 4).Value - lAV) / lSD 'z-scores
End If
End If
End If
Next
Next
Next
End Sub
来源:https://stackoverflow.com/questions/35774530/multi-conditional-statistics-avg-std-dev-z-scores-for-large-data-sets-in-exc