I am dabbling in PowerShell and am completely new to .NET.
I am running a PS script that starts with an empty hash table. The hash table will grow to at least 15,000 to 20,000 entries.
I ran some basic tests with Measure-Command, using a set of 20,000 random words.
The individual results are shown below, but in summary it appears that growing a single hashtable with "+=", which allocates a new one-entry hashtable for every word, is incredibly inefficient :) Although there were some minor efficiency gains among options 2 through 5, in general they all performed about the same.
If I were to choose, I might lean toward option 5 for its simplicity (just a single Add call per string), but all the alternatives I tested seem viable.
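As far as I can tell, the reason is that "+=" never modifies the existing hashtable: it builds a brand-new one containing all of the old entries plus the new one, so every addition copies the whole table. A quick illustration of that behavior (my own aside, not part of the timed tests):

$before = @{ a = 1; b = 2 }
$after = $before + @{ c = 3 }                  # allocates and fills a new hashtable
$before.Count                                  # 2 - the original is untouched
$after.Count                                   # 3
[object]::ReferenceEquals( $before, $after )   # False

The word list and the individual tests: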
$chars = [char[]]('a'[0]..'z'[0])
$words = 1..20KB | foreach {
    $count = Get-Random -Minimum 15 -Maximum 35
    -join (Get-Random $chars -Count $count)
}
# 1) Original, adding to a hashtable with "+=".
# TotalSeconds: ~800
Measure-Command {
    $h = @{}
    $words | foreach { if( $h[$_] -ne $true ) { $h += @{ $_ = $true } } }
}
# 2) Using sharding among sixteen hashtables.
# TotalSeconds: ~3
Measure-Command {
    [hashtable[]]$hs = 1..16 | foreach { @{} }
    $words | foreach {
        $h = $hs[$_.GetHashCode() % 16]
        if( -not $h.ContainsKey( $_ ) ) { $h.Add( $_, $null ) }
    }
}
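A side note on the sharding version (my own observation; it does not change the timing): String.GetHashCode can return a negative value, so the shard index can be negative as well. PowerShell treats a negative array index as counting back from the end, so the lookup still lands on a valid shard and a given word always maps to the same one; the index just isn't the clean 0..15 you might expect.

[hashtable[]]$buckets = 1..16 | foreach { @{} }
$buckets[-1] -eq $buckets[15]    # True: a negative index wraps around to the end
'example'.GetHashCode() % 16     # may be negative; it still picks one consistent shard within the run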
# 3) Using ContainsKey and Add on a single hashtable.
# TotalSeconds: ~3
Measure-Command {
    $h = @{}
    $words | foreach { if( -not $h.ContainsKey( $_ ) ) { $h.Add( $_, $null ) } }
}
# 4) Using ContainsKey and Add on a hashtable constructed with capacity.
# TotalSeconds: ~3
Measure-Command {
    $h = New-Object Collections.Hashtable( 21KB )
    $words | foreach { if( -not $h.ContainsKey( $_ ) ) { $h.Add( $_, $null ) } }
}
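For what it's worth, the 21KB passed to the constructor is just the integer 21504; assuming the capacity argument works as documented for System.Collections.Hashtable (roughly the number of entries the table can hold before it has to grow), that comfortably covers the ~20,480 generated words, so the table should never need to resize during the test.

21KB    # 21504 - capacity given to the Hashtable constructor
20KB    # 20480 - number of words generated by 1..20KB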
# 5) Using HashSet and Add.
# TotalSeconds: ~3
Measure-Command {
    $h = New-Object Collections.Generic.HashSet[string]
    $words | foreach { $null = $h.Add( $_ ) }
}
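If the goal is just a set of strings seen so far, the HashSet also gives you the membership test directly; Add returns $false for a string that is already present. A small usage sketch, separate from the timed run above:

$set = New-Object Collections.Generic.HashSet[string]
'alpha','beta','alpha' | foreach { $null = $set.Add( $_ ) }
$set.Count               # 2 - duplicates are ignored
$set.Contains( 'beta' )  # True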