Find the statistical mode(s) of a dataset in PowerShell

Question

This self-answered question is a follow-up to an earlier question.

How can I determine a given dataset's (array's) statistical mode, i.e. the one value or the set of values that occur most frequently?

For instance, in the array 1, 2, 2, 3, 4, 4, 5 there are two modes, 2 and 4, because they are the values that occur most frequently.

Answer 1:

Use a combination of Group-Object, Sort-Object, and ForEach-Object:

# Sample dataset.
$dataset = 1, 2, 2, 3, 4, 4, 5

do { # dummy loop to allow efficient termination of the pipeline
  $dataset | Group-Object | Sort-Object Count -Descending | 
    ForEach-Object -Begin { $topCount = 0 } -Process { 
      if ($_.Count -lt $topCount) { break } # No longer top occurrence count, exit
      $topCount = $_.Count # Store the occurrence count.
      $_.Group[0] # Output the input value represented by the group.
    }
} while ($false)

The above yields 2 and 4, which are the two modes (values occurring most frequently, twice each in this case); the modes are returned in input order.

Note: While this solution is conceptually straightforward, performance with large datasets may be a concern; see the bottom section for an optimization that is possible for certain inputs.

Explanation:

Group-Object groups all inputs by equality.
Sort-Object -Descending sorts the resulting groups by member count in descending fashion (most frequently occurring inputs first).
The ForEach-Object command loops over the sorted groups and outputs the input value that each group represents, but only for the group or groups with the highest occurrence count (frequency).
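
To see what the first two stages hand to ForEach-Object, the intermediate groups can be inspected directly. The following is a minimal sketch using the sample dataset; the variable names $sortedGroups and $summary are illustrative and not part of the original solution.

```powershell
# Sample dataset from above.
$dataset = 1, 2, 2, 3, 4, 4, 5

# Group equal values, then sort the groups by member count, highest first.
$sortedGroups = $dataset | Group-Object | Sort-Object Count -Descending

# Reduce each group to its occurrence count and the value it represents.
$summary = $sortedGroups | ForEach-Object { '{0}x {1}' -f $_.Count, $_.Group[0] }
$summary
```

The two groups with a count of 2 (for the values 2 and 4) come first, followed by the three groups with a count of 1.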

The reason for the dummy do loop is that, as of PowerShell Core 7.0.0-preview.5, there is no direct way to exit a pipeline prematurely once processing of further inputs is no longer desired.

There's a longstanding feature request on GitHub to add support for this.

The workaround is to wrap the pipeline in a dummy loop and exit that loop with break.
Note: Do not use break or continue in a pipeline without an enclosing loop: PowerShell then looks up the call stack for an enclosing loop and, if it finds none, exits the script.

By contrast, while return can be used inside a ForEach-Object block in the pipeline, it only skips to the next input item - it doesn't stop processing of further inputs.
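
The difference between the two can be sketched with a simple number range. This is an illustrative example only; the variable names $viaReturn and $viaBreak are made up for the demonstration.

```powershell
# return ends only the current invocation of the script block,
# so the pipeline moves on to the next input object.
$viaReturn = 1..5 | ForEach-Object {
  if ($_ -eq 3) { return }   # skips 3 only
  $_
}

# break inside a dummy loop stops the pipeline altogether.
$viaBreak = [System.Collections.Generic.List[int]]::new()
do {
  1..5 | ForEach-Object {
    if ($_ -eq 3) { break }  # exits the enclosing do loop
    $viaBreak.Add($_)
  }
} while ($false)

"return: $($viaReturn -join ', ')"
"break:  $($viaBreak -join ', ')"
```

With return, only the value 3 is omitted and 4 and 5 are still processed; with break, nothing from 3 onward is processed.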

Better-performing solution:

If the input elements are uniformly simple numbers or strings (as opposed to complex objects), an optimization is possible:

Group-Object's -NoElement suppresses collecting the individual inputs in each group.
Each group's .Name property reflects the grouping value, but does so as a string, so it must be converted back to its original data type.
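
As a small illustration of the string-valued .Name property and the conversion back to the original type, consider the following sketch. It assumes integer input; the variable names are illustrative.

```powershell
$groups = 1, 2, 2 | Group-Object -NoElement

# The grouping value is exposed as a string.
$name = $groups[0].Name
$name.GetType().Name      # String

# Convert back using the element type of the original data.
$dataType = (1).GetType() # [int]
$value = $dataType::Parse($name)
$value.GetType().Name     # Int32
```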

# Sample dataset.
# Must be composed of all numbers or strings.
$dataset = 1, 2, 2, 3, 4, 4, 5

# Determine the data type of the elements in the set
# (assumed to be homogeneous).
$dataType = $dataset[0].GetType()

do {
  # Note the use of -NoElement
  $dataset | Group-Object -NoElement | Sort-Object Count -Descending | 
    ForEach-Object -Begin { $topCount = 0 } -Process { 
      if ($_.Count -lt $topCount) { break }
      $topCount = $_.Count # Store the occurrence count.
      # Convert the string-valued .Name property
      # back to the original type.
      if ($dataType -eq [string]) {
        $_.Name
      } else {
        $dataType::Parse($_.Name)
      }
    }
} while ($false)
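
For repeated use, either pipeline can be wrapped in a function. The following sketch wraps the general-purpose variant from the top section; the function name Get-Mode and its -Dataset parameter are hypothetical and chosen only for illustration.

```powershell
# Hypothetical helper; the name and the parameter are illustrative only.
function Get-Mode {
  param(
    [Parameter(Mandatory)]
    [object[]] $Dataset
  )
  do { # dummy loop, so that break stops the pipeline rather than the caller
    $Dataset | Group-Object | Sort-Object Count -Descending |
      ForEach-Object -Begin { $topCount = 0 } -Process {
        if ($_.Count -lt $topCount) { break } # No longer top occurrence count, exit
        $topCount = $_.Count                  # Store the occurrence count.
        $_.Group[0]                           # Output the value the group represents.
      }
  } while ($false)
}

$modes = Get-Mode 1, 2, 2, 3, 4, 4, 5   # the two modes, 2 and 4
$words = Get-Mode 'a', 'b', 'b', 'c'    # the single mode, b
```

Because the dummy loop lives inside the function, the break cannot reach a loop in the calling code.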

Source: https://stackoverflow.com/questions/58738500/find-the-statistical-modes-of-a-dataset-in-powershell

Tags

powershell

statistics