Find the statistical mode(s) of a dataset in PowerShell

Submitted by ε祈祈猫儿з on 2020-01-02 07:41:10

Question


This self-answered question is a follow-up to a related question.

How can I determine a given dataset's (array's) statistical mode, i.e. the one value or the set of values that occur most frequently?

For instance, in array 1, 2, 2, 3, 4, 4, 5 there are two modes, 2 and 4, because they are the values occurring most frequently.


Answer 1:


Use a combination of Group-Object, Sort-Object, and ForEach-Object:

# Sample dataset.
$dataset = 1, 2, 2, 3, 4, 4, 5

do { # dummy loop to allow efficient termination of the pipeline
  $dataset | Group-Object | Sort-Object Count -Descending | 
    ForEach-Object -Begin { $topCount = 0 } -Process { 
      if ($_.Count -lt $topCount) { break } # No longer top occurrence count, exit
      $topCount = $_.Count # Store the occurrence count.
      $_.Group[0] # Output the input value represented by the group.
    }
} while ($false)

The above yields 2 and 4, which are the two modes (values occurring most frequently, twice each in this case); the modes are returned in input order.

Note: While this solution is conceptually straightforward, performance with large datasets may be a concern; see the bottom section for an optimization that is possible for certain inputs.

Explanation:

  • Group-Object groups all inputs by equality.

  • Sort-Object -Descending sorts the resulting groups by member count in descending fashion (most frequently occurring inputs first).

  • The ForEach-Object command loops over the sorted groups and, for each group that has the highest occurrence count (there may be one or several), outputs the input value that the group represents.
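
To make the grouping and sorting steps concrete, here is a minimal sketch that lists each distinct value next to its occurrence count. The variable name $groups and the output format are illustrative only and not part of the solution above.

```powershell
# Sample dataset (same as above).
$dataset = 1, 2, 2, 3, 4, 4, 5

# Group equal values, then sort the groups by member count, highest first.
$groups = $dataset | Group-Object | Sort-Object Count -Descending

# Each group object exposes .Count (the frequency), .Name (the value as a string),
# and .Group (the original input objects that make up the group).
$groups | ForEach-Object { '{0} occurs {1} time(s)' -f $_.Name, $_.Count }
```

The groups for 2 and 4 (count 2) come first, followed by the three groups with count 1, which is why the ForEach-Object step can stop at the first group with a lower count.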

The reason for the dummy do loop is that, as of PowerShell Core 7.0.0-preview.5, there is no direct way to exit a pipeline prematurely when processing of further inputs is no longer desired.

There's a longstanding feature request on GitHub to add support for this.

The workaround is to use an enclosing loop and break out of it with break.
Note: Do not use break or continue in a pipeline without an enclosing loop: PowerShell then looks up the call stack for an enclosing loop and, if there is none, exits the script.

By contrast, while return can be used inside a ForEach-Object block in the pipeline, it only skips to the next input item - it doesn't stop processing of further inputs.
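
The difference between return and break can be seen in a small sketch. The counter variables below are hypothetical and exist only to show how many inputs each variant processes.

```powershell
# With 'return', the current iteration ends, but every input is still processed.
$processedWithReturn = 0
$viaReturn = 1..5 | ForEach-Object {
  $processedWithReturn++
  if ($_ -ge 3) { return }  # skips this input only
  $_
}

# With 'break' inside a dummy loop, the pipeline stops at the third input.
$processedWithBreak = 0
$viaBreak = do {
  1..5 | ForEach-Object {
    $processedWithBreak++
    if ($_ -ge 3) { break }  # exits the enclosing do loop
    $_
  }
} while ($false)

"return: output $($viaReturn -join ', '); inputs processed: $processedWithReturn"
"break:  output $($viaBreak -join ', '); inputs processed: $processedWithBreak"
```

Both variants output 1 and 2. The difference is in how much work is done: the return variant still runs the script block for all five inputs, whereas the break variant stops after the third.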


Better-performing solution:

If the input elements are uniformly simple numbers or strings (as opposed to complex objects), an optimization is possible:

  • Group-Object's -NoElement suppresses collecting the individual inputs in each group.

  • Each group's .Name property reflects the grouping value, but does so as a string, so it must be converted back to its original data type.

# Sample dataset.
# Must be composed of all numbers or strings.
$dataset = 1, 2, 2, 3, 4, 4, 5

# Determine the data type of the elements in the set
# (assumed to be homogeneous).
$dataType = $dataset[0].GetType()

do {
  # Note the use of -NoElement
  $dataset | Group-Object -NoElement | Sort-Object Count -Descending | 
    ForEach-Object -Begin { $topCount = 0 } -Process { 
      if ($_.Count -lt $topCount) { break }
      $topCount = $_.Count # Store the occurrence count.
      # Convert the string-valued .Name property
      # back to the original type.
      if ($dataType -eq [string]) {
        $_.Name
      } else {
        $dataType::Parse($_.Name)
      }
    }
} while ($false)
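
If the dummy loop feels awkward, a different technique is to determine the highest occurrence count first and then filter the groups by it. The sketch below is not part of the original answer; the function name Get-Mode and its parameter -InputValues are hypothetical. It always examines every group, so it gives up the early exit.

```powershell
# Hypothetical helper; not part of the original answer.
function Get-Mode {
  param(
    [Parameter(Mandatory)]
    [object[]] $InputValues
  )

  # Group equal values.
  $groups = $InputValues | Group-Object

  # Determine the highest occurrence count across all groups.
  $topCount = ($groups | Measure-Object -Property Count -Maximum).Maximum

  # Output the input value represented by each group that has that count.
  $groups |
    Where-Object { $_.Count -eq $topCount } |
    ForEach-Object { $_.Group[0] }
}

# Usage with the sample dataset; the modes are 2 and 4.
Get-Mode -InputValues 1, 2, 2, 3, 4, 4, 5
```

Because no break is involved, this version needs no enclosing loop and can be called safely from any context.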


Source: https://stackoverflow.com/questions/58738500/find-the-statistical-modes-of-a-dataset-in-powershell
