#PSTip Count occurrences of a word using a hash table

On one of the PowerShell forums, someone asked for help with getting the 15 most used words in a webpage. The core of the answer to that question is an amazingly clever use of a hash table.

PS> $wordList = 'three','three','one','three','two','two'
PS> $wordStatistic = $wordList | ForEach-Object -Begin { $wordCounts=@{} } -Process { $wordCounts.$_++ } -End { $wordCounts }
PS> $wordStatistic

Name  Value
----  -----
one   1
three 3
two   2

The result correctly shows that the word ‘three’ occurs in the word list three times, ‘two’ twice, and ‘one’, not surprisingly, once.

To understand how the trick works, let’s go through it step by step. The first word in the $wordList array, ‘three’, is passed down the pipeline. The $wordCounts hash table, created in the Begin block, is queried for a key named ‘three’, which in our case is represented by the current pipeline object, $_. The value stored under that key is increased by one using the increment operator ‘++’. If the key is not yet present in the hash table, it is created automatically, because incrementing the missing (null) value yields 1. One by one, ForEach-Object processes all the words in the array, incrementing the appropriate key on each iteration, and the End block outputs the finished hash table, which is stored in $wordStatistic. To complete the task, you simply enumerate the key-value pairs of the $wordStatistic hash table using the GetEnumerator() method, sort them by the Value property, and select just the most used words, in our case just one.

$wordStatistic.GetEnumerator() |
Sort-Object -Property Value -Descending |
Select-Object -First 1

Name  Value
----  -----
three 3
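
The same counting logic can be written out longhand to make the automatic key creation visible. What follows is only a sketch: the function name Get-WordCount and its -Word and -Top parameters are hypothetical, introduced here for illustration, and are not part of the original tip.

```powershell
# Hypothetical helper (not from the original tip): counts words with a
# hash table and returns the most frequent entries.
function Get-WordCount {
    param(
        [string[]] $Word,   # words to count
        [int] $Top = 1      # how many of the most used words to return
    )

    $counts = @{}
    foreach ($w in $Word) {
        if ($counts.ContainsKey($w)) {
            # Key already exists: increase its counter
            $counts[$w]++
        }
        else {
            # First occurrence: create the key explicitly.
            # The one-liner gets this step for free, because
            # incrementing a missing (null) value yields 1.
            $counts[$w] = 1
        }
    }

    # Enumerate the key-value pairs, highest count first
    $counts.GetEnumerator() |
        Sort-Object -Property Value -Descending |
        Select-Object -First $Top
}

Get-WordCount -Word 'three','three','one','three','two','two' -Top 2
```

With the sample list, this returns the entries for ‘three’ (3) and ‘two’ (2). For the original forum question you would pass -Top 15. Note that a hash table created with @{} compares keys case-insensitively, so ‘Three’ and ‘three’ are counted under the same key.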

7 Responses to "#PSTip Count occurrences of a word using a hash table"

  1. Kevin Doblosky says:

    Here’s an even easier way to do this:

    $wordList = 'three','three','one','three','two','two'

    $wordList | Group-Object | Select-Object Name, Count

    or, if you’d rather use aliases, and fewer characters:

    $wordList | group | select Name, Count

    • Vladimír Meier says:

      Yes, it’s an easier way and I use it often, but if you have a lot of text to group, it may be better to use Jakub Jares’s tip because it is faster (120 kB of text: with Measure-Command, the Group-Object way took 17 sec and the hash table tip 1 sec).

      • JakubJares says:

        Thanks for having my back. You are right that the cmdlet method is slower, but on my machine it seems to take only twice as long as the hash table method. How did you test? I tested on a 26 kB file; does the size matter that much?

      • Josh says:

        Try the -NoElement switch on Group-Object. Otherwise it has to build up a collection of each occurrence of every word, which could account for the additional processing time.

    • JakubJares says:

      Hello Kevin and Vladimir, thank you for your comments. It is always great to hear back from readers, especially when they provide constructive criticism.

      I never thought about counting the words using Group-Object and Select-Object as you suggested. It definitely is more readable.
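
The Group-Object variant discussed in the comments, including the -NoElement suggestion, can be sketched as follows. This is a minimal illustration on the sample list from the tip; the variable names $top, $groupTime and $hashTime are assumptions made for this example, and the timings will differ from machine to machine and are only meaningful on a much larger word list.

```powershell
$wordList = 'three','three','one','three','two','two'

# -NoElement omits the Group property, so the members of each group
# are not collected; only Name and Count are kept.
$top = $wordList |
    Group-Object -NoElement |
    Sort-Object -Property Count -Descending |
    Select-Object -First 1 -Property Name, Count
$top

# Rough timing comparison of the two approaches (no results implied here)
$groupTime = Measure-Command {
    $wordList | Group-Object -NoElement
}
$hashTime = Measure-Command {
    $wordList | ForEach-Object -Begin { $wordCounts = @{} } -Process { $wordCounts.$_++ } -End { $wordCounts }
}
'Group-Object: {0} ms, hash table: {1} ms' -f $groupTime.TotalMilliseconds, $hashTime.TotalMilliseconds
```

On the sample list, $top holds the group named ‘three’ with a Count of 3. To reproduce the measurements mentioned in the comments, replace $wordList with the words of a larger text.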

© 2016 PowerShell Magazine. All rights reserved.