Using Select-String for checking two .txt files in powershell

Posted by a 夏天 on 2021-02-17 03:49:02

Question


I am completely new to writing PowerShell scripts. So far I have been using plain batch, as this is a requirement at my company. Inside this batch file I use nested for loops to compare two .txt files. In detail, I want to do the following:

  • File 1 contains lots of strings. Each string is on a separate line, preceded by a number and a semicolon, like so: 658;RMS
  • File 2 is some long text.

The aim is to count the number of occurrences of each string from File 1 in File 2, e.g. RMS is counted 300 times.

As my previous code has some huge drawbacks concerning runtime (File 1 has approx. 400 lines and File 2 has 500,000), I read that Select-String from PowerShell is much more efficient. However, after reading some tutorials it is still not clear to me how to proceed here, apart from the fact that I have to run the PowerShell code inside my .bat. My biggest problem is that I am not sure how and where to place my 'variables', i.e. the two input files 1 and 2.

So far I was testing the Select-String method like this:

powershell -command "& {Select-String -Path *.txt -Pattern "RMS"}"

My assumption would be to make use of piping, so something like this:

powershell -command "& {<<path to file one, should read line by line>> | Select-String -Path File2.txt -Pattern "value of file 1"}"

However, I cannot get this to work. Is PowerShell expecting some kind of PSObject before the first pipe?


Answer 1:


For optimal performance, I would approach this task like so.

  • Read the file with the terms as a CSV (it is a CSV, with a ; delimiter)
  • Read the other file into a string
  • For each term, count how often it can be found in the target string (using .IndexOf())

For example

$data = Import-Csv "file1.txt" -Delimiter ";" -Header ID,Term 
$target = Get-Content "file2.txt" -Raw
$counts = @{}

foreach ($term in $data.Term) {
    $index = -1
    $count = 0
    do {
        $index = $target.IndexOf($term, $index + 1)
        if ($index -gt -1) { $count++ } else { break; }
    } while ($true);
    $counts[$term] = $count
}

$counts 

Notes

  • By default, Import-Csv uses the first line of the input file as the header. The -Header parameter supplies column names for a file that has none; if your file already has a header line, you can remove the -Header parameter.
  • Get-Content reads the input file into an array of lines by default. But for this approach, having the entire file as one big string is the right thing - that's what -Raw does.
  • @{} creates an empty hashtable
  • $data.Term will access one column of the CSV
  • .IndexOf() is case-sensitive. By default, PowerShell is case-insensitive, but native .NET methods like this one will not change their behavior. This might or might not be what you need - use .ToLower() on both $target and $term if you don't care about case.
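
As an alternative to lowercasing both strings, .IndexOf() has an overload that takes a System.StringComparison value. The sketch below wraps the counting loop from the answer in a function; the name Get-OccurrenceCount and its parameters are hypothetical and not part of the original answer. Like the loop above, it advances by one character after each hit, so overlapping occurrences are counted.

```powershell
# Hypothetical helper (not from the answer): counts occurrences of $Term in $Text.
# The comparison defaults to OrdinalIgnoreCase; pass Ordinal for case-sensitive counting.
function Get-OccurrenceCount {
    param(
        [string]$Text,
        [string]$Term,
        [System.StringComparison]$Comparison = [System.StringComparison]::OrdinalIgnoreCase
    )

    # An empty term would match at every position, so treat it as "no occurrences".
    if ([string]::IsNullOrEmpty($Term)) { return 0 }

    $count = 0
    $index = $Text.IndexOf($Term, 0, $Comparison)
    while ($index -ge 0) {
        $count++
        # Continue searching one character after the previous hit.
        $index = $Text.IndexOf($Term, $index + 1, $Comparison)
    }
    $count
}

# Made-up sample text for illustration
$sample = 'RMS rms Rms xRMSx'
$ignoreCase    = Get-OccurrenceCount -Text $sample -Term 'rms'
$caseSensitive = Get-OccurrenceCount -Text $sample -Term 'RMS' -Comparison Ordinal
"ignore case: $ignoreCase, case-sensitive: $caseSensitive"
```

An ordinal comparison also avoids culture-sensitive matching, which may help on very large inputs, although that has not been measured here.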



Answer 2:


Select-String is useful, but it isn't magic :)

With performance in mind, I would approach it like this:

  • For each line in File2:
    • Test for occurrences of all terms in File1

This way, you only need to read and evaluate File2 once:

# prepare hashtable to keep track of count
$count = @{}

# read terms to search for from file1
$termsToFind = Get-Content .\file1 |ForEach-Object {
  $_ -split ';' |Select -Last 1
}

# loop over lines in file2, count the words we're searching for
Get-Content .\test\file2 |ForEach-Object {
  foreach($term in $termsToFind){
    # Using `Regex.Matches()` will help us find multiple occurrences of the same term
    $count[$term] += [regex]::Matches($_,"\b$([regex]::Escape($term))\b").Count
  }
}

Now $count will be a hashtable where the key is the term from file1, and the value is the count of each word.

Output to the same format as file1 with:

$count.GetEnumerator() |ForEach-Object { $_.Value,$_.Key -join ';' } |Set-Content output.txt
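
Note that the \b anchors in the pattern above restrict counting to whole words, and that the static [regex]::Matches() method is case-sensitive unless told otherwise. The following sketch, using a made-up sample line, compares the three variants so you can pick the one that fits your data:

```powershell
# Made-up sample line for illustration
$line = 'RMS and RMSE differ; RMS again, rms too.'
$term = 'RMS'

# Same pattern construction as in the loop above: escaped term between word boundaries
$pattern = "\b$([regex]::Escape($term))\b"

# Whole words only, case-sensitive (the behavior of the loop above)
$wholeWord = [regex]::Matches($line, $pattern).Count

# Whole words only, ignoring case
$ignoreCase = [regex]::Matches($line, $pattern, [System.Text.RegularExpressions.RegexOptions]::IgnoreCase).Count

# Plain substring search, case-sensitive: also hits the RMS inside RMSE
$substring = [regex]::Matches($line, [regex]::Escape($term)).Count

"whole word: $wholeWord, ignore case: $ignoreCase, substring: $substring"
```

For this sample line the whole-word count is 2, the case-insensitive whole-word count is 3, and the substring count is 3.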



Answer 3:


If you check the docs, the -Pattern parameter of Select-String does not accept pipeline input. You can use parentheses to make the output of a command become the pattern argument:

powershell select-string -pattern (get-content file1) -path file2    

Because -Pattern is positional parameter 0 and -Path is positional parameter 1, the parameter names can be omitted. -Pattern also accepts an array, which is what Get-Content returns here:

powershell select-string (get-content file1) file2  
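
The commands above list the matching lines but do not count anything, and they use each whole line of file1 (number and semicolon included) as a pattern. One possible way to get per-term counts with Select-String is sketched below. It is an assumption-laden illustration rather than part of the answer: the temp file names and sample contents are made up, and the terms are combined into a single escaped alternation so that -AllMatches can report every hit on a line.

```powershell
# Sample data written to temp files so the sketch is self-contained (contents are made up)
$file1 = Join-Path ([System.IO.Path]::GetTempPath()) 'terms-demo.txt'
$file2 = Join-Path ([System.IO.Path]::GetTempPath()) 'text-demo.txt'
Set-Content -Path $file1 -Value '658;RMS', '659;FFT'
Set-Content -Path $file2 -Value 'RMS then FFT then RMS', 'no match here', 'FFT only'

# Keep only the part after the first semicolon of each line in file1
$terms = Get-Content $file1 | ForEach-Object { ($_ -split ';', 2)[1] }

# Escape each term and join them into one alternation, e.g. RMS|FFT
$pattern = ($terms | ForEach-Object { [regex]::Escape($_) }) -join '|'

# -AllMatches returns every hit per line; group the matched text to get counts
$result = Select-String -Path $file2 -Pattern $pattern -AllMatches |
    ForEach-Object { $_.Matches } |
    Group-Object -Property Value -NoElement

# Print in the count;term form
$result | ForEach-Object { '{0};{1}' -f $_.Count, $_.Name }

# Clean up the demo files
Remove-Item $file1, $file2
```

Two caveats: terms that never occur do not appear in the result at all, and if one term is contained in another, the alternation only counts whichever alternative matches first at a given position. Select-String and Group-Object are both case-insensitive by default, so RMS and rms end up in the same group.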


Source: https://stackoverflow.com/questions/62003411/using-select-string-for-checking-two-txt-files-in-powershell
