How to use awk variables in regular expressions?

渐次进展 2020-12-03 09:52

I have a file called domain which contains some domains. For example:

google.com
facebook.com
...
yahoo.com

And I have ano

5 answers
  • 2020-12-03 10:15

    awk can match against a variable if you don't use the // regex markers.

    if ( $0 ~ regex ){ print $0; }

    In this case, build up the required regex as a string:

    regex = dom"$"
    

    Then match against the regex variable:

    if ( $1 ~ regex ) {
      domain[dom]+=$2;
    }
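
A minimal, self-contained sketch of the dynamic-regex approach above. The sample site names and counts are invented for illustration, and `sum_for_domain` is a hypothetical helper name. Note that the dot in `dom` is still a regex metacharacter here, so `google.com` would also match a name ending in `googlexcom`.

```shell
# Hypothetical helper: sum the counts of sites whose name ends with the given domain.
# $1 = domain; site lines of the form "name count" are read from stdin.
sum_for_domain() {
    awk -v dom="$1" '
        BEGIN { regex = dom "$" }      # build the regex as a string
        $1 ~ regex { total += $2 }     # dynamic regex: no // markers
        END { print total + 0 }        # "+ 0" prints 0 when nothing matched
    '
}

# Invented sample data, not taken from the question.
printf '%s\n' 'mail.google.com 5' 'www.google.com 13' 'www.facebook.com 37' |
    sum_for_domain google.com
```

With this sample input the two google.com lines are summed and the call prints 18.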
    
  • 2020-12-03 10:15

    The problem with the answers above is that operators such as \< (a word boundary at the beginning of a word, a GNU extension) are awkward to use when the regex is a string instead of a regular expression constant /.../, because the backslash has to be doubled inside a string. Without such a boundary, if you had a domain xyz.com and two sites ab.xyz.com and cd.prefix_xyz.com, the counts of both site entries would be added to xyz.com.

    Here's a solution using awk's pipe and the sed command: ...

    for (dom in domain) {
        while ((getline < "./site") > 0) {
            # let sed replace an occurrence of the domain at the end of the site name
            cmd = "echo '" $1 "' | sed 's/\\<'" dom "'$/NO_VALID_DOM/'"
            cmd | getline x
            close(cmd)
            if (match(x, "NO_VALID_DOM")) {
                domain[dom] += $2
            }
        }
        close("./site")  # this was missing from the original code
    }
    

    ...
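
As a point of comparison, the same boundary check can be sketched without spawning sed, using only POSIX extended-regex syntax in a dynamic regex: the domain must be preceded either by the start of the name or by a dot. This is a swapped-in alternative, not the method of the answer above; the helper name `sum_exact_domain` and the sample lines are invented for illustration.

```shell
# Hypothetical helper: sum counts only for names that are the domain itself
# or a subdomain of it. $1 = domain; "name count" lines are read from stdin.
sum_exact_domain() {
    awk -v dom="$1" '
        BEGIN {
            gsub(/[.]/, "[.]", dom)        # make each dot match literally
            regex = "(^|[.])" dom "$"      # whole labels only, anchored at the end
        }
        $1 ~ regex { total += $2 }
        END { print total + 0 }
    '
}

# Invented sample data based on the names mentioned above.
printf '%s\n' 'ab.xyz.com 3' 'cd.prefix_xyz.com 4' 'xyz.com 1' |
    sum_exact_domain xyz.com
```

Here cd.prefix_xyz.com is skipped because xyz.com is preceded by an underscore rather than a dot, so the call prints 4.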

  • 2020-12-03 10:28

    First of all, the variable is dom, not $dom. Think of $ as an operator that extracts the field whose number is stored in the variable dom.

    Secondly, awk does not interpolate variables between //; whatever is written there is taken as literal regex text.
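
Both points can be checked with a small sketch. The input line is invented, and `demo_dollar` is a hypothetical wrapper name used only to keep the example self-contained.

```shell
# Hypothetical wrapper around a three-field sample line.
demo_dollar() {
    echo 'alpha beta gamma' | awk '
        {
            dom = 2
            print $dom            # field number 2, i.e. "beta"
            print ($0 ~ /dom/)    # 0: /dom/ looks for the literal text "dom"
            pat = "^al"
            print ($0 ~ pat)      # 1: the string stored in pat is used as a regex
        }'
}

demo_dollar
```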

    You want the match() function, whose 2nd argument can be a string that is treated as a regular expression:

    if (match($1, dom "$")) {...}
    

    I would code a solution like:

    awk '
      FNR == NR {domain[$1] = 0; next}
      {
        for (dom in domain) {
          if (match($1, dom "$")) {
            domain[dom] += $2
            break
          }
        }
      }
      END {for (dom in domain) {print dom, domain[dom]}}
    ' domain site 
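
The `FNR == NR` two-file idiom above can be tried end to end with two small temporary files. This is only a sketch: the file contents are invented sample data, `run_totals` is a hypothetical helper name, and a literal suffix comparison with substr() stands in for match(), which sidesteps regex metacharacters in the domain names but, like the regex version, does not check for a label boundary.

```shell
# Hypothetical helper: build sample files, total the counts per domain, print sorted.
run_totals() {
    tmp=$(mktemp -d) || return 1
    printf '%s\n' 'google.com' 'facebook.com' 'yahoo.com' > "$tmp/domain"
    printf '%s\n' 'mail.google.com 5' 'www.google.com 13' 'www.facebook.com 37' > "$tmp/site"
    awk '
        FNR == NR { domain[$1] = 0; next }   # first file: register each domain
        {
            for (dom in domain) {
                # literal suffix test instead of a regex match
                start = length($1) - length(dom) + 1
                if (start >= 1 && substr($1, start) == dom) {
                    domain[dom] += $2
                    break
                }
            }
        }
        END { for (dom in domain) print dom, domain[dom] }
    ' "$tmp/domain" "$tmp/site" | sort       # for-in order is unspecified, so sort
    rm -rf "$tmp"
}

run_totals
```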
    
  • 2020-12-03 10:28

    You clearly want to read the site file once, not once per entry in domain. Fixing that, though, is trivial.

    Equally, variables in awk (other than the fields $0, $1, $2, and so on) are not prefixed with $. In particular, $dom is the field whose number is the value of the variable dom (typically field 0, the whole record, since domain strings convert to the number 0).

    I think you need to find a way to get the domain from the data read from the site file. I'm not sure if you need to deal with sites with country domains such as bbc.co.uk as well as sites in the GTLDs (google.com etc). Assuming you are not dealing with country domains, you can use this:

    BEGIN {
        while ((getline dom < "./domain") > 0) domain[dom] = 0
        FS = "[ .]+"
        while ((getline < "./site") > 0)
        {
            topdom = $(NF-2) "." $(NF-1)
            domain[topdom] += $NF
        }
        for (dom in domain) print dom "  " domain[dom]
    }
    

    In the second while loop, there are NF fields; $NF contains the count, and $1 .. $(NF-1) contain components of the domain. So, topdom ends up containing the top domain name, which is then used to index into the array initialized in the first loop.
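
The field arithmetic described above can be seen in isolation with a short sketch. The input lines are invented and `top_domain` is a hypothetical helper name; the second call illustrates the country-domain caveat mentioned earlier.

```shell
# Hypothetical helper: split on dots and spaces, then print the last two
# name components followed by the count.
top_domain() {
    awk 'BEGIN { FS = "[ .]+" } { print $(NF-2) "." $(NF-1), $NF }'
}

echo 'mail.google.com 5' | top_domain   # fields: mail google com 5
echo 'www.bbc.co.uk 7'   | top_domain   # fields: www bbc co uk 7
```

The first call prints google.com 5, while the second prints co.uk 7, which is why this approach only works when no country domains are involved.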

    Given the data in the question (minus the lines of dots), the output is:

    yahoo.com  0
    facebook.com  37
    google.com  18
    
  • 2020-12-03 10:34

    One way using an awk script:

    BEGIN {
        FS = "[. ]"
        OFS = "."
    }
    
    FNR == NR {
        domain[$1] = $0
        next
    }
    
    FNR < NR {
        if ($2 in domain) {
            for ( i = 2; i < NF; i++ ) {
                if ($i != "") {
                    line = (line ? line OFS : "") $i
                }
            }
            total[line] += $NF
            line = ""
        }
    }
    
    END {
        for (i in total) {
            printf "%s\t%s\n", i, total[i]
        }
    }
    

    Run like:

    awk -f script.awk domain.txt site.txt
    

    Results:

    facebook.com    37
    google.com  18
    