Efficient appending to a variable-length container of strings (Golang)

逝去的感伤 2021-01-18 13:27

The problem:

I need to apply multiple regexes to each line of a big log file (several GB long), gather the non-empty matches, and put them all in an array (for seri

2 answers
  •  Happy的楠姐
    2021-01-18 13:46

    I tried to distill your question into a very simple example.

    Since there can be "hundreds of thousands of regex matches", I made a large initial allocation for the matches slice: a capacity of 1 M (1024 * 1024) entries. Each entry is itself a slice, and a slice value is only a small header holding a pointer, a length, and a capacity, for a total of 24 (3 * 8) bytes on a 64-bit platform. The initial allocation for 1 M entries is therefore only 24 (24 * 1) MB. If there are more than 1 M entries, append allocates a new, larger backing array (roughly 1.25 (1 + 1 / 4) M entries for a slice of this size; the exact growth factor depends on the Go version) and copies the existing 1 M slice headers (24 MB) to it.
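
    The arithmetic above can be checked with a small sketch. The helper names headerBytes and reallocations are made up for this illustration. How many reallocations happen when starting from zero capacity depends on the Go version, so the sketch prints the count rather than assuming one.

```go
package main

import (
	"fmt"
	"unsafe"
)

// headerBytes returns the size of one []byte slice header on this platform
// (pointer + length + capacity). Hypothetical helper for illustration.
func headerBytes() uintptr {
	var h []byte
	return unsafe.Sizeof(h)
}

// reallocations appends n nil entries to a slice created with the given
// initial capacity and counts how often append moved to a larger backing
// array. Hypothetical helper for illustration.
func reallocations(n, initialCap int) int {
	s := make([][]byte, 0, initialCap)
	count := 0
	for i := 0; i < n; i++ {
		before := cap(s)
		s = append(s, nil)
		if cap(s) != before {
			count++ // capacity only changes when append reallocates
		}
	}
	return count
}

func main() {
	const entries = 1024 * 1024
	fmt.Println("bytes per slice header:", headerBytes())
	fmt.Println("bytes for 1 M headers:", uintptr(entries)*headerBytes())
	fmt.Println("reallocations when preallocated:", reallocations(entries, entries))
	fmt.Println("reallocations starting from zero:", reallocations(entries, 0))
}
```

    With the capacity preallocated, the first 1 M appends never reallocate. Starting from zero capacity, append has to grow and copy the backing array repeatedly along the way.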

    In summary, you can avoid much of the overhead of many appends by initially over-allocating the slice capacity. The bigger memory problem is all the data that is saved and referenced for each match. The far bigger CPU time problem is the time taken to perform the regexp FindAll calls.

    package main
    
    import (
        "bufio"
        "fmt"
        "os"
        "regexp"
    )
    
    var searches = []*regexp.Regexp{
        regexp.MustCompile("configure"),
        regexp.MustCompile("unknown"),
        regexp.MustCompile("PATH"),
    }
    
    // Preallocate room for 1 M matches so that early appends do not reallocate.
    var matches = make([][]byte, 0, 1024*1024)
    
    func main() {
        logName := "config.log"
        log, err := os.Open(logName)
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        defer log.Close()
        scanner := bufio.NewScanner(log)
        for scanner.Scan() {
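            // The slice returned by scanner.Bytes may be overwritten by the next Scan.
            line := scanner.Bytes()
            for _, s := range searches {
                for _, m := range s.FindAll(line, -1) {
                    // m points into line, so save a copy rather than m itself.
                    matches = append(matches, append([]byte(nil), m...))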
                }
            }
        }
        if err := scanner.Err(); err != nil {
            fmt.Fprintln(os.Stderr, err)
        }
        // Output matches
        fmt.Println(len(matches))
        for i, m := range matches {
            fmt.Println(string(m))
            if i >= 16 {
                break
            }
        }
    }
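
    The inner append([]byte(nil), m...) in the program above copies each match, because the bytes returned by scanner.Bytes may be overwritten by a later call to Scan. The sketch below shows what can go wrong without the copy. It simulates the reuse with one explicit buffer instead of a real scanner, and the names word and firstWords are made up for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

var word = regexp.MustCompile(`[a-z]+`)

// firstWords collects the first word of each line while reusing one buffer
// for every line, as a stand-in for the scanner's internal buffer.
// Hypothetical helper; assumes every line fits in 64 bytes.
func firstWords(lines []string, copyMatch bool) []string {
	buf := make([]byte, 0, 64)
	var matches [][]byte
	for _, line := range lines {
		buf = append(buf[:0], line...) // overwrite the previous line in place
		m := word.Find(buf)            // m is a sub-slice of buf
		if m == nil {
			continue
		}
		if copyMatch {
			m = append([]byte(nil), m...) // detach the match from buf
		}
		matches = append(matches, m)
	}
	out := make([]string, len(matches))
	for i, m := range matches {
		out[i] = string(m)
	}
	return out
}

func main() {
	lines := []string{"alpha one", "bravo two"}
	fmt.Println(firstWords(lines, false)) // prints [bravo bravo]
	fmt.Println(firstWords(lines, true))  // prints [alpha bravo]
}
```

    Without the copy, every saved match still points into the shared buffer, so earlier matches change when the next line is loaded. The copy costs one small allocation per match, which is part of the per-match memory cost mentioned in the summary.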
    
