The problem:
I need to apply multiple regexes to each line of a big log file (like several GB long) , gather non-empty matches and put them all in an array (for seri
I tried to distill your question into a very simple example.
Since there can be "hundreds of thousands of regex matches", I did a large initial allocation of 1 M (1024 * 1024) entries for the matches
slice capacity. A slice is a reference type. A slice header 'struct' has length, a capacity, and a pointer for a total of 24 (3 * 8) bytes on a 64-bit OS. The initial allocation for a slice of 1 M entries is therefore only 24 (24 * 1) MB. If there are more than 1 M entries, a new slice with capacity of 1.25 (1 + 1 / 4) M entries will be allocated and the existing 1 M slice header entries (24 MB) will be copied to it.
In summary, you can avoid much of the the overhead of many append
s by initially over allocating the slice capacity. The bigger memory problem is all the data that is saved and referenced for each match. The far bigger CPU time problem is the time taken to perform the regexp.FindAll
's.
package main
import (
"bufio"
"fmt"
"os"
"regexp"
)
var searches = []*regexp.Regexp{
regexp.MustCompile("configure"),
regexp.MustCompile("unknown"),
regexp.MustCompile("PATH"),
}
var matches = make([][]byte, 0, 1024*1024)
func main() {
logName := "config.log"
log, err := os.Open(logName)
if err != nil {
fmt.Fprintln(os.Stderr, err)
os.Exit(1)
}
defer log.Close()
scanner := bufio.NewScanner(log)
for scanner.Scan() {
line := scanner.Bytes()
for _, s := range searches {
for _, m := range s.FindAll(line, -1) {
matches = append(matches, append([]byte(nil), m...))
}
}
}
if err := scanner.Err(); err != nil {
fmt.Fprintln(os.Stderr, err)
}
// Output matches
fmt.Println(len(matches))
for i, m := range matches {
fmt.Println(string(m))
if i >= 16 {
break
}
}
}