Processing Apache logs quickly

猫巷女王i 2021-01-03 01:21

I'm currently running an awk script to process a large (8.1GB) access-log file, and it's taking forever to finish. In 20 minutes, it wrote 14MB of the (1000 +- 500)MB I expect.

5 Answers
  •  日久生厌
    2021-01-03 02:15

    This little Python script handles ~400MB worth of copies of your example line in about 3 minutes on my machine, producing ~200MB of output (keep in mind your sample line was quite short, so that's a handicap):

    import time
    
    src = open('x.log', 'r')
    dest = open('x.csv', 'w')
    
    for line in src:
        # the IP address is everything up to the first space
        ip = line[:line.index(' ')]
        # the timestamp sits between '[' and ']'; slicing off the last 6
        # characters drops the ' -0700'-style timezone offset
        date = line[line.index('[') + 1:line.index(']') - 6]
        # mktime() interprets the struct_time as local time, so the zone
        # offset is effectively ignored here
        t = time.mktime(time.strptime(date, '%d/%b/%Y:%X'))
        dest.write(ip)
        dest.write(',')
        dest.write(str(int(t)))
        dest.write('\n')
    
    src.close()
    dest.close()
    

    A minor problem is that it doesn't handle timezones (strptime() problem), but you could either hardcode that or add a little extra to take care of it.
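
    For example, a minimal sketch of that "little extra" (assuming the usual Apache timestamp field, e.g. 10/Oct/2000:13:55:36 -0700, and a hypothetical to_epoch() helper) could parse the time as UTC with calendar.timegm() and apply the numeric offset by hand:

    import calendar
    import time
    
    def to_epoch(stamp):
        # stamp looks like '10/Oct/2000:13:55:36 -0700'
        date, offset = stamp[:-6], stamp[-5:]
        # first interpret the date/time as if it were UTC
        t = calendar.timegm(time.strptime(date, '%d/%b/%Y:%X'))
        # then shift by the offset: '-0700' means 7 hours behind UTC
        seconds = int(offset[1:3]) * 3600 + int(offset[3:5]) * 60
        return t - seconds if offset[0] == '+' else t + seconds

    In the loop you would then slice out the whole bracketed field (line[line.index('[') + 1:line.index(']')]) and pass it to to_epoch() instead of calling mktime().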

    But to be honest, something as simple as that should be just as easy to rewrite in C.
