What is the fastest way to read a 10 GB file from disk?

你的背包 2021-02-20 01:19

We need to read and count different types of messages/run some statistics on a 10 GB text file, e.g. a FIX engine log. We use Linux, 32-bit, 4 CPUs, Intel, coding in Perl, but the language doesn't really matter.

13 Answers
  • 2021-02-20 01:51

    Perhaps you've already read this forum thread, but if not:

    http://www.perlmonks.org/?node_id=512221

    It describes using Perl to do it line-by-line, and the users seem to think Perl is quite capable of it.

    Also, is it possible to read the file from a RAID array? If you have several mirrored disks, the read speed can be improved. Contention for disk resources may be why your multi-threaded attempt doesn't work.

    Best of luck.
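
    For reference, a minimal line-by-line sketch in Perl (the file name and the SOH-delimited 35=<MsgType> field are assumptions about the log format):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Count FIX message types (tag 35) line by line. The file name and
        # the SOH (\x01) field delimiter are assumptions about the format.
        my %count;
        open my $fh, '<', 'fix_engine.log' or die "open: $!";
        while (my $line = <$fh>) {
            $count{$1}++ if $line =~ /\x0135=([^\x01]+)\x01/;
        }
        close $fh;
        printf "%-6s %d\n", $_, $count{$_} for sort keys %count;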

  • 2021-02-20 01:51

    Parse the file once, reading line by line. Put the results in a table in a decent database. Run as many queries as you wish. Feed the beast regularly with new incoming data.

    Realize that manipulating a 10 GB file, transferring it across the network (even a local one), and exploring complicated solutions all take time.
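
    A sketch of that one-pass-then-query approach, assuming DBD::SQLite and the same hypothetical log format; the real schema would follow whatever statistics you need:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;    # assumes the DBD::SQLite driver is installed

        # Single pass over the log, then persist the counts so later
        # questions become cheap SQL queries instead of another 10 GB scan.
        my %count;
        open my $fh, '<', 'fix_engine.log' or die "open: $!";
        while (<$fh>) {
            $count{$1}++ if /\x0135=([^\x01]+)\x01/;    # tag 35 = MsgType (format assumption)
        }
        close $fh;

        my $dbh = DBI->connect('dbi:SQLite:dbname=fixlog.db', '', '',
                               { RaiseError => 1, AutoCommit => 0 });
        $dbh->do('CREATE TABLE IF NOT EXISTS msg_count (type TEXT, n INTEGER)');
        my $ins = $dbh->prepare('INSERT INTO msg_count (type, n) VALUES (?, ?)');
        $ins->execute($_, $count{$_}) for keys %count;
        $dbh->commit;
        $dbh->disconnect;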

  • 2021-02-20 01:52

    Basically you need to "divide and conquer": if you have a network of computers, copy the 10 GB file to as many client PCs as possible and have each client PC read a different offset of the file. For an added bonus, get each PC to use multiple threads on top of the distributed reading.
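
    A rough per-machine sketch of reading just one byte range of the file (offsets and file name are placeholders; assumes a Perl built with large-file support, which matters on 32-bit):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Process only the byte range [$start, $end); each machine (or
        # process) gets its own slice of the 10 GB file.
        my ($file, $start, $end) = ('fix_engine.log', 0, 2_500_000_000);

        open my $fh, '<', $file or die "open: $!";
        seek $fh, $start, 0 or die "seek: $!";
        <$fh> if $start > 0;    # drop the partial line; the previous slice owns it
        while (my $line = <$fh>) {
            # ... count / collect statistics on $line here ...
            last if tell($fh) >= $end;    # lines past $end belong to the next slice
        }
        close $fh;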

  • 2021-02-20 01:55

    Have you thought of streaming the file and filtering any interesting results out to a secondary file? (Repeat until you have a manageably sized file.)
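
    Something like this, where the pattern for "interesting" (here execution reports, 35=8) is only an example:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # One streaming pass that keeps only the interesting lines in a
        # much smaller secondary file for further analysis.
        open my $in,  '<', 'fix_engine.log'  or die "open: $!";
        open my $out, '>', 'interesting.log' or die "open: $!";
        while (my $line = <$in>) {
            print {$out} $line if $line =~ /\x0135=8\x01/;
        }
        close $in;
        close $out or die "close: $!";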

  • 2021-02-20 01:56

    Most of the time you will be I/O bound, not CPU bound, so just read the file through normal Perl I/O and process it in a single thread. Unless you can prove that the I/O outpaces your single-CPU work, don't waste your time on anything more elaborate. In any case, you should ask: why on earth is this in one huge file? Why on earth don't they split it up sensibly when they generate it? That would be an order of magnitude more worthwhile. Then you could put the pieces on separate I/O channels and use more CPUs (assuming you don't use some sort of RAID 0 or NAS or ...).

    Measure, don't assume. Don't forget to flush the caches before each test. Remember that sequential I/O is an order of magnitude faster than random I/O.
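
    One way to measure is to time a plain sequential pass (drop the page cache first, e.g. sync; echo 3 > /proc/sys/vm/drop_caches as root, so you measure the disk and not RAM); the file name is a placeholder:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Time::HiRes qw(gettimeofday tv_interval);

        # Time one sequential pass so you know the real I/O ceiling
        # before adding threads, machines, or databases.
        my $t0    = [gettimeofday];
        my $bytes = 0;

        open my $fh, '<', 'fix_engine.log' or die "open: $!";
        while (my $line = <$fh>) {
            $bytes += length $line;
        }
        close $fh;

        my $secs = tv_interval($t0);
        printf "%.0f MB in %.1f s = %.1f MB/s\n",
            $bytes / 1e6, $secs, $bytes / 1e6 / $secs;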

  • 2021-02-20 01:56

    Hmmm, but what's wrong with the read() call in C? It usually has a 2 GB limit per call, so just call it five times in sequence. That should be fairly fast.
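
    The Perl counterpart of that chunked read is sysread in a loop; a sketch (the 8 MB chunk size and file name are arbitrary):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Pull the file in large fixed-size blocks with sysread, the
        # Perl equivalent of calling C's read() repeatedly.
        my $CHUNK = 8 * 1024 * 1024;
        my $total = 0;
        my $buf;

        open my $fh, '<:raw', 'fix_engine.log' or die "open: $!";
        while (1) {
            my $n = sysread $fh, $buf, $CHUNK;
            die "read: $!" unless defined $n;
            last unless $n;    # 0 bytes read => end of file
            $total += $n;
            # ... scan $buf; beware of records split across chunk boundaries ...
        }
        close $fh;
        print "read $total bytes\n";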
