Is it possible to speed-up python IO?

后端未结

关注

 8  965

Consider this python program:

import sys

lc = 0
for line in open(sys.argv[1]):
    lc = lc + 1

print lc, sys.argv[1]

Running it on my 6GB

相关标签:

8条回答

轻奢々

2020-12-16 16:22
You can't get any faster than the maximum disk read speed.

In order to reach the maximum disk speed you can use the following two tips:
1. Read the file in with a big buffer. This can either be coded "manually" or simply by using io.BufferedReader ( available in python2.6+ ).
2. Do the newline counting in another thread, in parallel.
0 讨论(0)
发布评论:

提交评论
- 加载中...
野的像风

2020-12-16 16:22

as others have said - "no"

Almost all of your time is spent waiting for IO. If this is something that you need to do more than once, and you have a machine with tons of ram, you could keep the file in memory. If your machine has 16GB of ram, you'll have 8GB available at /dev/shm to play with.

Another option: If you have multiple machines, this problem is trivial to parallelize. Split the it among multiple machines, each of them count their newlines, and add the results.

0 讨论(0)
发布评论:

提交评论
- 加载中...
悲&欢浪女

2020-12-16 16:27
The trick is not to make electrons move faster (that's hard to do) but to get more work done per unit of time.

First, be sure your 6GB file read is I/O bound, not CPU bound.

If It's I/O bound, consider the "Fan-Out" design pattern.
- A parent process spawns a bunch of children.
- The parent reads the 6Gb file, and deals rows out to the children by writing to their STDIN pipes. The 6GB read time will remain constant. The row dealing should involve as little parent processing as possible. Very simple filters or counts should be used.
  
  A pipe is an in-memory channel for communication. It's a shared buffer with a reader and a writer.
- Each child reads a row from STDIN, and does appropriate work. Each child should probably write a simple disk file with the final (summarized, reduce) results. Later, the results in those files can be consolidated.
0 讨论(0)
发布评论:

提交评论
- 加载中...
既然无缘

2020-12-16 16:28

plain "no".

You've pretty much reached maximum disk speed.

I mean, you could mmap the file, or read it in binary chunks, and use .count('\n') or something. But that is unlikely to give major improvements.

0 讨论(0)
发布评论:

提交评论
- 加载中...
野的像风

2020-12-16 16:43

Throw hardware at the problem.

As gs pointed out, your bottleneck is the hard disk transfer rate. So, no you can't use a better algorithm to improve your time, but you can buy a faster hard drive.

Edit: Another good point by gs; you could also use a RAID configuration to improve your speed. This can be done either with hardware or software (e.g. OS X, Linux, Windows Server, etc).

Governing Equation

(Amount to transfer) / (transfer rate) = (time to transfer)

(6000 MB) / (60 MB/s) = 100 seconds

(6000 MB) / (125 MB/s) = 48 seconds

Hardware Solutions

The ioDrive Duo is supposedly the fastest solution for a corporate setting, and "will be available in April 2009".

Or you could check out the WD Velociraptor hard drive (10,000 rpm).

Also, I hear the Seagate Cheetah is a good option (15,000 rpm with sustained 125MB/s transfer rate).

0 讨论(0)
发布评论:

提交评论
- 加载中...
花落未央

2020-12-16 16:43
2 minutes sounds about right to read an entire 6gb file. Theres not really much you can do to the algorithm or the OS to speed things up. I think you have two options:
1. Throw money at the problem and get better hardware. Probably the best option if this project is for your job.
2. Don't read the entire file. I don't know what your are trying to do with the data, so maybe you don't have any option but to read the whole thing. On the other hand if you are scanning the whole file for one particular thing, then maybe putting some metadata in there at the start would be helpful.
0 讨论(0)
发布评论:

提交评论
- 加载中...

1 2 下一页