Chunk a text database into N equal blocks and retain header

Posted by 被刻印的时光 ゝ on 2019-12-02 01:41:33
icyrock.com

You can do something like this:

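# Read the whole file into memory (fine for small files).
with open('file') as fin:
  lines = fin.readlines()

headers = lines[0:1]  # the first line is the header
rest = lines[1:]
chunk_size = 4

def chunks(lst, chunk_size):
  # Yield successive slices of at most chunk_size items.
  for i in range(0, len(lst), chunk_size):
    yield lst[i:i + chunk_size]

def write_rows(rows, fout):
  # Rows keep their trailing newline from readlines().
  for row in rows:
    fout.write(row)

part = 1
for chunk in chunks(rest, chunk_size):
  # Each part starts with the header, then its share of the rows.
  with open('part%d' % part, 'w') as fout:
    write_rows(headers, fout)
    write_rows(chunk, fout)
  part += 1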

Here's a test run:

$ cat file && python mkt.py && for p in part*; do echo ---- $p; cat $p; done
header
1
2
3
4
5
6
7
8
9
10
11
12
13
14
---- part1
header
1
2
3
4
---- part2
header
5
6
7
8
---- part3
header
9
10
11
12
---- part4
header
13
14

Adjust chunk_size to suit your data, and change the headers slice if the file has more than one header line.
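
The question asks for N blocks rather than a fixed number of rows per block. One way to bridge the two, as a minimal sketch: derive chunk_size from the row count with a ceiling division. The helper name chunk_size_for and the parameter n_blocks are made up for illustration and are not part of the script above.

```python
def chunk_size_for(total_rows, n_blocks):
    """Smallest chunk size that fits total_rows into at most n_blocks chunks."""
    if n_blocks <= 0:
        raise ValueError('n_blocks must be positive')
    # Ceiling division without floats: -(-a // b)
    return max(1, -(-total_rows // n_blocks))

# 14 data rows split into 4 blocks gives chunks of 4, 4, 4 and 2 rows,
# which matches the test run above.
print(chunk_size_for(14, 4))  # 4
```

In the script above this would be used as chunk_size = chunk_size_for(len(rest), 4). Note that the blocks are only exactly equal when the row count divides evenly; otherwise the last block is shorter, and for some inputs you can end up with fewer than n_blocks parts.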

Edit: to process the file line by line and avoid loading it all into memory, you can do something like this:

from itertools import islice

headers_count = 5
chunk_size = 250000

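with open('file') as fin:
  # Consume the header lines once; they are repeated in every part.
  headers = list(islice(fin, headers_count))

  part = 1
  while True:
    # Lazily take the next chunk_size lines from the same file iterator.
    line_iter = islice(fin, chunk_size)
    try:
      # Fetch the first line up front so no empty part is created at EOF.
      first_line = next(line_iter)
    except StopIteration:
      break
    with open('part%d' % part, 'w') as fout:
      for line in headers:
        fout.write(line)
      fout.write(first_line)
      for line in line_iter:
        fout.write(line)
    part += 1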
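
The loop above relies on islice pulling from the same file iterator on every pass, so each slice starts where the previous one stopped. Here is a small sketch of that behaviour, with an in-memory list standing in for the file. The helper name take_chunks and the sample data are made up for illustration.

```python
from itertools import islice

def take_chunks(iterable, chunk_size):
    """Yield lists of up to chunk_size items from one shared iterator."""
    it = iter(iterable)
    while True:
        # Each islice call resumes where the previous one left off.
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        yield chunk

lines = ['a\n', 'b\n', 'c\n', 'd\n', 'e\n']
chunks = list(take_chunks(lines, 2))
print(chunks)  # [['a\n', 'b\n'], ['c\n', 'd\n'], ['e\n']]
```

Unlike the script above, this sketch holds one whole chunk in memory at a time, so it shows the iterator behaviour but is not a replacement for the streaming version.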

Test case (put the above in a file called mkt2.py):

Make a file containing a 5-line header followed by 1234567 data lines:

with open('file', 'w') as fout:
  for i in range(5):
    fout.write(10 * ('header %d ' % i) + '\n')
  for i in range(1234567):
    fout.write(10 * ('line %d ' % i) + '\n')

Shell script to test (put in a file called rt.sh):

rm -f part*
echo ---- file
head -n7 file
tail -n2 file

python mkt2.py

for i in part*; do
  echo ---- $i
  head -n7 $i
  tail -n2 $i
done

Sample output:

$ sh rt.sh 
---- file
header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 
header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 
header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 
header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 
header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 
line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0 
line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1 
line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 
line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 
---- part1
header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 
header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 
header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 
header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 
header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 
line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0 line 0 
line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1 line 1 
line 249998 line 249998 line 249998 line 249998 line 249998 line 249998 line 249998 line 249998 line 249998 line 249998 
line 249999 line 249999 line 249999 line 249999 line 249999 line 249999 line 249999 line 249999 line 249999 line 249999 
---- part2
header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 
header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 
header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 
header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 
header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 
line 250000 line 250000 line 250000 line 250000 line 250000 line 250000 line 250000 line 250000 line 250000 line 250000 
line 250001 line 250001 line 250001 line 250001 line 250001 line 250001 line 250001 line 250001 line 250001 line 250001 
line 499998 line 499998 line 499998 line 499998 line 499998 line 499998 line 499998 line 499998 line 499998 line 499998 
line 499999 line 499999 line 499999 line 499999 line 499999 line 499999 line 499999 line 499999 line 499999 line 499999 
---- part3
header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 
header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 
header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 
header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 
header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 
line 500000 line 500000 line 500000 line 500000 line 500000 line 500000 line 500000 line 500000 line 500000 line 500000 
line 500001 line 500001 line 500001 line 500001 line 500001 line 500001 line 500001 line 500001 line 500001 line 500001 
line 749998 line 749998 line 749998 line 749998 line 749998 line 749998 line 749998 line 749998 line 749998 line 749998 
line 749999 line 749999 line 749999 line 749999 line 749999 line 749999 line 749999 line 749999 line 749999 line 749999 
---- part4
header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 
header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 
header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 
header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 
header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 
line 750000 line 750000 line 750000 line 750000 line 750000 line 750000 line 750000 line 750000 line 750000 line 750000 
line 750001 line 750001 line 750001 line 750001 line 750001 line 750001 line 750001 line 750001 line 750001 line 750001 
line 999998 line 999998 line 999998 line 999998 line 999998 line 999998 line 999998 line 999998 line 999998 line 999998 
line 999999 line 999999 line 999999 line 999999 line 999999 line 999999 line 999999 line 999999 line 999999 line 999999 
---- part5
header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 header 0 
header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 header 1 
header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 header 2 
header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 header 3 
header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 header 4 
line 1000000 line 1000000 line 1000000 line 1000000 line 1000000 line 1000000 line 1000000 line 1000000 line 1000000 line 1000000 
line 1000001 line 1000001 line 1000001 line 1000001 line 1000001 line 1000001 line 1000001 line 1000001 line 1000001 line 1000001 
line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 line 1234565 
line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 line 1234566 

Timing of the above:

real    0m0.935s
user    0m0.708s
sys     0m0.200s
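
If you would rather check the result programmatically than by eye, something along these lines could work. It is only a sketch: the function name count_data_lines is made up, and it assumes the part files are the only files matching the glob pattern.

```python
import glob
from itertools import islice

def count_data_lines(pattern, headers):
    """Check each matching file starts with headers; return total data lines."""
    total = 0
    for name in sorted(glob.glob(pattern)):
        with open(name) as fin:
            # The first len(headers) lines must repeat the header exactly.
            found = list(islice(fin, len(headers)))
            if found != headers:
                raise ValueError('unexpected header in %s' % name)
            # Count the remaining lines without loading them into memory.
            total += sum(1 for _ in fin)
    return total
```

For the test case above you would read the first 5 lines of file as headers and call count_data_lines('part*', headers); the total should equal the number of data lines in the input.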

Hope this helps.
