reading and parsing a TSV file, then manipulating it for saving as CSV (*efficiently*)

故事扮演 提交于 2019-11-27 09:34:56

问题


My source data is in a TSV file, 6 columns and greater than 2 million rows.

Here's what I'm trying to accomplish:

  1. I need to read the data in 3 of the columns (3, 4, 5) in this source file
  2. The fifth column is an integer. I need to use this integer value to duplicate a row entry with using the data in the third and fourth columns (by the number of integer times).
  3. I want to write the output of #2 to an output file in CSV format.

Below is what I came up with.

My question: is this an efficient way to do it? It seems like it might be intensive when attempted on 2 million rows.

First, I made a sample tab separate file to work with, and called it 'sample.txt'. It's basic and only has four rows:

Row1_Column1    Row1-Column2    Row1-Column3    Row1-Column4    2   Row1-Column6
Row2_Column1    Row2-Column2    Row2-Column3    Row2-Column4    3   Row2-Column6
Row3_Column1    Row3-Column2    Row3-Column3    Row3-Column4    1   Row3-Column6
Row4_Column1    Row4-Column2    Row4-Column3    Row4-Column4    2   Row4-Column6

then I have this code:

import csv 

with open('sample.txt','r') as tsv:
    AoA = [line.strip().split('\t') for line in tsv]

for a in AoA:
    count = int(a[4])
    while count > 0:
        with open('sample_new.csv', 'a', newline='') as csvfile:
            csvwriter = csv.writer(csvfile, delimiter=',')
            csvwriter.writerow([a[2], a[3]])
        count = count - 1

回答1:


You should use the csv module to read the tab-separated value file. Do not read it into memory in one go. Each row you read has all the information you need to write rows to the output CSV file, after all. Keep the output file open throughout.

import csv

with open('sample.txt', newline='') as tsvin, open('new.csv', 'w', newline='') as csvout:
    tsvin = csv.reader(tsvin, delimiter='\t')
    csvout = csv.writer(csvout)

    for row in tsvin:
        count = int(row[4])
        if count > 0:
            csvout.writerows([row[2:4] for _ in range(count)])

or, using the itertools module to do the repeating with itertools.repeat():

from itertools import repeat
import csv

with open('sample.txt', newline='') as tsvin, open('new.csv', 'w', newline='') as csvout:
    tsvin = csv.reader(tsvin, delimiter='\t')
    csvout = csv.writer(csvout)

    for row in tsvin:
        count = int(row[4])
        if count > 0:
            csvout.writerows(repeat(row[2:4], count))


来源:https://stackoverflow.com/questions/13992971/reading-and-parsing-a-tsv-file-then-manipulating-it-for-saving-as-csv-efficie

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!