Python - CSV time oriented Transposing large number of columns to rows

Question

I have many CSV files that are column-oriented, and I need to pre-process them before indexing them.

This is time-oriented data with a very large number of columns, one per "device" (up to 128 columns), like:

LDEV_XXXXXX.csv             
Serial number : XXXXX(VSP)              
From : 2014/06/04 05:58             
To   : 2014/06/05 05:58             
sampling rate : 1               

"No.","time","00:30:00X(X2497-1)","00:30:01X(X2498-1)","00:30:02X(X2499-1)"
"242","2014/06/04 10:00",0,0,0
"243","2014/06/04 10:01",0,0,0
"244","2014/06/04 10:02",9,0,0
"245","2014/06/04 10:03",0,0,0
"246","2014/06/04 10:04",0,0,0
"247","2014/06/04 10:05",0,0,0

My goal is to transpose (if that is the right term) the data into rows, so that I can manipulate it much more efficiently, such as:

"time",device,value
"2014/06/04 10:00","00:30:00X(X2497-1)",0
"2014/06/04 10:00","00:30:01X(X2498-1)",0
"2014/06/04 10:00","00:30:02X(X2499-1)",0
"2014/06/04 10:01","00:30:00X(X2497-1)",0
"2014/06/04 10:01","00:30:01X(X2498-1)",0
"2014/06/04 10:01","00:30:02X(X2499-1)",0
"2014/06/04 10:02","00:30:00X(X2497-1)",9
"2014/06/04 10:02","00:30:01X(X2498-1)",0
"2014/06/04 10:02","00:30:02X(X2499-1)",0

And so on...

Note: I have left the raw data as is (it uses "," as a separator). I also need to delete the first 6 lines and the "No." column, which is of no interest, but that is not the main goal or difficulty.

I have some starting Python code to transpose CSV data, but it doesn't do exactly what I need:

import csv
import sys
infile = sys.argv[1]
outfile = sys.argv[2]

with open(infile) as f:
    reader = csv.reader(f)
    cols = []
    for row in reader:
        cols.append(row)

with open(outfile, 'wb') as f:
    writer = csv.writer(f)
    for i in range(len(max(cols, key=len))):
        writer.writerow([(c[i] if i<len(c) else '') for c in cols])

Note that the number of columns is arbitrary: sometimes a few, and up to 128, depending on the file.

I'm pretty sure this is a common need, but I haven't yet been able to find the exact Python code that does this.

Edit:

To be more precise:

Each timestamp row will be repeated once per device, so the file will have many more lines (multiplied by the number of devices) but only a few columns (timestamp, device, value). The final desired result has been updated :-)

Edit:

I would like to be able to run the script with argument 1 as the input file and argument 2 as the output file :-)

Answer 1:

EDIT: expect quotes (") around No., port the code to Python 2 with a note for Python 3, and remove the debugging print

EDIT2: fixed a bug where the indexes were not incremented

EDIT3: new version allowing the input file to contain multiple headers, each followed by data

I am not sure the csv module is worth using here: your separator is fixed, and no field contains a newline or the separator character, so line.strip().split(',') is enough.
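As a small illustration of that point, here is what a plain split gives for one data row taken from the question's sample. This is only a sketch for Python 3. Note that the quotes stay attached to the fields, which happens to match the desired output, and that the approach only holds as long as no field contains a comma.

```python
# One data row copied from the sample file in the question.
line = '"244","2014/06/04 10:02",9,0,0\n'

fields = line.strip().split(',')
timestamp = fields[1]   # second field; still wrapped in double quotes
values = fields[2:]     # one value per device column

print(fields)
print(timestamp)
print(values)
```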

Here is what I tried:

- skip lines until one begins with "No.", and read the fields after the first two to get the identifiers
- proceed line by line:
  - take the date from the second field
  - print one line for each field after the first two, using the corresponding identifier

Code for Python 2 (for Python 3, remove the first line, from __future__ import print_function):

from __future__ import print_function

class transposer(object):
    def _skip_preamble(self):
        # Skip everything up to the header line; its fields after the
        # first two ("No." and "time") are the device identifiers.
        for line in self.fin:
            if line.strip().startswith('"No."'):
                self.keys = line.strip().split(',')[2:]
                return
        raise Exception('Initial line not found')

    def _do_loop(self):
        for line in self.fin:
            line = line.strip()
            if not line:
                continue  # ignore blank lines, e.g. at the end of the file
            elts = line.split(',')
            dat = elts[1]  # timestamp, second field
            # Emit one "time,device,value" line per device column.
            for ix, val in enumerate(elts[2:]):
                print(dat, self.keys[ix], val, sep=',', file=self.out)

    def transpose(self, ficin, ficout):
        with open(ficin) as fin:
            with open(ficout, 'w') as fout:
                self.do_transpose(fin, fout)

    def do_transpose(self, fin, fout):
        self.fin = fin
        self.out = fout
        self._skip_preamble()
        self._do_loop()

Usage :

t = transposer()
t.transpose('in', 'out')
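The question asks for the input and output paths to be taken from the command line. A minimal sketch of such a script for Python 3 could look like the following. The function name transpose_file is hypothetical, and it reimplements the same split-based logic as a plain function rather than reusing the class above.

```python
import sys


def transpose_file(in_path, out_path):
    """Write one "time,device,value" line per device column of in_path."""
    keys = None  # device identifiers, known once a header line has been seen
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            line = line.strip()
            if line.startswith('"No."'):
                # Header line: drop "No." and "time", keep the device names.
                keys = line.split(",")[2:]
            elif keys is not None and line:
                fields = line.split(",")
                for key, value in zip(keys, fields[2:]):
                    fout.write(",".join((fields[1], key, value)) + "\n")
    if keys is None:
        raise ValueError('header line starting with "No." not found')


# Only run when called as: python script.py infile outfile
if __name__ == "__main__" and len(sys.argv) == 3:
    transpose_file(sys.argv[1], sys.argv[2])
```

Lines before the header are ignored, so the preamble does not need to be removed beforehand.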

If the input file contains multiple headers, the list of keys must be reset on each header:

from __future__ import print_function

class transposer(object):
    def _do_loop(self):
        line_number = 0
        for line in self.fin:
            line_number += 1
            line = line.strip()
            if line.startswith('"No."'):
                # New header: reset the device identifiers.
                self.keys = line.split(',')[2:]
            elif line.startswith('"'):
                # Data line: it must carry one value per known device.
                elts = line.split(',')
                if len(elts) == (len(self.keys) + 2):
                    dat = elts[1]
                    for ix, val in enumerate(elts[2:]):
                        print(dat, self.keys[ix], val, sep=',', file=self.out)
                else:
                    raise Exception("Syntax error line %d expected %d values found %d"
                                    % (line_number, len(self.keys), len(elts) - 2))

    def transpose(self, ficin, ficout):
        with open(ficin) as fin:
            with open(ficout, 'w') as fout:
                self.do_transpose(fin, fout)

    def do_transpose(self, fin, fout):
        self.fin = fin
        self.out = fout
        self.keys = []
        self._do_loop()
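To see the header-resetting behaviour in isolation, here is a small Python 3 sketch that applies the same rules to an in-memory sample with two headers. The generator name iter_long_rows and the device names A, B and C are made up for the illustration.

```python
import io


def iter_long_rows(lines):
    """Yield (time, device, value) tuples; each '"No."' line resets the devices."""
    keys = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if line.startswith('"No."'):
            keys = line.split(",")[2:]
        elif line.startswith('"'):
            fields = line.split(",")
            if len(fields) != len(keys) + 2:
                raise ValueError("line %d: expected %d values, found %d"
                                 % (number, len(keys), len(fields) - 2))
            for key, value in zip(keys, fields[2:]):
                yield fields[1], key, value


# Two header blocks with different device columns (hypothetical data).
sample = io.StringIO(
    'Serial number : XXXXX(VSP)\n'
    '"No.","time","A","B"\n'
    '"1","2014/06/04 10:00",0,1\n'
    '\n'
    '"No.","time","C"\n'
    '"1","2014/06/04 10:00",7\n'
)
rows = list(iter_long_rows(sample))
for row in rows:
    print(",".join(row))
```

Preamble and blank lines are skipped because they do not start with a double quote.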

Answer 2:

First you should get the data into the structure that you want; then you can write it out easily. Also, for CSVs with a complicated structure, it's frequently more useful to open them with a DictReader.

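from csv import DictReader, DictWriter

# csv_path is the input file. The preamble lines before the "No." header are
# assumed to have been removed already, so that the first line is the header.
with open(csv_path) as f:
  table = list(DictReader(f, restval=''))

transformed = []
for row in table:
  # viewkeys() is Python 2; on Python 3 use row.keys() instead.
  # The set difference does not preserve the column order of the file.
  devices = [d for d in row.viewkeys() - {'time', 'No.'}]
  time_rows = [{'time': row['time']} for i in range(len(devices))]
  for i, d in enumerate(devices):
    time_rows[i].update({'device': d, 'value': row[d]})
  transformed += time_rows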

This produces a list like:

[{'device': '00:30:00X(X2497-1)', 'value': '0', 'time': '2014/06/04 10:00'},  
 {'device': '00:30:02X(X2499-1)', 'value': '0', 'time': '2014/06/04 10:00'},  
 {'device': '00:30:01X(X2498-1)', 'value': '0', 'time': '2014/06/04 10:00'},  
 {'device': '00:30:00X(X2497-1)', 'value': '0', 'time': '2014/06/04 10:01'},  
 {'device': '00:30:02X(X2499-1)', 'value': '0', 'time': '2014/06/04 10:01'},  
 {'device': '00:30:01X(X2498-1)', 'value': '0', 'time': '2014/06/04 10:01'},  
 {'device': '00:30:00X(X2497-1)', 'value': '9', 'time': '2014/06/04 10:02'},  
 {'device': '00:30:02X(X2499-1)', 'value': '0', 'time': '2014/06/04 10:02'},  
 {'device': '00:30:01X(X2498-1)', 'value': '0', 'time': '2014/06/04 10:02'},  
 {'device': '00:30:00X(X2497-1)', 'value': '0', 'time': '2014/06/04 10:03'},  
 {'device': '00:30:02X(X2499-1)', 'value': '0', 'time': '2014/06/04 10:03'},  
 {'device': '00:30:01X(X2498-1)', 'value': '0', 'time': '2014/06/04 10:03'},  
 {'device': '00:30:00X(X2497-1)', 'value': '0', 'time': '2014/06/04 10:04'},  
 {'device': '00:30:02X(X2499-1)', 'value': '0', 'time': '2014/06/04 10:04'},  
 {'device': '00:30:01X(X2498-1)', 'value': '0', 'time': '2014/06/04 10:04'},  
 {'device': '00:30:00X(X2497-1)', 'value': '0', 'time': '2014/06/04 10:05'},  
 {'device': '00:30:02X(X2499-1)', 'value': '0', 'time': '2014/06/04 10:05'},  
 {'device': '00:30:01X(X2498-1)', 'value': '0', 'time': '2014/06/04 10:05'}]

which is exactly what we wanted. Then to write it back out you can use a DictWriter.

# you might sort transformed here so that it gets written out in whatever order you like

column_names = ['time', 'device', 'value']
with open(out_path, 'w') as f:
  writer = DictWriter(f, column_names)
  writer.writeheader()
  writer.writerows(transformed)
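For Python 3, and for files that still contain the preamble, the same DictReader approach could be sketched as follows. The function name read_long_rows is hypothetical. The sketch skips everything before the "No." header with itertools.dropwhile, keeps the devices in file order through reader.fieldnames, and shows one possible sort before writing.

```python
import csv
import io
from itertools import dropwhile


def read_long_rows(f):
    """Return a list of {'time', 'device', 'value'} dicts from an open file."""
    # Skip the preamble: every line before the header starting with "No.".
    body = dropwhile(lambda line: not line.startswith('"No."'), f)
    reader = csv.DictReader(body, restval='')
    rows = []
    for row in reader:
        # fieldnames keeps the column order; the first two are "No." and "time".
        for device in reader.fieldnames[2:]:
            rows.append({'time': row['time'], 'device': device,
                         'value': row[device]})
    return rows


# Shortened version of the sample from the question.
sample = io.StringIO(
    'LDEV_XXXXXX.csv\n'
    'sampling rate : 1\n'
    '\n'
    '"No.","time","00:30:00X(X2497-1)","00:30:01X(X2498-1)"\n'
    '"242","2014/06/04 10:00",0,0\n'
    '"244","2014/06/04 10:02",9,0\n'
)
rows = read_long_rows(sample)

# Example sort: group by device, then by time (this timestamp format
# sorts chronologically as plain text because it is zero-padded).
rows.sort(key=lambda r: (r['device'], r['time']))

out = io.StringIO()
writer = csv.DictWriter(out, ['time', 'device', 'value'])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

With real files on Python 3, the csv documentation recommends opening them with newline=''.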

Source: https://stackoverflow.com/questions/24295855/python-csv-time-oriented-transposing-large-number-of-columns-to-rows

Tags

python

csv

transpose