Reading selected column only from CSV file, when all other columns are guaranteed to be identical

Question

I have a bunch of CSV files that I'm trying to concatenate into a single CSV file. The CSV files are separated by a single space and look like this:

'initial', 'pos', 'orientation', 'ratio'
'chr', '106681', '+', '0.06'
'chr', '106681', '+', '0.88'
'chr', '106681', '+', '0.01'
'chr', '106681', '+', '0.02'

As you can see, all the values are the same except for the ratio. The concatenated file I am creating will look like this:

'filename','initial', 'pos', 'orientation', 'ratio1','ratio2','ratio3'
'jon' , 'chr', '106681', '+', '0.06' , '0.88' ,'0.01'

So basically, I'll be iterating through each file, storing only one value of the initial, pos and orientation fields but all the values of the ratio, and updating the table in the concatenated file. This is proving much more confusing than I thought it would be. I have the following piece of code to read the CSV files:

import csv

with open('josh.csv', newline='') as concatenated_file:
    reader = csv.reader(concatenated_file)
    for row in reader:
        print(row)

which gives:

['chrom', 'pos', 'strand', 'meth_ratio']
['chr2', '106681786', '+', '0.06']
['chr2', '106681796', '+', '0.88']
['chr2', '106681830', '+', '0.01']
['chr2', '106681842', '+', '0.02']

It would be really helpful if someone could show me how to store only one value of the initial, pos and orientation fields (because they stay the same) but all the values of the ratio.

Answer 1:

This is a one-liner with pandas.read_csv(). And we can even drop the quoting too:

import pandas as pd

csva = pd.read_csv('a.csv', header=0, quotechar="'", sep=r'\s+')

csva['ratio']
0    0.06
1    0.88
2    0.01
3    0.02
Name: ratio, dtype: float64

A couple of points:

Your separator is actually comma + whitespace, so in that sense it's not plain-vanilla CSV. See "How to make separator in read_csv more flexible?"
Note that we dropped the quoting on the numeric fields by setting quotechar="'".
If you really insist on saving memory (don't), you can drop every column of csva other than 'ratio' after the read_csv call. See the pandas docs.
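
Since the separator is a comma followed by a space, one possible alternative is to keep the default comma separator and let pandas skip the space after it. The following is only a sketch under that assumption: it uses the sample layout from the question, an in-memory string in place of a real file, and `usecols` to restrict the read to the ratio column.

```python
import io

import pandas as pd

# Sample data in the layout shown in the question (stand-in for a real file).
sample = (
    "'initial', 'pos', 'orientation', 'ratio'\n"
    "'chr', '106681', '+', '0.06'\n"
    "'chr', '106681', '+', '0.88'\n"
    "'chr', '106681', '+', '0.01'\n"
    "'chr', '106681', '+', '0.02'\n"
)

# skipinitialspace=True drops the space after each comma;
# quotechar="'" strips the single quotes around every field;
# usecols keeps only the column we care about.
ratios = pd.read_csv(
    io.StringIO(sample),
    quotechar="'",
    skipinitialspace=True,
    usecols=["ratio"],
)["ratio"]

print(ratios.tolist())
```

With a real file, the path would replace `io.StringIO(sample)`.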

Answer 2:

First put it in English terms.

You have to read all those other fields from somewhere, so it might as well be from the first row.

Then, having done that, you need to read the last column from each subsequent row and pack it onto the end of the new row, while ignoring the rest.

So, to turn that into Python:

import csv

# outpath (the output file) and paths (the input files) are assumed to be defined.
with open(outpath, 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for inpath in paths:
        with open(inpath, newline='') as infile:
            reader = csv.reader(infile)

            # Skip the header row
            next(reader)

            # Read all values (including the ratio) from the first data row
            new_row = next(reader)

            # For every subsequent row...
            for row in reader:
                # ... read the ratio, pack it on, ignore the rest
                new_row.append(row[-1])

            writer.writerow(new_row)

I'm not sure the comments actually add anything; I think my Python is easier to follow than my English. :)
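
To get closer to the exact layout the question asks for (a filename column plus numbered ratio headers), the same idea could be wrapped in a small helper. This is a sketch, not a definitive implementation: the `collapse` function and the file name `jon.csv` are hypothetical, the input is an in-memory string, and the reader options assume the quoted, comma-plus-space layout shown in the question.

```python
import csv
import io
import os


def collapse(name, lines):
    """Return (header, row) for one input file given as an iterable of lines.

    The row holds the file name, the shared fields from the first data row,
    and then the ratio of every data row.
    """
    reader = csv.reader(lines, quotechar="'", skipinitialspace=True)
    header = next(reader)  # column names, ratio last
    first = next(reader)   # the first data row supplies the shared fields
    ratios = [first[-1]] + [row[-1] for row in reader if row]
    out_header = ["filename"] + header[:-1] + [
        "ratio%d" % i for i in range(1, len(ratios) + 1)
    ]
    return out_header, [name] + first[:-1] + ratios


# Stand-in for the contents of a hypothetical input file 'jon.csv'.
sample = io.StringIO(
    "'initial', 'pos', 'orientation', 'ratio'\n"
    "'chr', '106681', '+', '0.06'\n"
    "'chr', '106681', '+', '0.88'\n"
    "'chr', '106681', '+', '0.01'\n"
)
name = os.path.splitext(os.path.basename("jon.csv"))[0]
header, row = collapse(name, sample)
print(header)
print(row)
```

If the input files have different numbers of rows, each file yields a different number of ratio columns, so the output header would have to be sized for the longest one.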

It's worth knowing that what you're trying to do here is called "denormalization". From what I can tell, your data will end up with an arbitrary number of ratio columns per row, all of which have the same "meaning", so each row isn't really a value anymore, but a collection of values.

Denormalization is generally considered bad, for a variety of reasons. There are cases where denormalized data is easier or faster to work with; as long as you know that you're doing it, and why, it can be a useful thing to do. Wikipedia has a nice article on database normalization that explains the issues; you might want to read it so you understand what you're doing here, and can make sure that it's the right thing to do.
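
For comparison, a normalized ("long") layout would keep one ratio per output row and use the file name to say which file it came from, so every row has the same fixed set of columns. The sketch below is only an illustration: the `files` dictionary and its contents are hypothetical, and an in-memory buffer stands in for the output file.

```python
import csv
import io

# Hypothetical inputs: file name -> list of data rows (header already removed).
files = {
    "jon": [["chr", "106681", "+", "0.06"], ["chr", "106681", "+", "0.88"]],
    "josh": [["chr2", "106681786", "+", "0.02"]],
}

out = io.StringIO()  # stand-in for an output file opened with newline=''
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["filename", "initial", "pos", "orientation", "ratio"])
for name, rows in files.items():
    for row in rows:
        # One output row per ratio, always with the same five columns.
        writer.writerow([name] + row)

print(out.getvalue())
```

The trade-off is that the shared fields are repeated on every row, but the number of columns no longer depends on how many ratios each file contains.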

Source: https://stackoverflow.com/questions/25902166/reading-selected-column-only-from-csv-file-when-all-other-columns-are-guarantee

Tags

python

csv

file-format