Processing a sub-list of variable size within a larger list

Question

I'm a biological engineering PhD student teaching myself Python to automate part of my research, but I've run into a problem with processing sub-lists within a bigger list that I can't seem to solve.

Basically, I'm trying to write a small script that processes a CSV file containing a list of plasmid sequences that I'm building using various DNA assembly methods, and then spits out the primer sequences that I need to order to build each plasmid.

Here's the scenario that I'm dealing with:

When I want to build a plasmid, I have to enter the full sequence of that plasmid into my Excel spreadsheet. I have to choose between two DNA assembly methods, called "Gibson" and "iPCR". Each "iPCR" assembly only requires one line in the list, so I know how to process those guys already, as I just have to put the full sequence of the plasmid I'm trying to build in one cell. "Gibson" assemblies, on the other hand, require me to split up the full DNA sequence into smaller chunks, so sometimes I need 2-5 lines within the Excel spreadsheet to fully describe one plasmid.

So I end up with a spreadsheet that looks something like this:

Construct  Strategy  Name
1          Gibson    P(OmpC)-cI::P(cI)-LacZ controller
1          Gibson    P(OmpC)-cI::P(cI)-LacZ controller
1          Gibson    P(OmpC)-cI::P(cI)-LacZ controller
2          iPCR      P(cpcG2)-K1F controller with K1F pos. feedback
3          Gibson    P(cpcG2)-K1F controller with swapped promoter positions
3          Gibson    P(cpcG2)-K1F controller with swapped promoter positions
4          iPCR      P(cpcG2)-K1F controller with stronger K1F RBS library

I think the list at this length is representative enough.

So the problem I'm running into is, I'd like to be able to run through the list and process the Gibsons, but I can't seem to get the code to work the way I want. Here's the code I've written so far:

#import BioPython Tools
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

#import csv tools
import csv
import sys
import os

with open('constructs-to-make.csv', 'rU') as constructs:
    construct_list = csv.reader(constructs, delimiter=',')
    construct_list.next()
    construct_number = 1
    primer_list = []
    temp_list = []
    counter = 2

    for row in construct_list:
        print('Current row is row number ' + str(counter))
        print('Current construct number is ' + str(construct_number))
        print('Current assembly type is ' + row[1])
        if row[1] == "Gibson": #here, we process the Gibson assemblies first
            print('Current construct number is: #' + row[0] + ' on row ' + str(counter) + ', which is a Gibson assembly')
##            print(int(row[0]))
##            print(row[3])
            if int(row[0]) == construct_number:
                print('Adding DNA sequence from row ' + str(counter) + ' for construct number ' + row[0])
                temp_list.append(str(row[3]))
                counter += 1
            if int(row[0]) > construct_number:
                print('Current construct number is ' + str(row[0]) + ', which is greater than the current construct number, ' + str(construct_number))
                print('Therefore, going to work on construct number ' + str(construct_number))
                for part in temp_list: #process the primer design work here
                    print('test')
##                    print(part)
                construct_number += 1
                temp_list = []
                print('Adding DNA from row #' + str(counter) + ' from construct number ' + str(construct_number))
                temp_list.append(row)
                print('Next construct number is number ' + str(construct_number))
                counter += 1
##            counter += 1
        if str(row[1]) == "iPCR":
            print('Current construct number is: ' + row[0] + ' on row ' + str(counter) + ', which is an iPCR assembly.')
            #process the primer design work here
            #get first 60 nucleotides from the sequence
            sequence = row[3]
            fw_primer = sequence[:60]
            print('Sequence of forward primer:')
            print(fw_primer)
            last_sixty = sequence[-60:]
##            print(last_sixty)
            re_primer = Seq(last_sixty).reverse_complement()
            print('Sequence of reverse primer:')
            print(re_primer)
            #ending code: add 1 to counter and construct number
            counter += 1
            construct_number += 1
##            if int(row[0]) == construct_number:
##        else:
##            counter += 1
##            construct_number += 1
##    print(temp_list)

##        for row in temp_list:
##    print(temp_list)        
##    print(temp_list[-1])
#                fw_primer = temp_list[counter - 1].

(I know the code probably looks noob - I've never done any programming class beyond introductory Java.)

The problem with this code is that if I have n "constructs" (a.k.a. plasmids) that I'm trying to build by "Gibson" assembly, it will process the first n-1 plasmids, but not the last one. I can't think of a better way to write this code, but for the workflow I'm trying to implement, knowing how to process n "things" in a list, where each "thing" spans a variable number of rows, would come in really handy.

I'd really appreciate anybody's help here! Thanks a lot!

Answer 1:

Just some general coding help with Python. If you haven't read PEP 8, do so.

To keep the code clear, it can help to give names to the field indices you reference in a record/row.

I would add something like this for any field referenced:

construct_idx = 0

Also, I would recommend using string formatting; it's cleaner.

So:

print('Current construct number is: #{} on row {}, which is a Gibson assembly'.format(row[construct_idx], counter))

Instead of:

print('Current construct number is: #' + row[0] + ' on row ' + str(counter) + ', which is a Gibson assembly')

If you're creating a csv reader object, giving its variable a name ending in "*_list" can be misleading. Calling it "*_reader" is more intuitive.

construct_reader = csv.reader(constructs, delimiter=',')

Instead of:

construct_list = csv.reader(constructs, delimiter=',')
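
Putting those three suggestions together might look roughly like the sketch below. It assumes Python 3 (so `next(reader)` rather than `reader.next()`); the column constants, the `describe_rows` function and the in-memory sample data are made-up names for illustration, not part of the original script.

```python
import csv
import io

# Named column indices (hypothetical names) instead of bare numbers.
CONSTRUCT_IDX = 0
STRATEGY_IDX = 1

# Stand-in for the real CSV file so the example runs on its own.
SAMPLE_CSV = """Construct,Strategy,Name
1,Gibson,P(OmpC)-cI::P(cI)-LacZ controller
2,iPCR,P(cpcG2)-K1F controller with K1F pos. feedback
"""


def describe_rows(csv_file):
    """Return one formatted message per data row."""
    construct_reader = csv.reader(csv_file, delimiter=',')
    next(construct_reader)  # skip the header row
    messages = []
    # Data rows start on line 2 of the file, right after the header.
    for counter, row in enumerate(construct_reader, start=2):
        messages.append(
            'Construct #{} on row {} uses {} assembly'.format(
                row[CONSTRUCT_IDX], counter, row[STRATEGY_IDX]))
    return messages


for message in describe_rows(io.StringIO(SAMPLE_CSV)):
    print(message)
```

Using `enumerate` here also removes the need to bump `counter` by hand in every branch.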

Answer 2:

The problem with this code is that if I have n "constructs" (a.k.a. plasmids) that I'm trying to build by "Gibson" assembly, it will process the first n-1 plasmids, but not the last one.

This is actually a general problem, and the simplest way around it is to add a check after the loop, like this:

for row in construct_list:
    do all your existing code
if we have a current Gibson list:
    repeat the code to process it.

Of course you don't want to repeat yourself… so you move that work into a function, which you call in both places.
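
As a rough sketch of that pattern (assuming Python 3): the rows here are simplified to hypothetical (construct number, strategy, sequence) tuples, and `process_gibson` is only a placeholder for the real primer design work.

```python
def process_gibson(construct_number, parts):
    """Placeholder for the real primer design; returns a summary string."""
    return 'construct {}: {} part(s)'.format(construct_number, len(parts))


def collect_gibson_results(rows):
    """Collect consecutive Gibson rows per construct and process each group."""
    results = []
    current_number = None
    temp_list = []
    for number, strategy, sequence in rows:
        if strategy != 'Gibson':
            continue  # iPCR rows would be handled separately
        if temp_list and number != current_number:
            # The construct number changed, so the previous group is complete.
            results.append(process_gibson(current_number, temp_list))
            temp_list = []
        current_number = number
        temp_list.append(sequence)
    # The check after the loop: the last group never sees a "next" row.
    if temp_list:
        results.append(process_gibson(current_number, temp_list))
    return results


rows = [
    ('1', 'Gibson', 'AAAA'),
    ('1', 'Gibson', 'CCCC'),
    ('2', 'iPCR', 'GGGG'),
    ('3', 'Gibson', 'TTTT'),
    ('3', 'Gibson', 'ACGT'),
]
print(collect_gibson_results(rows))
```

Without the final `if temp_list:` block, construct 3 would be silently dropped, which is exactly the n-1 behaviour described in the question.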

However, I'd probably write this differently, using groupby. I know this will probably seem "way too advanced" at first glance, but it's worth trying to see if you can understand it, because it makes things a lot simpler.

import itertools

def get_construct(row):
    return row[0]

for construct, rows in itertools.groupby(construct_list, key=get_construct):
    group = list(rows)

groupby hands you each construct number together with an iterator over its rows. Once you turn that iterator into a list, you have each construct as a separate list, so you don't need the temp_list at all. For example, the first group will be (csv.reader returns every field as a string, and the sequence column is left out here):

[['1', 'Gibson', 'P(OmpC)-cI::P(cI)-LacZ controller'],
 ['1', 'Gibson', 'P(OmpC)-cI::P(cI)-LacZ controller'],
 ['1', 'Gibson', 'P(OmpC)-cI::P(cI)-LacZ controller']]

The next will be:

[['2', 'iPCR', 'P(cpcG2)-K1F controller with K1F pos. feedback']]

And there won't be a left-over group at the end to worry about.

So:

for construct, rows in itertools.groupby(construct_list, key=get_construct):
    group = list(rows)
    construct_strategy = group[0][1]
    if construct_strategy == "Gibson":
        # your existing code, using group instead of temp_list,
        # and no need to maintain temp_list at all
        pass
    elif construct_strategy == "iPCR":
        # your existing code, using group[0] instead of row
        pass

Once you get over the abstraction hurdle, it's a whole lot simpler to think about the problem this way.

In fact, once you start to grasp iterators intuitively, you'll start finding that itertools (and the recipes on its docs page, and the third-party library more_itertools, and similar code you can write yourself) turn a lot of complicated questions into very simple ones. The answer to "How do I keep track of the current group of matching rows within a list of rows?" is "Keep a temporary list, and remember to check it every time the group changes and then check again at the end for leftovers", but the answer to the equivalent question "How do I transform row iteration into row-group iteration?" is "Wrap the iterator in groupby."
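
To see the grouping in action without the CSV file, here is a small self-contained sketch, assuming Python 3. The rows are hypothetical stand-ins with shortened names. Note that groupby yields (key, iterator) pairs, so each pair is unpacked, and the key is the construct number so that two adjacent constructs with the same strategy stay separate.

```python
import itertools

# Hypothetical rows: (construct number, strategy, name). Every field is a
# string, as it would be coming out of csv.reader.
rows = [
    ['1', 'Gibson', 'controller A'],
    ['1', 'Gibson', 'controller A'],
    ['1', 'Gibson', 'controller A'],
    ['2', 'iPCR', 'controller B'],
    ['3', 'Gibson', 'controller C'],
    ['3', 'Gibson', 'controller C'],
    ['4', 'iPCR', 'controller D'],
]


def get_construct(row):
    return row[0]


def summarize(rows):
    """Return (construct, strategy, number of rows) for each construct."""
    summary = []
    for construct, group_iter in itertools.groupby(rows, key=get_construct):
        # Materialize the group before moving on to the next one.
        group = list(group_iter)
        summary.append((construct, group[0][1], len(group)))
    return summary


print(summarize(rows))
```

One thing to keep in mind: groupby only merges adjacent rows with the same key, so all rows of a construct have to sit next to each other in the file, as they do in the spreadsheet above.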

You might also want to add an assert or other check that all(row[1] == construct_strategy for row in group[1:]), that len(group) == 1 in the iPCR case, that there is no unexpected third strategy, etc., so that when you inevitably run into an error, it'll be easier to tell whether it was bad data or bad code.
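
Those checks could be collected into one helper, sketched here for Python 3. `check_group` is a hypothetical name, and it assumes each group is a list of rows with the strategy in column 1. It raises ValueError instead of using assert, since assert statements are skipped when Python runs with -O.

```python
KNOWN_STRATEGIES = ('Gibson', 'iPCR')


def check_group(construct, group):
    """Return the group's strategy, or raise ValueError on suspicious data."""
    strategy = group[0][1]
    if strategy not in KNOWN_STRATEGIES:
        raise ValueError(
            'construct {}: unexpected strategy {!r}'.format(construct, strategy))
    if not all(row[1] == strategy for row in group[1:]):
        raise ValueError(
            'construct {}: rows mix different strategies'.format(construct))
    if strategy == 'iPCR' and len(group) != 1:
        raise ValueError(
            'construct {}: iPCR should have exactly one row, found {}'.format(
                construct, len(group)))
    return strategy


print(check_group('1', [['1', 'Gibson', 'A'], ['1', 'Gibson', 'B']]))
```

A failure then names the offending construct, which points at the spreadsheet rather than at the script.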

Meanwhile, instead of using a csv.reader, skipping the first row, and referring to the columns by meaningless numbers, it might be better to use a DictReader:

with open('constructs-to-make.csv', 'rU') as constructs:
    primer_list = []
    def get_construct(row):
        return row["Construct"]
    for construct, rows in itertools.groupby(csv.DictReader(constructs), key=get_construct):
        group = list(rows)
        # same as before, but with
        # ... row["Construct"] instead of row[0]
        # ... row["Strategy"] instead of row[1]
        # ... row["Name"] instead of row[2]
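
Here is a runnable version of the DictReader idea, assuming Python 3 and using an in-memory string in place of constructs-to-make.csv. The sample data and the `group_by_construct` name are made up for illustration. It groups on the "Construct" column, because grouping on "Strategy" would merge two neighbouring constructs that happen to use the same method.

```python
import csv
import io
import itertools

# Stand-in for the real file; the header names match the spreadsheet columns.
SAMPLE_CSV = """Construct,Strategy,Name
1,Gibson,P(OmpC)-cI::P(cI)-LacZ controller
1,Gibson,P(OmpC)-cI::P(cI)-LacZ controller
2,iPCR,P(cpcG2)-K1F controller with K1F pos. feedback
"""


def get_construct(row):
    return row['Construct']


def group_by_construct(csv_file):
    """Return (construct, strategy, number of rows) for each construct."""
    reader = csv.DictReader(csv_file)  # consumes the header row itself
    grouped = []
    for construct, group_iter in itertools.groupby(reader, key=get_construct):
        group = list(group_iter)
        grouped.append((construct, group[0]['Strategy'], len(group)))
    return grouped


print(group_by_construct(io.StringIO(SAMPLE_CSV)))
```

With a real file on Python 3, `open('constructs-to-make.csv', newline='')` is what the csv module documentation recommends in place of the 'rU' mode.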

Source: https://stackoverflow.com/questions/14063387/processing-a-sub-list-of-variable-size-within-a-larger-list

Tags

python

nested-lists

dna-sequence