How to read two lines from a file and create dynamic keys in a for-loop, a follow-up

Posted by 泄露秘密 on 2019-12-03 12:37:50

The question is a bit old, but it is interesting because you have a very clear specification and only need help writing the code. I will present a solution that follows a top-down approach, a very well-known method, using plain old Python. It shouldn't be difficult to adapt to pandas.

The top-down approach means to me: if you don't know how to write it, just name it!

You have a file (or a string) as input, and you want to output a file (or a string). It seems quite simple, but you want to merge pairs of rows to build every new row. The idea is:

  1. get the rows of the input, as dictionaries
  2. take them in consecutive pairs (each row with the one that follows it)
  3. build a new row for each pair
  4. output the result

For now, you don't know how to write the generator of rows, and you don't know how to build a new row for each pair either. Don't let these difficulties block you: just name the solutions. Imagine you have a function get_rows and a function build_new_row. Let's write this:

def build_new_rows(f):
    """generate the new rows. Output may be redirected to a file"""
    rows = get_rows(f) # get a generator on rows = dictionaries.
    r1 = next(rows) # store the first row
    for r2 in rows: # for every following row
        yield build_new_row(r1, r2) # yield a new row built of the previous stored row and the current row.
        r1 = r2 # store the current row, which becomes the previous row
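To see which rows the loop above pairs together, here is a small sketch with stand-in helpers. The get_rows and build_new_row below are placeholders invented for this demonstration only (they are not the real functions developed later); they just make the pairing visible.

```python
def get_rows(f):
    """Placeholder: treat every item of f as an already-parsed row."""
    yield from f

def build_new_row(r1, r2):
    """Placeholder: just report which two rows were paired."""
    return (r1, r2)

def build_new_rows(f):
    """Same loop as above: pair each row with the one that follows it."""
    rows = get_rows(f)
    r1 = next(rows)  # store the first row
    for r2 in rows:
        yield build_new_row(r1, r2)
        r1 = r2  # the current row becomes the previous row

pairs = list(build_new_rows(["row1", "row2", "row3", "row4"]))
print(pairs)  # [('row1', 'row2'), ('row2', 'row3'), ('row3', 'row4')]
```

Note that the pairs overlap: with n input rows, the loop yields n - 1 new rows.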

Now, examine the two "missing" functions: get_rows and build_new_row. The function get_rows is quite easy to write. Here's the main part:

header = process_line(next(f))
for line in f:
    yield dict(zip(header, process_line(line)))

where process_line just splits the line on whitespace, e.g. with re.split(r"\s+", line.strip()). Note the raw string: without the r prefix, "\s" is an invalid escape sequence in Python.
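Put together, the two helpers could look like the sketch below. The column names pos, F1_hybrid and M1 come from the rest of this answer, but the sample values are made up purely to show the shape of the rows; I am also assuming the input is an iterable of lines, such as an open file.

```python
import io
import re

def process_line(line):
    """Split a line into fields on runs of whitespace."""
    return re.split(r"\s+", line.strip())

def get_rows(f):
    """Yield each data line as a dict keyed by the header fields."""
    header = process_line(next(f))  # the first line holds the column names
    for line in f:
        yield dict(zip(header, process_line(line)))

# Made-up two-row sample, only to illustrate the output format.
sample = io.StringIO("pos  F1_hybrid  M1\n100  A|C  A/T\n200  G|T  C/C\n")
rows = list(get_rows(sample))
print(rows[0])  # {'pos': '100', 'F1_hybrid': 'A|C', 'M1': 'A/T'}
```

Every value stays a string, which is what the later functions expect.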

The second part is build_new_row. Still the top-down approach: you need to build H0 and H1 from your expected table, and then build the count of H1 for every M and S column according to the conditions you stated. Pretend you have a pipe_compute function that computes H0 and H1, and a build_count function that builds the count of H1 for every M and S column:

def build_new_row(r1, r2):
    """build a row"""
    h0, h1 = pipe_compute(r1["F1_hybrid"], r2["F1_hybrid"])

    # initialize the dict with pos, H0 and H1
    new_row = {"pos":r2["pos"], "H0":h0, "H1":h1}

    for key in r1.keys():
        if key[0] in ("M", "S"):
            new_row[key] = build_count(r1[key], r2[key], h1)

    return new_row

You have almost everything now. Take a look at pipe_compute: it's exactly what you have written in your condition 03.

def pipe_compute(v1, v2):
    """build H0 H1 according to condition 03"""
    xs = v1.split("|")
    ys = v2.split("|")
    return [ys[0]+"g"+xs[0], ys[1]+"g"+xs[1]]
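A quick check of pipe_compute, with made-up input values chosen only for illustration: the value from the current row (v2) comes first, then "g", then the value from the previous row (v1), position by position.

```python
def pipe_compute(v1, v2):
    """build H0 H1 according to condition 03"""
    xs = v1.split("|")
    ys = v2.split("|")
    return [ys[0] + "g" + xs[0], ys[1] + "g" + xs[1]]

# Illustrative values: previous row "A|C", current row "G|T".
h0, h1 = pipe_compute("A|C", "G|T")
print(h0, h1)  # GgA TgC
```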

And for build_count, stick to the top-down approach:

def build_count(v1, v2, to_count):
    """nothing funny here: just follow the conditions"""
    if is_slash_count(v1, v2): # are conditions 01, 02 or 04 true?
        c = slash_count(v1, v2)[to_count] # count how many "to_count" we find in the 2 x 2 table of conditions 01 or 02.
    elif "|" in v1 and "|" in v2: # condition 03
        c = pipe_count(v1, v2)[to_count]
    elif "." in v1 or "." in v2: # condition 05
        return '0'
    else:
        raise Exception(v1, v2)

    return "{}-{}".format(c, to_count) # n-XgY

We are still going down. When is is_slash_count true? When there are two slashes (conditions 01 and 02) or one slash and one pipe (condition 04):

def is_slash_count(v1, v2):
    """conditions 01, 02, 04"""
    return ("/" in v1 and "/" in v2) or ("/" in v1 and "|" in v2) or ("|" in v1 and "/" in v2)
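The predicate can be checked on a few inputs. The values below are invented for illustration; only the separators matter here.

```python
def is_slash_count(v1, v2):
    """conditions 01, 02, 04"""
    return (
        ("/" in v1 and "/" in v2)
        or ("/" in v1 and "|" in v2)
        or ("|" in v1 and "/" in v2)
    )

checks = {
    "two slashes": is_slash_count("A/T", "C/G"),
    "slash then pipe": is_slash_count("A/T", "C|G"),
    "pipe then slash": is_slash_count("A|T", "C/G"),
    "two pipes": is_slash_count("A|T", "C|G"),       # handled by condition 03 instead
    "missing value": is_slash_count(".", "C/G"),     # handled by condition 05 instead
}
for name, result in checks.items():
    print(name, result)
```

Only the first three cases are true; two pipes and a missing value fall through to the later branches of build_count.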

The function slash_count is simply the 2 x 2 table of conditions 01 and 02:

import collections
import re

def slash_count(v1, v2):
    """count according to conditions 01, 02, 04"""
    cnt = collections.Counter()
    for x in re.split("[|/]", v1): # cartesian product
        for y in re.split("[|/]", v2): # cartesian product
            cnt[y+"g"+x] += 1
    return cnt # a dictionary XgY -> count(XgY)
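Here is a small sketch of what the counter contains, using invented values. It also shows why a count of 0 can appear in the output: a Counter returns 0 for a key it has never seen.

```python
import collections
import re

def slash_count(v1, v2):
    """count according to conditions 01, 02, 04"""
    cnt = collections.Counter()
    for x in re.split("[|/]", v1):      # each value of the previous row
        for y in re.split("[|/]", v2):  # combined with each value of the current row
            cnt[y + "g" + x] += 1
    return cnt

homozygous = slash_count("T/T", "C/C")
mixed = slash_count("A/A", "C/T")
print(dict(homozygous))  # {'CgT': 4}
print(dict(mixed))       # {'CgA': 2, 'TgA': 2}
print(mixed["GgG"])      # 0, because a missing key counts as zero
```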

The function pipe_count is even simpler, because you just have to count the result of pipe_compute:

def pipe_count(v1, v2):
    """count according to condition 03"""
    return collections.Counter(pipe_compute(v1, v2))
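With all the helpers in place, build_count can be exercised on each branch. This is only a sketch: the functions are the ones from this answer (lightly reformatted), and the input values are made up so that each condition is hit once.

```python
import collections
import re

def pipe_compute(v1, v2):
    xs, ys = v1.split("|"), v2.split("|")
    return [ys[0] + "g" + xs[0], ys[1] + "g" + xs[1]]

def is_slash_count(v1, v2):
    return ("/" in v1 and "/" in v2) or ("/" in v1 and "|" in v2) or ("|" in v1 and "/" in v2)

def slash_count(v1, v2):
    cnt = collections.Counter()
    for x in re.split("[|/]", v1):
        for y in re.split("[|/]", v2):
            cnt[y + "g" + x] += 1
    return cnt

def pipe_count(v1, v2):
    return collections.Counter(pipe_compute(v1, v2))

def build_count(v1, v2, to_count):
    if is_slash_count(v1, v2):            # conditions 01, 02, 04
        c = slash_count(v1, v2)[to_count]
    elif "|" in v1 and "|" in v2:         # condition 03
        c = pipe_count(v1, v2)[to_count]
    elif "." in v1 or "." in v2:          # condition 05
        return "0"
    else:
        raise Exception(v1, v2)
    return "{}-{}".format(c, to_count)

results = [
    build_count("T/T", "C/C", "CgT"),  # two slashes
    build_count("T/C", "C|T", "CgT"),  # one slash, one pipe
    build_count("T|T", "C|C", "CgT"),  # two pipes
    build_count(".", "C/C", "CgT"),    # missing value
]
print(results)  # ['4-CgT', '1-CgT', '2-CgT', '0']
```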

Now you're done (and down). I get the result below, which is slightly different from what you expected, but you have probably already spotted my mistake(s):

pos       M1     M2     Mk     Mg1    H0     H1     S1     Sk1    S2     Sj
16229783  4-CgT  4-CgT  4-CgT  1-CgT  GgC    CgT    0      1-CgT  1-CgT  1-CgT
16229992  4-AgC  4-AgC  4-AgC  1-AgC  GgG    AgC    2-AgC  2-AgC  2-AgC  1-AgC
16230007  4-TgA  4-TgA  4-TgA  1-TgA  AgG    TgA    2-TgA  2-TgA  2-TgA  0-TgA
16230011  4-GgT  4-GgT  4-GgT  2-GgT  CgA    GgT    1-GgT  1-GgT  1-GgT  1-GgT
16230049  4-AgG  4-AgG  4-AgG  4-AgG  TgC    AgG    1-AgG  0      1-AgG  1-AgG
16230174  0      0      0      4-CgA  TgT    CgA    1-CgA  0      1-CgA  1-CgA
16230190  0      0      0      4-AgC  TgT    AgC    0-AgC  0-AgC  0-AgC  0-AgC
16230260  4-AgA  4-AgA  4-AgA  4-AgA  GgT    AgA    0-AgA  0-AgA  0-AgA  0-AgA
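Step 4 of the plan, writing the result, is not spelled out in the code above. Here is a minimal sketch of one way to do it. The write_rows helper and its explicit columns argument are my own assumptions, not part of the original solution; the sample row reuses values from the first line of the table.

```python
import io

def write_rows(rows, out, columns):
    """Hypothetical helper: write a header, then one tab-separated line per row."""
    out.write("\t".join(columns) + "\n")
    for row in rows:
        out.write("\t".join(row[c] for c in columns) + "\n")

# One row shaped like the dictionaries returned by build_new_row.
rows = [{"pos": "16229783", "H0": "GgC", "H1": "CgT", "M1": "4-CgT"}]
out = io.StringIO()  # a real file opened with open(path, "w") works the same way
write_rows(rows, out, ["pos", "M1", "H0", "H1"])
print(out.getvalue(), end="")
```

Passing the column order explicitly keeps the output layout independent of the order in which keys were added to each dictionary.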


What matters, beyond the solution to this specific problem, is the method I used, which is widely used in software development. The code itself could be improved a lot.
