In the following data, I am trying to run a simple Markov model.
Say I have data with the following structure:
pos M1 M2 M3 M4 M5 M6 M7 M8 hybrid_block S1 S2 S3 S4 S5 S6 S7 S8
1 A T T A A G A C A|C C G C T T A G A
2 T G C T G T T G T|A A T A T C A A T
3 C A A C A G T C C|G G A C G C G C G
4 G T G T A T C T G|T C T T T A T C T
Block M represents data from one set of categories, and so does block S.
The data are strings made by connecting the letters down each column, position by position. So the string for M1 is A-T-C-G, and likewise for every other column.
There is also one hybrid block that holds two strings, read in the same way. I want to find which string in the hybrid block most likely came from which block (M vs. S).
I am trying to build a Markov model that can identify this. In this example I can tell that, in the hybrid block, ATCG came from block M and CAGT came from block S.
I am breaking the problem into different parts to read and mine the data:
Problem level 01:
- First I read the first line (the header) and create `unique keys` for all the columns.
- Then I read the 2nd line (`pos` with value 1) and create another key. In the same line I read the value from `hybrid_block` and read the string values in it. The pipe `|` is just a separator, so the two strings are at index 0 and 2, as `A` and `C`. So, all I want from this line is a
defaultdict(<class 'dict'>, {'M1': ['A'], 'M2': ['T'], 'M3': ['T']...., 'hybrid_block': ['A'], ['C']...}
As I progress through the lines, I want to append the string values from each column and finally create:
defaultdict(<class 'dict'>, {'M1': ['A', 'T', 'C', 'G'], 'M2': ['T', 'G', 'A', 'T'], 'M3': ['T', 'C', 'A', 'G']...., 'hybrid_block': ['A', 'T', 'C', 'G'], ['C', 'A', 'G', 'T']...}
Problem level 02:
I read the data in `hybrid_block` for the first line, which are `A` and `C`.
Now, I want to create `keys`, but unlike fixed keys, these keys will be generated while reading the data from `hybrid_block`. For the first line, since there is no preceding line, the keys will simply be `AgA` and `CgC`, which mean (A given A, and C given C), and for the values I count the number of `A` and `C` in `block M` and `block S`. So, the data will be stored as:
defaultdict(<class 'dict'>, {'M': {'AgA': [4], 'CgC': [1]}, 'S': {'AgA': 2, 'CgC': 2}}
As I read through the other lines, I want to create new keys based on the strings in the hybrid block and count the number of times each string was present in the M vs. S block given the string in the preceding line. That means the keys while reading line 2 would be `TgA`, which means (T given A), and `AgC`. For the values inside these keys, I count the number of times I found `T` in this line after `A` in the previous line, and the same for `AgC`.
The defaultdict after reading 3 lines would be:
defaultdict(<class 'dict'>, {'M': {'AgA': 4, 'TgA':3, 'CgT':2}, {'CgC': [1], 'AgC':0, 'GgA':0}, 'S': {'AgA': 2, 'TgA':1, 'CgT':0}, {'CgC': 2, 'AgC':2, 'GgA':2}}
I understand this looks too complicated. I went through several dictionary and defaultdict tutorials but couldn't find a way of doing this.
A solution to either part, if not both, is highly appreciated.
pandas setup
from io import StringIO
import pandas as pd
import numpy as np
txt = """pos M1 M2 M3 M4 M5 M6 M7 M8 hybrid_block S1 S2 S3 S4 S5 S6 S7 S8
1 A T T A A G A C A|C C G C T T A G A
2 T G C T G T T G T|A A T A T C A A T
3 C A A C A G T C C|G G A C G C G C G
4 G T G T A T C T G|T C T T T A T C T """
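# sep=r'\s+' splits on any whitespace; it replaces delim_whitespace=True,
# which recent pandas releases deprecate
df = pd.read_csv(StringIO(txt), sep=r'\s+', index_col='pos')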
df
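If the goal is the plain dictionary from Problem level 01 rather than a DataFrame, a minimal sketch without pandas could look like the following. The function name `read_columns` and the keys `H0` and `H1` for the two halves of `hybrid_block` are my own choices for illustration; the question does not specify how the two hybrid strings should be stored.

```python
from collections import defaultdict

txt = """pos M1 M2 M3 M4 M5 M6 M7 M8 hybrid_block S1 S2 S3 S4 S5 S6 S7 S8
1 A T T A A G A C A|C C G C T T A G A
2 T G C T G T T G T|A A T A T C A A T
3 C A A C A G T C C|G G A C G C G C G
4 G T G T A T C T G|T C T T T A T C T"""


def read_columns(text):
    """Return {column name: [letters in position order]}."""
    lines = text.strip().splitlines()
    header = lines[0].split()[1:]        # drop the 'pos' column name
    columns = defaultdict(list)
    for line in lines[1:]:
        values = line.split()[1:]        # drop the position number
        for name, value in zip(header, values):
            if name == 'hybrid_block':
                # 'A|C' holds two strings; H0/H1 are assumed key names
                left, right = value.split('|')
                columns['H0'].append(left)
                columns['H1'].append(right)
            else:
                columns[name].append(value)
    return columns


columns = read_columns(txt)
print(columns['M1'])                  # ['A', 'T', 'C', 'G']
print(columns['H0'], columns['H1'])
```

Using `defaultdict(list)` means each key is created the first time a column is seen, so the header does not need a separate initialisation pass.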
solution
mostly pandas with some numpy
- split hybrid column
- prepend identical first row
- add with shifted version of self to get `'AgA'` type strings
d1 = pd.concat([df.loc[[1]].rename(index={1: 0}), df])
d1 = pd.concat([
    df.filter(like='M'),
    df.hybrid_block.str.split('|', expand=True).rename(columns='H{}'.format),
    df.filter(like='S')
], axis=1)
d1 = pd.concat([d1.loc[[1]].rename(index={1: 0}), d1])
d1 = d1.add('g').add(d1.shift()).dropna()
d1
Assign convenient blocks to their own variable names
m = d1.filter(like='M')
s = d1.filter(like='S')
h = d1.filter(like='H')
Count how many are in each block and concatenate
mcounts = pd.DataFrame(
    (m.values[:, :, None] == h.values[:, None, :]).sum(1),
    h.index, h.columns
)
scounts = pd.DataFrame(
    (s.values[:, :, None] == h.values[:, None, :]).sum(1),
    h.index, h.columns
)
counts = pd.concat([mcounts, scounts], axis=1, keys=['M', 'S'])
counts
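The broadcast comparison can be hard to read at first: `m.values[:, :, None] == h.values[:, None, :]` compares every column of a block with both hybrid columns at each position, and `.sum(1)` counts the matches across the block's columns. As a rough illustration of the same logic, here is a plain-Python loop. The `rows` layout and the name `transition_counts` are assumptions made for this sketch, not part of the pandas code above.

```python
# One tuple per position: (M letters, hybrid pair, S letters),
# copied from the example table.
rows = [
    ("ATTAAGAC", "AC", "CGCTTAGA"),
    ("TGCTGTTG", "TA", "ATATCAAT"),
    ("CAACAGTC", "CG", "GACGCGCG"),
    ("GTGTATCT", "GT", "CTTTATCT"),
]


def transition_counts(rows):
    """For each position, count how many columns of each block show the
    same 'current given previous' transition as each hybrid string."""
    counts = {'M': {'H0': [], 'H1': []}, 'S': {'H0': [], 'H1': []}}
    previous = rows[0]                   # the first row is paired with itself
    for current in rows:
        blocks = {'M': (previous[0], current[0]),
                  'S': (previous[2], current[2])}
        for k, name in enumerate(('H0', 'H1')):
            key = current[1][k] + 'g' + previous[1][k]      # e.g. 'TgA'
            for block, (prev_letters, cur_letters) in blocks.items():
                # same role as .sum(1): matches across the block's columns
                n = sum(c + 'g' + p == key
                        for p, c in zip(prev_letters, cur_letters))
                counts[block][name].append((key, n))
        previous = current
    return counts


counts = transition_counts(rows)
print(counts['M']['H0'])
print(counts['S']['H1'])
```

Pairing the first row with itself plays the same role as prepending an identical first row before the shift.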
If you really want a dictionary
from collections import defaultdict
d = defaultdict(lambda: defaultdict(list))
dict_df = counts.stack().join(h.stack().rename('condition')).unstack()
for pos, row in dict_df.iterrows():
    d['M']['H0'].append((row.loc[('condition', 'H0')], row.loc[('M', 'H0')]))
    d['S']['H0'].append((row.loc[('condition', 'H0')], row.loc[('S', 'H0')]))
    d['M']['H1'].append((row.loc[('condition', 'H1')], row.loc[('M', 'H1')]))
    d['S']['H1'].append((row.loc[('condition', 'H1')], row.loc[('S', 'H1')]))
dict(d)
{'M': defaultdict(list,
{'H0': [('AgA', 4), ('TgA', 3), ('CgT', 2), ('GgC', 1)],
'H1': [('CgC', 1), ('AgC', 0), ('GgA', 0), ('TgG', 1)]}),
'S': defaultdict(list,
{'H0': [('AgA', 2), ('TgA', 1), ('CgT', 0), ('GgC', 0)],
'H1': [('CgC', 2), ('AgC', 2), ('GgA', 2), ('TgG', 3)]})}
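To get from these counts to the actual question (which hybrid string came from which block), one possible approach is to score each hybrid string against each block and keep the higher score. The sketch below is only an assumption about how that could be done: it treats each count divided by the number of columns as a rough frequency, applies add-one style smoothing so a zero count does not wipe out the score, and sums the logs. The names `log_score` and `assign`, the smoothing constants, and the scoring rule itself are all hypothetical, and the frequencies are not true conditional probabilities.

```python
import math

# Counts copied from the dictionary above.
counts = {
    'M': {'H0': [('AgA', 4), ('TgA', 3), ('CgT', 2), ('GgC', 1)],
          'H1': [('CgC', 1), ('AgC', 0), ('GgA', 0), ('TgG', 1)]},
    'S': {'H0': [('AgA', 2), ('TgA', 1), ('CgT', 0), ('GgC', 0)],
          'H1': [('CgC', 2), ('AgC', 2), ('GgA', 2), ('TgG', 3)]},
}
N_COLUMNS = 8   # columns per block in the example


def log_score(pairs, n_columns=N_COLUMNS, pseudo=1.0):
    """Sum of log smoothed frequencies; a higher value is a better match."""
    return sum(math.log((n + pseudo) / (n_columns + 2 * pseudo))
               for _, n in pairs)


def assign(counts):
    """Pick, for each hybrid string, the block with the higher score."""
    result = {}
    for name in ('H0', 'H1'):
        scores = {block: log_score(counts[block][name]) for block in counts}
        result[name] = max(scores, key=scores.get)
    return result


print(assign(counts))   # {'H0': 'M', 'H1': 'S'}
```

On the example data this agrees with the expectation stated in the question: the first hybrid string (ATCG) is assigned to block M and the second (CAGT) to block S.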
Source: https://stackoverflow.com/questions/41929351/how-to-read-two-lines-from-a-file-and-create-dynamics-keys-in-a-for-loop


