Python Pandas - Compare 2 dataframes, multiple parameters

-上瘾入骨i 2020-12-20 07:12

I have two tables. One (df below) has approximately 18,000 rows, and the other (mapfile below) has ~800,000 rows. I need a solution that works with DataFrames this large.

1 Answer
  • 2020-12-20 07:43

    IIUC, you can load both tables with read_csv, merge them on Chr, and then filter by position:

    import pandas as pd
    import io
    
    temp1=u"""Sample;Chr;Start;End;Value
    S1;1;100;200;1
    S1;2;200;250;1
    S2;1;50;75;5
    S2;2;150;225;4"""
    # after testing, replace io.StringIO(temp1) with the filename
    dfline = pd.read_csv(io.StringIO(temp1), sep=";")
    
    temp2=u"""Name;Chr;Position
    P1;1;105
    P2;1;60
    P3;1;500
    P4;2;25
    P5;2;220
    P6;2;240"""
    # after testing, replace io.StringIO(temp2) with the filename
    mapfile = pd.read_csv(io.StringIO(temp2), sep=";")
    
    print(dfline)
      Sample  Chr  Start  End  Value
    0     S1    1    100  200      1
    1     S1    2    200  250      1
    2     S2    1     50   75      5
    3     S2    2    150  225      4
    print(mapfile)
      Name  Chr  Position
    0   P1    1       105
    1   P2    1        60
    2   P3    1       500
    3   P4    2        25
    4   P5    2       220
    5   P6    2       240
    
    #merge by column Chr
    df = pd.merge(dfline, mapfile, on=['Chr'])
    
    #select by conditions
    df = df[(df.Position > df.Start) & (df.Position < df.End)]
    
    # keep only the columns of interest
    df = df[['Name', 'Chr', 'Position', 'Value', 'Sample']]
    
    print(df)
       Name  Chr  Position  Value Sample
    0    P1    1       105      1     S1
    4    P2    1        60      5     S2
    7    P5    2       220      1     S1
    8    P6    2       240      1     S1
    10   P5    2       220      4     S2
    
    # reset the index if needed
    print(df.reset_index(drop=True))
      Name  Chr  Position  Value Sample
    0   P1    1       105      1     S1
    1   P2    1        60      5     S2
    2   P5    2       220      1     S1
    3   P6    2       240      1     S1
    4   P5    2       220      4     S2
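
Since the question mentions ~18,000 and ~800,000 rows, merging on Chr alone may build a very large intermediate table before the filter runs. One possible workaround, offered as a rough sketch rather than a tested solution for the real data, is to merge the larger table in chunks and filter each chunk immediately. The helper name `positions_in_intervals` and the `chunk_size` parameter are hypothetical; the column names and the strict comparisons follow the code above.

```python
import pandas as pd

# Hypothetical helper: same merge-then-filter logic as above, applied to
# slices of the positions table so the intermediate merge stays small.
def positions_in_intervals(intervals, positions, chunk_size=10_000):
    columns = ["Name", "Chr", "Position", "Value", "Sample"]
    pieces = []
    for start in range(0, len(positions), chunk_size):
        chunk = positions.iloc[start:start + chunk_size]
        merged = chunk.merge(intervals, on="Chr")
        # strict inequalities, as in the original filter
        inside = (merged["Position"] > merged["Start"]) & (merged["Position"] < merged["End"])
        matched = merged.loc[inside, columns]
        if not matched.empty:
            pieces.append(matched)
    if not pieces:
        # no position fell inside any interval
        return pd.DataFrame(columns=columns)
    return pd.concat(pieces, ignore_index=True)

# Same sample data as in the answer above
intervals = pd.DataFrame({
    "Sample": ["S1", "S1", "S2", "S2"],
    "Chr": [1, 2, 1, 2],
    "Start": [100, 200, 50, 150],
    "End": [200, 250, 75, 225],
    "Value": [1, 1, 5, 4],
})
positions = pd.DataFrame({
    "Name": ["P1", "P2", "P3", "P4", "P5", "P6"],
    "Chr": [1, 1, 1, 2, 2, 2],
    "Position": [105, 60, 500, 25, 220, 240],
})

# chunk_size=2 only to exercise the chunking on this tiny sample
result = positions_in_intervals(intervals, positions, chunk_size=2)
print(result)
```

This should return the same five Name/Sample pairs as the single merge, although the row order can differ because rows come out grouped by chunk. The right `chunk_size` depends on available memory and on how many intervals share each Chr value, so treat the default as a starting point.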
    