I have two tables. One (df below) has approximately 18,000 rows, and the other (mapfile below) has ~800,000 rows. I need a solution that works with DataFrames of this size.
IIUC you can use read_csv and merge:
import pandas as pd
import io
temp1=u"""Sample;Chr;Start;End;Value
S1;1;100;200;1
S1;2;200;250;1
S2;1;50;75;5
S2;2;150;225;4"""
#after testing, replace io.StringIO(temp1) with the filename
dfline = pd.read_csv(io.StringIO(temp1), sep=";")
temp2=u"""Name;Chr;Position
P1;1;105
P2;1;60
P3;1;500
P4;2;25
P5;2;220
P6;2;240"""
#after testing, replace io.StringIO(temp2) with the filename
mapfile = pd.read_csv(io.StringIO(temp2), sep=";")
print(dfline)
Sample Chr Start End Value
0 S1 1 100 200 1
1 S1 2 200 250 1
2 S2 1 50 75 5
3 S2 2 150 225 4
print(mapfile)
Name Chr Position
0 P1 1 105
1 P2 1 60
2 P3 1 500
3 P4 2 25
4 P5 2 220
5 P6 2 240
#merge by column Chr - this pairs every Sample row with every Name row on the same chromosome
df = pd.merge(dfline, mapfile, on=['Chr'])
#keep only rows where Position falls strictly inside the (Start, End) interval
df = df[(df.Position > df.Start) & (df.Position < df.End)]
#subset of df
df = df[['Name','Chr','Position','Value', 'Sample']]
print(df)
Name Chr Position Value Sample
0 P1 1 105 1 S1
4 P2 1 60 5 S2
7 P5 2 220 1 S1
8 P6 2 240 1 S1
10 P5 2 220 4 S2
#if you need to reset the index
print(df.reset_index(drop=True))
Name Chr Position Value Sample
0 P1 1 105 1 S1
1 P2 1 60 5 S2
2 P5 2 220 1 S1
3 P6 2 240 1 S1
4 P5 2 220 4 S2
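One caveat for your sizes: the merge on Chr builds every Sample×Name pair per chromosome before filtering, which can blow up memory with an ~800,000-row mapfile. A sketch of one way around that, reading mapfile in chunks with read_csv's chunksize and filtering each chunk before concatenating (chunksize=2 here is deliberately tiny just to exercise the loop; with real files pass the filenames instead of io.StringIO and a much larger chunksize):

```python
import io
import pandas as pd

temp1 = u"""Sample;Chr;Start;End;Value
S1;1;100;200;1
S1;2;200;250;1
S2;1;50;75;5
S2;2;150;225;4"""
dfline = pd.read_csv(io.StringIO(temp1), sep=";")

temp2 = u"""Name;Chr;Position
P1;1;105
P2;1;60
P3;1;500
P4;2;25
P5;2;220
P6;2;240"""

pieces = []
# chunksize keeps only a slice of mapfile in memory at a time
for chunk in pd.read_csv(io.StringIO(temp2), sep=";", chunksize=2):
    m = pd.merge(dfline, chunk, on="Chr")
    # filter inside the loop so the intermediate cross join stays small
    m = m[(m.Position > m.Start) & (m.Position < m.End)]
    pieces.append(m)

df = pd.concat(pieces, ignore_index=True)
df = df[["Name", "Chr", "Position", "Value", "Sample"]]
print(df)
```

This gives the same five rows as the single merge above, but the peak memory is bounded by dfline plus one chunk's cross join rather than the full merge.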