问题
I'm collecting time-indexed data coming from various files, but sometimes there is some overlapping:
df1 = pd.DataFrame([1, -1, -3], columns=['A'], index=pd.date_range('2000-01-01', periods=3))
df2 = pd.DataFrame([-3, 10, 1], columns=['A'], index=pd.date_range('2000-01-03', periods=3))
pd.concat([df1, df2])
A
2000-01-01 1
2000-01-02 -1
2000-01-03 -3
A
2000-01-03 -3
2000-01-04 10
2000-01-05 1
A
2000-01-01 1
2000-01-02 -1
2000-01-03 -3
2000-01-03 -3
2000-01-04 10
2000-01-05 1
1) How to clean and remove the duplicate lines ? (here 2000-01-03)
2) More generally, is there a faster / more clever way with pandas
to read and merge multiple csv files than doing manually:
L=[]
for f in glob.glob('*.csv'):
L.append(pd.read_csv(f, ...))
fulldata = pd.concat(L) # this can be time consuming
fulldata.remove_duplicate_lines() # this can be time consuming too
回答1:
IIUC you could do pd.concat and then do drop_duplicates:
In [104]: pd.concat([df1, df2]).drop_duplicates()
Out[104]:
A
2000-01-01 1
2000-01-02 -1
2000-01-03 -3
2000-01-04 10
2000-01-05 7
EDIT
You are right, that method isn't working properly because it drops by value not by index. For index you could duplicated for index
:
df = pd.concat([df1, df2])
df[~df.index.duplicated()]
In [107]: df[~df.index.duplicated()]
Out[107]:
A
2000-01-01 1
2000-01-02 -1
2000-01-03 -3
2000-01-04 10
2000-01-05 1
Or you could use 1st method with modification, first you need to do reset_index
, and then use drop_duplicates
but for index values with subset
key:
pd.concat([df1, df2]).reset_index().drop_duplicates(subset='index').set_index('index')
In [118]: pd.concat([df1, df2]).reset_index().drop_duplicates(subset='index').set_index('index')
Out[118]:
A
index
2000-01-01 1
2000-01-02 -1
2000-01-03 -3
2000-01-04 10
2000-01-05 1
回答2:
If you're feeling adventurous and decide to use something other than Pandas to combine CSVs, and you're on a machine with Awk, you can combine various files and remove duplicates with this single command:
awk '!arr[$0]++' /path/to/your/files/* > combined_no_dups.csv
And then you could load it into pandas...
df = pd.read_csv("combined_no_dups.csv")
来源:https://stackoverflow.com/questions/34197777/merge-dataframes-and-discard-duplicates-values