Join two large files by column in python

问题

I have 2 files with 38374732 lines in each and size 3.3 G each. I am trying to join them on the first column. For doing so I decided to use pandas with the following code that pulled from Stackoverflow:

 import pandas as pd
 import sys
 a = pd.read_csv(sys.argv[1],sep='\t',encoding="utf-8-sig")
 b = pd.read_csv(sys.argv[2],sep='\t',encoding="utf-8-sig")
 chunksize = 10 ** 6
 for chunk in a(chunksize=chunksize):
   merged = chunk.merge(b, on='Bin_ID')
   merged.to_csv("output.csv", index=False,sep='\t')

However I am getting memory error(not surprising). I looked up at the code with chunks for pandas (something like this How to read a 6 GB csv file with pandas), however how do I implement it for two files in a loop and I don't think I can chunk the second file as I need to lookup for column in the whole second file.Is there a way out for this?

回答1:

This is already discussed in other posts like the one you mentioned (this, or this, or this).

As it is explained there, I would try to use dask dataframe to load the data and execute the merge, but depending on your PC you may still not be able to do it.

Minimum working example:

import dask.dataframe as dd

# Read the CSVs
df1 = dd.read_csv('data1.csv')
df2 = dd.read_csv('data2.csv')

# Merge them
df = dd.merge(df1, df2, on='Bin_ID').compute()

# Save the merged dataframe
df.to_csv('merged.csv', index=False)

来源：https://stackoverflow.com/questions/50101772/join-two-large-files-by-column-in-python

标签

python

pandas

join

dask

chunks