Join two large files by column in Python

Submitted by 穿精又带淫゛_ on 2020-01-04 05:48:05

Question


I have 2 files, each with 38,374,732 lines and a size of 3.3 GB. I am trying to join them on the first column. To do so, I decided to use pandas, with the following code pulled from Stack Overflow:

 import pandas as pd
 import sys

 # Load the lookup file fully; stream the other file in chunks
 b = pd.read_csv(sys.argv[2], sep='\t', encoding='utf-8-sig')

 chunksize = 10 ** 6
 reader = pd.read_csv(sys.argv[1], sep='\t', encoding='utf-8-sig',
                      chunksize=chunksize)
 for i, chunk in enumerate(reader):
     merged = chunk.merge(b, on='Bin_ID')
     # Append so earlier chunks are not overwritten; write the header once
     merged.to_csv('output.csv', index=False, sep='\t',
                   mode='a', header=(i == 0))

However, I am getting a memory error (not surprising). I looked up the chunked-reading approach for pandas (something like this: How to read a 6 GB csv file with pandas), but how do I implement it for two files in a loop? I don't think I can chunk the second file, as I need to look up the join column in the whole second file. Is there a way out of this?


Answer 1:


This is already discussed in other posts like the one you mentioned (this, or this, or this).

As explained there, I would try using a Dask DataFrame to load the data and execute the merge, although depending on your PC's memory you may still not be able to do it.

Minimal working example:

import dask.dataframe as dd

# Read the files lazily; the question's files are tab-separated
df1 = dd.read_csv('data1.csv', sep='\t')
df2 = dd.read_csv('data2.csv', sep='\t')

# Merge them; note that compute() materializes the result as an
# in-memory pandas DataFrame, so this step needs enough RAM to hold it
df = dd.merge(df1, df2, on='Bin_ID').compute()

# Save the merged dataframe
df.to_csv('merged.csv', index=False, sep='\t')
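
If the merged result itself is too large for compute() to materialize, a variant that stays out of core is to skip compute() and let Dask write the output one partition at a time. A minimal sketch, assuming the same placeholder file names and join column (the * in the output path is Dask's convention and is replaced by each partition's number):

import dask.dataframe as dd

# Lazy, tab-separated reads, as above
df1 = dd.read_csv('data1.csv', sep='\t')
df2 = dd.read_csv('data2.csv', sep='\t')

# Keep the merge lazy: with no compute() call, the full result
# never has to sit in memory at once
merged = dd.merge(df1, df2, on='Bin_ID')

# Writes one CSV per partition: merged-0.csv, merged-1.csv, ...
merged.to_csv('merged-*.csv', index=False, sep='\t')

If a single output file is required, the partition files can then be concatenated in one streaming pass (e.g. with cat) without loading them into memory.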


Source: https://stackoverflow.com/questions/50101772/join-two-large-files-by-column-in-python
