Merging two tables with millions of rows in Python

慢半拍i 2020-12-05 08:47

I am using Python for some data analysis. I have two tables, the first (let's call it 'A') has 10 million rows and 10 columns and the second ('B') has 73 million rows a

1 answer
  • 2020-12-05 09:02

    The code below is partly pseudocode, but I think it should be quite fast.

    It is a straightforward disk-based merge, with all tables on disk. The key is that you are not doing a selection per se, just indexing into the table via start/stop, which is quite fast.
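
    As a rough illustration of the start/stop indexing described above, the sketch below computes the row range of each chunk in plain Python. The helper name `chunk_bounds` is hypothetical and not part of pandas or PyTables; each `(start, stop)` pair it yields is meant to be passed as the `start` and `stop` arguments of `HDFStore.select`.

```python
def chunk_bounds(nrows, chunk_size):
    """Yield (start, stop) row positions that cover range(nrows) in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, nrows, chunk_size):
        # The last chunk is clipped so it never runs past the table.
        yield start, min(start + chunk_size, nrows)


bounds = list(chunk_bounds(10, 4))
print(bounds)  # [(0, 4), (4, 8), (8, 10)]
```

    Stepping `range` by the chunk size avoids an empty trailing chunk when the row count is an exact multiple of the chunk size.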

    Selecting the rows in B that meet a criterion (using A's ids) won't be very fast, because I think it might bring the data into Python space rather than run as an in-kernel search. I am not sure, but you might want to look into this further in the in-kernel optimization section on pytables.org; there is a way to tell whether a query is going to be in-kernel or not.

    Also, if you are up to it, this is a very parallel problem. Just don't write the results to the same file from multiple processes; PyTables is not write-safe for that.

    See this answer for a comment on how doing a join operation will actually be an 'inner' join.

    For your merge_a_b operation I think you can use a standard pandas join which is quite efficient (when in-memory).
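
    To make the inner-join point concrete, here is a small in-memory sketch. It assumes both frames share a key column called `id` (a made-up name; the question does not show the real schema), and the toy data and the helpers `chunks` and `canon` are illustrative only. It merges every chunk of A against every chunk of B with `pd.merge(..., how='inner')` and checks that the concatenated pieces match one inner merge of the full frames. A left or outer join done pairwise like this would instead emit unmatched rows once per chunk pair.

```python
import pandas as pd

# Toy stand-ins for A and B; the 'id' key and column names are assumptions.
A = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "a_val": list("abcdef")})
B = pd.DataFrame({"id": [2, 2, 4, 6, 7, 9], "b_val": [10, 20, 30, 40, 50, 60]})


def merge_a_b(a, b):
    # Inner merge: only ids present in both chunks survive.
    return pd.merge(a, b, on="id", how="inner")


def chunks(df, size):
    # Slice by row position, mirroring start/stop selection from a store.
    for start in range(0, len(df), size):
        yield df.iloc[start:start + size]


# Merge every chunk pair and keep only the non-empty results.
pieces = []
for a in chunks(A, 2):
    for b in chunks(B, 4):
        m = merge_a_b(a, b)
        if len(m):
            pieces.append(m)

chunked = pd.concat(pieces, ignore_index=True)
full = merge_a_b(A, B)


def canon(df):
    # Row order can differ between the two approaches, so sort first.
    return df.sort_values(["id", "b_val"]).reset_index(drop=True)


assert canon(chunked).equals(canon(full))
print(len(chunked))  # 4
```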

    One other option (depending on how 'big' A is) might be to separate A into 2 pieces that are indexed the same, keeping a smaller piece (maybe a single column) in the first table. Instead of storing the merge results per se, store the row index; later you can pull out the data you need (kind of like using an indexer and take). See http://pandas.pydata.org/pandas-docs/stable/io.html#multiple-table-queries
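
    A minimal in-memory sketch of the "store the row index, pull the data later" idea. It assumes the key column is named `id` and that row positions in these toy frames stand in for row positions in the on-disk tables; all column names here are made up for illustration. Only the narrow key column takes part in the merge, and the stored result is a pair of integer position columns that are later used with `DataFrame.take` to fetch the wide rows.

```python
import numpy as np
import pandas as pd

# Toy frames; 'id' and the payload column names are assumptions.
A = pd.DataFrame({"id": [10, 20, 30, 40], "a_payload": [1.0, 2.0, 3.0, 4.0]})
B = pd.DataFrame({"id": [20, 40, 40, 50], "b_payload": ["w", "x", "y", "z"]})

# Narrow "key tables": just the key plus the row position in the source.
a_keys = pd.DataFrame({"id": A["id"].to_numpy(), "a_row": np.arange(len(A))})
b_keys = pd.DataFrame({"id": B["id"].to_numpy(), "b_row": np.arange(len(B))})

# Merge only the keys; this small frame is what would be stored
# instead of the full merge results.
matches = pd.merge(a_keys, b_keys, on="id", how="inner")

# Later: pull the wide rows by position, like an indexer followed by take.
a_part = A.take(matches["a_row"].to_numpy()).reset_index(drop=True)
b_part = B.take(matches["b_row"].to_numpy()).reset_index(drop=True)
result = pd.concat([a_part, b_part.drop(columns="id")], axis=1)
print(result)
```

    With on-disk tables, the stored positions would drive the row selection from the store rather than `take` on an in-memory frame; the multiple-table-queries section linked above covers that side.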

    import pandas as pd
    from pandas import HDFStore

    A = HDFStore('A.h5')
    B = HDFStore('B.h5')
    # This is your result store (the file name is a placeholder).
    store = HDFStore('result.h5')

    nrows_a = A.get_storer('df').nrows
    nrows_b = B.get_storer('df').nrows
    a_chunk_size = 1000000
    b_chunk_size = 1000000

    def merge_a_b(a, b):
        # Function that returns an operation on passed
        # frames, a and b.
        # It could be a merge, join, concat, or other operation that
        # results in a single frame.
        # Placeholder: inner merge on a shared key column, assumed to be 'id'.
        return pd.merge(a, b, on='id')

    for a_start_i in range(0, nrows_a, a_chunk_size):
        a_stop_i = min(a_start_i + a_chunk_size, nrows_a)

        # Read one chunk of A by row position (no query involved).
        a = A.select('df', start=a_start_i, stop=a_stop_i)

        for b_start_i in range(0, nrows_b, b_chunk_size):
            b_stop_i = min(b_start_i + b_chunk_size, nrows_b)

            # Read one chunk of B by row position.
            b = B.select('df', start=b_start_i, stop=b_stop_i)

            m = merge_a_b(a, b)

            # Only append chunk pairs that produced matching rows.
            if len(m):
                store.append('df_result', m)

    for s in (A, B, store):
        s.close()
    