Merging two tables with millions of rows in Python

后端未结

关注

 1  655

I am using Python for some data analysis. I have two tables, the first (let\'s call it \'A\') has 10 million rows and 10 columns and the second (\'B\') has 73 million rows a

相关标签:

1条回答

野性不改

2020-12-05 09:02
This is a little pseudo codish, but I think should be quite fast.

Straightforward disk based merge, with all tables on disk. The key is that you are not doing selection per se, just indexing into the table via start/stop, which is quite fast.

Selecting the rows that meet a criteria in B (using A's ids) won't be very fast, because I think it might be bringing the data into Python space rather than an in-kernel search (I am not sure, but you might want to investigate on pytables.org more in the in-kernel optimization section. There is a way to tell if it's going to be in-kernel or not).

Also if you are up to it, this is a very parallel problem (just don't write the results to the same file from multiple processes. pytables is not write-safe for that).

See this answer for a comment on how doing a join operation will actually be an 'inner' join.

For your merge_a_b operation I think you can use a standard pandas join which is quite efficient (when in-memory).

One other option (depending on how 'big' A) is, might be to separate A into 2 pieces (that are indexed the same), using a smaller (maybe use single column) in the first table; instead of storing the merge results per se, store the row index; later you can pull out the data you need (kind of like using an indexer and take). See http://pandas.pydata.org/pandas-docs/stable/io.html#multiple-table-queries
```
A = HDFStore('A.h5')
B = HDFStore('B.h5')

nrows_a = A.get_storer('df').nrows
nrows_b = B.get_storer('df').nrows
a_chunk_size = 1000000
b_chunk_size = 1000000

def merge_a_b(a,b):
    # Function that returns an operation on passed
    # frames, a and b.
    # It could be a merge, join, concat, or other operation that
    # results in a single frame.


for a in xrange(int(nrows_a / a_chunk_size) + 1):

    a_start_i = a * a_chunk_size
    a_stop_i  = min((a + 1) * a_chunk_size, nrows_a)

    a = A.select('df', start = a_start_i, stop = a_stop_i)

    for b in xrange(int(nrows_b / b_chunk_size) + 1):

        b_start_i = b * b_chunk_size
        b_stop_i = min((b + 1) * b_chunk_size, nrows_b)

        b = B.select('df', start = b_start_i, stop = b_stop_i)

        # This is your result store
        m = merge_a_b(a, b)

        if len(m):
            store.append('df_result', m)
```
0 讨论(0)
发布评论:

提交评论
- 加载中...