Question
I'm relatively new to Python and I'm using PyTables to store some genomic annotations in HDF5 for faster queries. I find that querying for a string that matches nothing in the table is slow, and I'm unsure how to optimize it for better performance.
One of the tables is shown below:
In [5]: t
Out[5]:
/gene/annotation (Table(315202,), fletcher32, blosc(5)) ''
description := {
"name": StringCol(itemsize=36, shape=(), dflt='', pos=0),
"track": StringCol(itemsize=12, shape=(), dflt='', pos=1),
"etype": StringCol(itemsize=12, shape=(), dflt='', pos=2),
"event": StringCol(itemsize=36, shape=(), dflt='', pos=3)}
byteorder := 'irrelevant'
chunkshape := (1365,)
autoindex := True
colindexes := {
"name": Index(9, full, shuffle, zlib(1)).is_csi=True}
When the condition matches something in the table, timeit reports times in the microsecond range.
In [6]: timeit [x for x in t.where("name == 'record_exists_in_table'")]
10000 loops, best of 3: 109 µs per loop
However, when I search for a string that does not exist in the table, the query takes milliseconds.
In [8]: timeit [x for x in t.where("name == 'no_such_record'")]
10 loops, best of 3: 56 ms per loop
Any advice that points me in the right direction will be greatly appreciated!
Answer 1:
I've exhausted my search on the web and have yet to find anything that resolves the issue, so I decided to use SeqIO.index_db() from Biopython to create a separate index, and then check that a condition will be found before executing a PyTables query. It's not exactly the pretty solution I was looking for, but it will do. It has substantially improved performance on non-matching conditions.
In [6]: timeit [x for x in t.where("name == 'not_found_in_table'")]
10 loops, best of 3: 51.6 ms per loop
In [9]: timeit [x for x in t.search_by_gene('not_found_in_table')]
10000 loops, best of 3: 29.5 µs per loop
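The "check first, then query" idea can be sketched without Biopython. This is not the code behind search_by_gene above, which isn't shown; it is a minimal illustration that swaps the SeqIO.index_db() index for a plain in-memory set. The class name GuardedLookup and its parameters known_names and run_query are hypothetical names chosen for this example.

```python
from typing import Any, Callable, Iterable, Iterator, Set


class GuardedLookup:
    """Skip the expensive table query when the key is known to be absent."""

    def __init__(self, known_names: Iterable[str],
                 run_query: Callable[[str], Iterable[Any]]) -> None:
        # known_names (hypothetical): every value of the "name" column,
        # collected once up front.
        # run_query (hypothetical): a callable that performs the real lookup.
        self._known: Set[str] = set(known_names)
        self._run_query = run_query

    def search(self, name: str) -> Iterator[Any]:
        # Set membership is O(1) on average, so a miss returns an empty
        # iterator without ever touching the table.
        if name not in self._known:
            return iter(())
        return iter(self._run_query(name))
```

With PyTables, known_names could presumably be built from t.col("name") (decoding the byte strings under Python 3), and run_query could wrap a call to t.where(...). The set costs memory proportional to the number of distinct names and has to be rebuilt or updated whenever rows are added to the table.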
Source: https://stackoverflow.com/questions/25386146/pytables-slow-on-query-for-non-matching-string