PyTables slow on query for non-matching string

Question

I'm relatively new to Python and I'm using PyTables to store some genomic annotations in HDF5 for faster querying. I find that querying the table for a non-matching string is slow, and I'm unsure how to optimize it for better performance.

One of the tables is shown below:

In [5]: t
Out[5]: 
/gene/annotation (Table(315202,), fletcher32, blosc(5)) ''
  description := {
  "name": StringCol(itemsize=36, shape=(), dflt='', pos=0),
  "track": StringCol(itemsize=12, shape=(), dflt='', pos=1),
  "etype": StringCol(itemsize=12, shape=(), dflt='', pos=2),
  "event": StringCol(itemsize=36, shape=(), dflt='', pos=3)}
  byteorder := 'irrelevant'
  chunkshape := (1365,)
  autoindex := True
  colindexes := {
    "name": Index(9, full, shuffle, zlib(1)).is_csi=True}

When the condition matches something in the table, timeit reports times in the microsecond range.

In [6]: timeit [x for x in t.where("name == 'record_exists_in_table'")]
10000 loops, best of 3: 109 µs per loop

However, when I search for a string that does not exist in the table, the query takes milliseconds.

In [8]: timeit [x for x in t.where("name == 'no_such_record'")]
10 loops, best of 3: 56 ms per loop

Any advice that points me in the right direction would be greatly appreciated!

Answer 1:

I've exhausted my search on the web and have yet to find anything that resolves the issue. So I decided to use SeqIO.index_db() in Biopython to create a separate index, and to check that a condition will match before executing a PyTables query. It's not exactly the pretty solution I was looking for, but it will do. It has substantially improved performance for non-matching conditions.

In [6]: timeit [x for x in t.where("name == 'not_found_in_table'")]
10 loops, best of 3: 51.6 ms per loop

In [9]: timeit [x for x in t.search_by_gene('not_found_in_table')]
10000 loops, best of 3: 29.5 µs per loop
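
The code behind search_by_gene is not shown in the answer, so the following is only a minimal sketch of the general idea (check membership in a separate index first, and query the table only on a hit). It swaps the SeqIO.index_db() index for a plain in-memory set, and the names make_guarded_search, run_query and fake_table_query are hypothetical, introduced purely for illustration.

```python
from typing import Callable, Iterable, List


def make_guarded_search(known_names: Iterable[str],
                        run_query: Callable[[str], List]) -> Callable[[str], List]:
    """Return a search function that skips run_query for unknown names."""
    # Assumption: all names fit in memory; a set gives average O(1) lookups.
    known = frozenset(known_names)

    def search(name: str) -> List:
        if name not in known:
            return []           # miss: the table is never queried
        return run_query(name)  # hit: fall through to the real query

    return search


# Stand-in for the real table query. It records each call so the
# short-circuit is visible; with PyTables it would wrap t.where(...).
calls = []


def fake_table_query(name: str) -> List:
    calls.append(name)
    return [name]


search = make_guarded_search(["geneA", "geneB"], fake_table_query)
hit = search("geneA")            # reaches fake_table_query
miss = search("no_such_record")  # answered from the set alone
```

With the table from the question, the set could presumably be built once from the name column (for example from t.col('name'), whose values are byte strings on Python 3 and would need decoding), at the cost of holding every name in memory. An on-disk index such as the one the answer uses avoids that cost.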

Source: https://stackoverflow.com/questions/25386146/pytables-slow-on-query-for-non-matching-string

Tags

python

pytables