Store/retrieve a data structure

轮回少年 2020-12-31 11:58

I have implemented a suffix tree in Python to do full-text searches, and it's working really well. But there's a problem: the indexed text can be very big.

5 Answers
  • 2020-12-31 12:12

    If pickle is already working for you, you may want to take a look at ZODB, which adds some functionality on top of pickle. Looking at the documentation, I saw this paragraph that appears to address the size concerns you're having:

    The database moves objects freely between memory and storage. If an object has not been used in a while, it may be released and its contents loaded from storage the next time it is used.

  • 2020-12-31 12:12

    An effective way to manage a structure like this is to use a memory-mapped file. In the file, instead of storing references for the node pointers, you store offsets into the file. You can still use pickle to serialise the node data to a stream suitable for storing on disk, but you will want to avoid storing references since the pickle module will want to embed your entire tree all at once (as you've seen).

    Using the mmap module, you can map the file into address space and read it just like a huge sequence of bytes. The OS takes care of actually reading from the file and managing file buffers and all the details.

    You might store the first node at the start of the file, and have offsets that point to the next node(s). Reading the next node is just a matter of reading from the correct offset in the file.

    The advantage of memory-mapped files is that they aren't loaded into memory all at once, but only read from disk when needed. I've done this (on a 64-bit OS) with a 30 GB file on a machine with only 4 GB of RAM installed, and it worked fine.
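A small sketch of the offset scheme described above, under assumptions: nodes are plain dicts here, and a simple length-prefixed record layout stands in for whatever format you choose:

```python
import mmap
import os
import pickle
import struct
import tempfile

def write_nodes(path, nodes):
    """Pickle each node into the file; return {key: byte offset}."""
    offsets = {}
    with open(path, "wb") as f:
        for key, node in nodes.items():
            offsets[key] = f.tell()
            blob = pickle.dumps(node)
            f.write(struct.pack("<I", len(blob)))  # 4-byte length prefix
            f.write(blob)
    return offsets

def read_node(mm, offset):
    """Load a single node from the mapped file without reading the rest."""
    (length,) = struct.unpack_from("<I", mm, offset)
    return pickle.loads(mm[offset + 4 : offset + 4 + length])

path = os.path.join(tempfile.mkdtemp(), "tree.bin")
offsets = write_nodes(path, {
    "root": {"edge": "", "children": ["a"]},
    "a":    {"edge": "na", "children": []},
})
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    node = read_node(mm, offsets["a"])  # only this record is touched
```

Child pointers then become offsets stored inside each pickled record, so following an edge is one `read_node` call, and the OS pages in only what you actually visit.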

  • 2020-12-31 12:25

    What about storing it in SQLite?

    SQLite:

    • has support for up to 2 terabytes of data,
    • supports SQL queries,
    • is a great replacement for storing in-app data,
    • can support ~100k visits per day (for an average web application),

    So it may be worth taking a closer look at this solution, as it has proven to be an efficient way to store data within applications.
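A rough sketch of one row per node, assuming an invented schema and key scheme; pickled nodes are stored as BLOBs so only the rows you query are loaded:

```python
import pickle
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a persistent index
conn.execute("CREATE TABLE IF NOT EXISTS nodes (id TEXT PRIMARY KEY, data BLOB)")

def put_node(conn, node_id, node):
    # Upsert one pickled node under its id.
    conn.execute("INSERT OR REPLACE INTO nodes VALUES (?, ?)",
                 (node_id, pickle.dumps(node)))

def get_node(conn, node_id):
    # Fetch and unpickle a single node; the rest of the tree stays on disk.
    row = conn.execute("SELECT data FROM nodes WHERE id = ?",
                       (node_id,)).fetchone()
    return pickle.loads(row[0])

put_node(conn, "root", {"children": ["a", "b"]})
node = get_node("root" if False else "root", conn) if False else get_node(conn, "root")
```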

  • 2020-12-31 12:26

    Try a compressed suffix tree instead.

    The main idea is that instead of having many nodes holding one character each, you compact a chain of single-child nodes into one node holding several characters, saving the intermediate nodes.

    This page (http://www.cs.sunysb.edu/~algorith/implement/suds/implement.shtml) reports transforming a 160 MB suffix tree into a 33 MB compressed suffix tree. Quite a gain.

    These compressed trees are used for genetic substring matching on huge strings. I used to run out of memory with a suffix tree, but after I compressed it, the out of memory error disappeared.

    I wish I could find a freely available article that explains the implementation better. (http://dl.acm.org/citation.cfm?id=1768593)
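The chain compaction described above can be sketched on a toy example. This is the Patricia/radix-tree idea, not a full compressed suffix tree implementation; nodes are plain dicts for brevity:

```python
def build_trie(text):
    """Naive suffix trie: one node per character of every suffix."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root

def compress(node):
    """Collapse chains of single-child nodes into multi-character edge labels."""
    out = {}
    for label, child in node.items():
        while len(child) == 1:
            (ch, grandchild), = child.items()
            label += ch          # absorb the unary chain into the edge label
            child = grandchild
        out[label] = compress(child)
    return out

def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.values())

trie = build_trie("banana$")
compact = compress(trie)
# count_nodes(compact) is much smaller than count_nodes(trie)
```

For example, the whole suffix "banana$" collapses into a single edge in the compressed tree, whereas the naive trie spends seven nodes on it.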

  • 2020-12-31 12:27

    Maybe you could combine cPickle with a bsddb database, which lets you store your pickled nodes in a dictionary-like object kept on the filesystem; you can then reopen the database later and fetch only the nodes you need, with very good performance.

    In a very simple form:

    import bsddb
    import cPickle
    
    # Open (or create) a B-tree database file; it behaves like a dict on disk.
    db = bsddb.btopen('/tmp/nodes.db', 'c')
    
    def store_node(node, key, db):
        # Serialise one node and store it under its key.
        db[key] = cPickle.dumps(node)
    
    def get_node(key, db):
        # Load and deserialise a single node on demand.
        return cPickle.loads(db[key])
    
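Note that bsddb and cPickle are Python 2 modules. On Python 3 the standard-library dbm and pickle modules give you the same dict-like pattern; a rough equivalent (not the original answer's code):

```python
import dbm
import os
import pickle
import tempfile

# Rough Python 3 equivalent of the bsddb/cPickle recipe above.
path = os.path.join(tempfile.mkdtemp(), "nodes.db")
db = dbm.open(path, "c")  # 'c': create the database file if it does not exist

def store_node(node, key, db):
    db[key] = pickle.dumps(node)  # dbm stores values as bytes

def get_node(key, db):
    return pickle.loads(db[key])

store_node({"edge": "ana", "children": []}, "n1", db)
node = get_node("n1", db)
db.close()
```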