Processing Large Files in Python [1000 GB or More]

佛祖请我去吃肉 2020-12-15 06:27

Let's say I have a text file of 1000 GB. I need to find how many times a phrase occurs in the text.

Is there any faster way to do this than the one I am using below?
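
(For reference, the straightforward streaming baseline looks something like the sketch below. This is an illustration, not the asker's actual code, which isn't shown here; the function name `count_phrase` and the chunk size are assumptions. It reads the file in fixed-size chunks and carries a small overlap between chunks so a phrase split across a chunk boundary is still counted.)

    def count_phrase(filename, phrase, chunk_size=64 * 1024 * 1024):
        """Count occurrences of `phrase` by streaming the file in chunks."""
        count = 0
        overlap = ''
        with open(filename, 'r') as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                buf = overlap + chunk
                count += buf.count(phrase)
                # Keep the last len(phrase) - 1 characters so a match that
                # straddles the chunk boundary is found on the next pass.
                # (Assumes the phrase does not overlap itself, which is
                # true for typical word phrases.)
                overlap = buf[-(len(phrase) - 1):] if len(phrase) > 1 else ''
        return count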

8 Answers
  •  心在旅途
    2020-12-15 07:05

    Here is a third, longer method that uses a database. The database is sure to be larger than the text. I am not sure whether the indexes are optimal; some space savings could come from playing with that a little (maybe (WORD) plus (POS, WORD) is better, or perhaps (WORD, POS) alone is fine; it needs a little experimentation).
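
    Purely for illustration, those variants as DDL (the scripts below build the first two):

    CREATE UNIQUE INDEX I_WORDS_WORD_POS ON WORDS ( WORD, POS );
    CREATE UNIQUE INDEX I_WORDS_POS_WORD ON WORDS ( POS, WORD );
    -- slimmer alternative to experiment with: a single-column index on WORD
    CREATE INDEX I_WORDS_WORD ON WORDS ( WORD );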

    This may not perform well on 200 OK's test, though, because that input is a lot of repeated text, but it might perform well on more unique data.

    First, create the database by scanning the input for words:

    import sqlite3
    import re
    
    INPUT_FILENAME = 'bigfile.txt'
    DB_NAME = 'words.db'
    FLUSH_X_WORDS = 10000  # write buffered words to the DB every ~10,000 words
    
    
    conn = sqlite3.connect(DB_NAME)
    cursor = conn.cursor()
    
    
    # One row per (position, word). WITHOUT ROWID stores the table as a
    # clustered index on the primary key, which saves space here.
    cursor.execute("""
    CREATE TABLE IF NOT EXISTS WORDS (
         POS INTEGER
        ,WORD TEXT
        ,PRIMARY KEY( POS, WORD )
    ) WITHOUT ROWID
    """)
    
    # Drop any leftover indexes and rows so the script can be re-run cleanly;
    # the indexes are rebuilt after the load.
    cursor.execute("""
    DROP INDEX IF EXISTS I_WORDS_WORD_POS
    """)
    
    cursor.execute("""
    DROP INDEX IF EXISTS I_WORDS_POS_WORD
    """)
    
    cursor.execute("""
    DELETE FROM WORDS
    """)
    
    conn.commit()
    
    def flush_words(words):
        # Write every buffered (position, word) pair, then commit once so
        # the whole batch is a single transaction. Words are already
        # lowercased when collected.
        for word, positions in words.items():
            for pos in positions:
                cursor.execute('INSERT INTO WORDS (POS, WORD) VALUES( ?, ? )', (pos, word))
    
        conn.commit()
    
    words = dict()
    pos = 0       # running word position across the whole file
    buffered = 0  # words collected since the last flush
    recomp = re.compile(r'\w+')
    with open(INPUT_FILENAME, 'r') as f:
        for line in f:
    
            for word in [x.lower() for x in recomp.findall(line) if x]:
                pos += 1
                buffered += 1
                if word in words:
                    words[word].append(pos)
                else:
                    words[word] = [pos]
            # flush periodically so memory use stays bounded
            if buffered >= FLUSH_X_WORDS:
                flush_words(words)
                words = dict()
                buffered = 0
        if len(words) > 0:
            flush_words(words)
            words = dict()
    
    
    # Create the indexes only after the bulk load; inserting into an
    # already-indexed table would be much slower.
    cursor.execute("""
    CREATE UNIQUE INDEX I_WORDS_WORD_POS ON WORDS ( WORD, POS )
    """)
    
    cursor.execute("""
    CREATE UNIQUE INDEX I_WORDS_POS_WORD ON WORDS ( POS, WORD )
    """)
    
    cursor.execute("""
    VACUUM
    """)
    
    cursor.execute("""
    ANALYZE WORDS
    """)
    
    conn.commit()
    conn.close()
    

    Then search the database by generating SQL:

    import sqlite3
    import re
    
    SEARCH_PHRASE = 'how fast it is'
    DB_NAME = 'words.db'
    
    
    conn = sqlite3.connect(DB_NAME)
    cursor = conn.cursor()
    
    recomp = re.compile(r'\w+')
    
    search_list = [x.lower() for x in recomp.findall(SEARCH_PHRASE) if x]
    
    from_clause = 'FROM\n'
    where_clause = 'WHERE\n'
    num = 0
    fsep = '     '
    wsep = '     '
    for word in search_list:
        num += 1
        from_clause += '{fsep}words w{num}\n'.format(fsep=fsep, num=num)
        # Interpolating the word into the SQL is safe here only because
        # \w+ can never match a quote character; bound parameters would
        # be safer still.
        where_clause += "{wsep} w{num}.word = '{word}'\n".format(wsep=wsep, num=num, word=word)
        if num > 1:
            where_clause += '  AND w{num}.pos = w{lastnum}.pos + 1\n'.format(num=num, lastnum=num - 1)
    
        fsep = '    ,'
        wsep = '  AND'
    
    
    sql = """{select}{fromc}{where}""".format(select='SELECT COUNT(*)\n',fromc=from_clause, where=where_clause)
    
    res = cursor.execute(sql)
    
    print(res.fetchone()[0])
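
    For the sample phrase 'how fast it is', the script generates a self-join query along these lines, one alias per word, chained on consecutive positions:

    SELECT COUNT(*)
    FROM
         words w1
        ,words w2
        ,words w3
        ,words w4
    WHERE
          w1.word = 'how'
      AND w2.word = 'fast'
      AND w2.pos = w1.pos + 1
      AND w3.word = 'it'
      AND w3.pos = w2.pos + 1
      AND w4.word = 'is'
      AND w4.pos = w3.pos + 1

    Once the database is built, each new phrase search costs only index lookups, so repeated queries are cheap compared with rescanning the 1000 GB file.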
    
