Process very large (>20GB) text file line by line

慢半拍i 2020-11-29 17:54

I have a number of very large text files which I need to process, the largest being about 60GB.

Each line has 54 characters in seven fields and I want to remove the last three characters from each of the first three fields, which should reduce the file size by about 20%.
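The straightforward approach is to stream the file and rewrite it line by line. A minimal sketch, with hypothetical filenames and assuming whitespace-separated fields:

    with open('input.txt') as src, open('output.txt', 'w') as dst:
        for line in src:                               # reads one buffered line at a time
            fields = line.split()                      # seven whitespace-separated fields (assumed)
            fields[:3] = [f[:-3] for f in fields[:3]]  # drop the last three chars of the first three fields
            dst.write(' '.join(fields) + '\n')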

11 Answers
  •  [愿得一人]
    2020-11-29 18:29

    Those seem like very large files... Why are they so large? What processing are you doing per line? Why not use a database with some map-reduce calls (if appropriate) or simple operations on the data? The point of a database is to abstract the handling and management of large amounts of data that can't all fit in memory.

    You can start to play with the idea with sqlite3, which just uses flat files as databases. If you find the idea useful, then upgrade to something a little more robust and versatile, like PostgreSQL.

    Create a database

     import sqlite3

     conn = sqlite3.connect('pts.db')  # creates pts.db if it doesn't exist
     c = conn.cursor()
    

    Create a table

    c.execute('''CREATE TABLE ptsdata (filename, lineNumber, x, y, z)''')
    

    Then use one of the line-reading approaches above to insert all the lines and points into the database by calling

    c.execute("INSERT INTO ptsdata VALUES (filename, lineNumber, x, y, z)")
    
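    A hypothetical loading loop along those lines, streaming the source file so only the current line is ever in memory (the input filename and field layout are assumptions):

    filename = 'file.txt'                        # hypothetical input file
    with open(filename) as src:
        for lineNumber, line in enumerate(src, start=1):
            x, y, z = line.split()[:3]           # assumed: coordinates in the first three fields
            c.execute("INSERT INTO ptsdata VALUES (?, ?, ?, ?, ?)",
                      (filename, lineNumber, x, y, z))
    conn.commit()                                # one commit at the end keeps the bulk insert fast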

    Now how you use it depends on what you want to do. For example, to work with all the points in a given file, run a query

    c.execute("SELECT lineNumber, x, y, z FROM ptsdata WHERE filename=file.txt ORDER BY lineNumber ASC")
    

    And get n lines at a time from this query with

    c.fetchmany(size=n)
    
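    For example, a simple paging loop (the batch size n is a hypothetical choice; pick whatever your per-batch processing can handle):

    n = 1000                          # hypothetical batch size
    while True:
        rows = c.fetchmany(size=n)
        if not rows:                  # an empty list means the query is exhausted
            break
        for lineNumber, x, y, z in rows:
            ...                       # process one point here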

    I'm sure there is a better wrapper for the SQL statements somewhere, but you get the idea.
