Read random lines from huge CSV file in Python

独厮守ぢ 2020-12-05 02:27

I have this quite big CSV file (15 GB) and I need to read about 1 million random lines from it. As far as I can see - and implement - the CSV utility in Python only allows t

11 answers
  •  时光说笑
    2020-12-05 02:50

    If you can place this data in a SQLite database, selecting some number of random rows is trivial. You will not need to pre-read or pad lines in the file. Since SQLite data files are binary, your data file will be 1/3 to 1/2 smaller than the CSV text.

    You can use a script like THIS to import the CSV file or, better still, just write your data to a database table in the first place. The sqlite3 module is part of the Python standard library.
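The import step could look roughly like the sketch below. It is not the script the answer links to. It assumes the CSV has a header row and that every data row has the same number of fields as the header. The function name `import_csv`, its parameters, and the batch size are hypothetical choices for illustration; the table name `csv` matches the query used in this answer.

```python
import csv
import sqlite3


def import_csv(csv_path, db_path, table="csv", batch_size=50_000):
    """Load a CSV file with a header row into a SQLite table, in batches.

    Hypothetical helper: assumes every row has as many fields as the header.
    """
    con = sqlite3.connect(db_path)
    try:
        with open(csv_path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            # Quote column names so headers containing spaces still work.
            cols = ", ".join('"{}"'.format(h.replace('"', '""')) for h in header)
            marks = ", ".join("?" for _ in header)
            con.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
            insert = f'INSERT INTO "{table}" VALUES ({marks})'
            batch = []
            for row in reader:
                batch.append(row)
                # Insert in batches so the whole file is never held in memory.
                if len(batch) >= batch_size:
                    con.executemany(insert, batch)
                    batch.clear()
            if batch:
                con.executemany(insert, batch)
        con.commit()
    finally:
        con.close()
```

A call such as `import_csv("data.csv", "csv.db")` (both file names are placeholders) would read the file once, sequentially, so memory use stays bounded by the batch size rather than by the 15 GB input.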

    Then use these statements to get 1,000,000 random rows:

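    import sqlite3

    mydb = 'csv.db'
    con = sqlite3.connect(mydb)

    # "with con" wraps the block in a transaction; it does not close the connection
    with con:
        cur = con.cursor()
        cur.execute("SELECT * FROM csv ORDER BY RANDOM() LIMIT 1000000;")

        for row in cur.fetchall():
            # each row is a tuple holding one randomly chosen record
            print(row)

    con.close()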
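Note that `fetchall()` builds a list of all selected rows before the loop starts. If a million rows in memory at once is a concern, one possible variation is to iterate over the cursor directly, as in the sketch below. The generator name `iter_random_rows` and its parameters are hypothetical; the query is the same `ORDER BY RANDOM() LIMIT` statement, with the row count passed as a bound parameter.

```python
import sqlite3


def iter_random_rows(db_path, n, table="csv"):
    """Yield up to n randomly chosen rows from the table, one at a time.

    Hypothetical helper built around the ORDER BY RANDOM() query above.
    """
    con = sqlite3.connect(db_path)
    try:
        cur = con.execute(
            f'SELECT * FROM "{table}" ORDER BY RANDOM() LIMIT ?', (n,)
        )
        # Iterating the cursor fetches rows lazily instead of building a list.
        for row in cur:
            yield row
    finally:
        con.close()
```

Usage would be along the lines of `for row in iter_random_rows("csv.db", 1_000_000): ...`. SQLite still has to scan every row of the table to pick the sample; the change only affects how many result rows Python holds at a time.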
    
