I have this quite big CSV file (15 GB) and I need to read about 1 million random lines from it. As far as I can see - and implement - the CSV utility in Python only allows t
If you can place this data in a SQLite database, selecting some number of random rows is trivial. You will not need to pre-read or pad lines in the file. Since SQLite data files are binary, your data file should be roughly 1/3 to 1/2 smaller than the CSV text.
You can use a script like THIS to import the CSV file or, better still, just write your data to a database table in the first place. The sqlite3 module is part of the Python standard library.
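If you would rather write the import yourself, here is a minimal sketch of one way to do it with the standard csv and sqlite3 modules. The function name import_csv, the batch size, and the assumptions that the file has a header row and that every row has the same number of fields are mine, not part of the linked script. All values are stored as text, since no column types are declared.

```python
import csv
import sqlite3


def import_csv(csv_path, db_path, table="csv", batch_size=50_000):
    """Load a CSV file (with a header row) into a SQLite table, in batches.

    Hypothetical helper: assumes every row has as many fields as the header.
    """
    con = sqlite3.connect(db_path)
    try:
        with open(csv_path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            # Quote identifiers so odd column names do not break the SQL.
            cols = ", ".join('"{}"'.format(c.replace('"', '""')) for c in header)
            marks = ", ".join("?" for _ in header)
            con.execute('CREATE TABLE IF NOT EXISTS "{}" ({})'.format(table, cols))
            insert = 'INSERT INTO "{}" VALUES ({})'.format(table, marks)
            batch = []
            for row in reader:
                batch.append(row)
                # Insert in chunks so the whole file never sits in memory.
                if len(batch) >= batch_size:
                    con.executemany(insert, batch)
                    batch.clear()
            if batch:
                con.executemany(insert, batch)
        con.commit()
    finally:
        con.close()
```

Calling import_csv('data.csv', 'csv.db') would then create the csv table that the query further down reads from; both file names are placeholders.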
Then use these statements to get 1,000,000 random rows:
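    import sqlite3

    mydb = 'csv.db'
    con = sqlite3.connect(mydb)
    with con:
        cur = con.cursor()
        cur.execute("SELECT * FROM csv ORDER BY RANDOM() LIMIT 1000000;")
        # Iterate over the cursor instead of calling fetchall(), so the
        # million rows are not all held in memory at once.
        for row in cur:
            pass  # now you have random rows...
    con.close()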
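If you need different sample sizes, the same query can be wrapped in a small function. This is only a sketch: sample_rows is a name I made up, and the usage below runs against a throwaway in-memory table rather than your real data.

```python
import sqlite3


def sample_rows(con, table, n):
    """Yield up to n randomly chosen rows from table (hypothetical helper)."""
    # Table names cannot be bound as parameters, so the name is quoted;
    # the row count is bound with a ? placeholder.
    query = 'SELECT * FROM "{}" ORDER BY RANDOM() LIMIT ?'.format(
        table.replace('"', '""')
    )
    yield from con.execute(query, (n,))


# Usage on a small in-memory table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE csv (id INTEGER, value TEXT)")
con.executemany(
    "INSERT INTO csv VALUES (?, ?)",
    [(i, "row{}".format(i)) for i in range(100)],
)
picked = list(sample_rows(con, "csv", 10))
```

Keep in mind that ORDER BY RANDOM() still has to visit every row in the table, so on a table built from a 15 GB file the query will take a while, even though it needs no extra bookkeeping on your side.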