Process very large (>20GB) text file line by line

慢半拍i 2020-11-29 17:54

I have a number of very large text files which I need to process, the largest being about 60GB.

Each line has 54 characters in seven fields and I want to remove the last three characters from each of the first three fields, which should reduce the file size by about 20%.
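The straightforward approach is to stream the file and rewrite it line by line. A minimal sketch, with hypothetical filenames and assuming whitespace-separated fields:

    with open('input.txt') as src, open('output.txt', 'w') as dst:
        for line in src:                               # reads one buffered line at a time
            fields = line.split()                      # seven whitespace-separated fields (assumed)
            fields[:3] = [f[:-3] for f in fields[:3]]  # drop the last three chars of the first three fields
            dst.write(' '.join(fields) + '\n')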

11 Answers
  •  [愿得一人]
    2020-11-29 18:29

    Those seem like very large files... Why are they so large? What processing are you doing per line? Why not use a database with some map-reduce calls (if appropriate) or simple operations on the data? The point of a database is to abstract the handling and management of large amounts of data that can't all fit in memory.

    You can start to play with the idea with sqlite3, which just uses flat files as databases. If you find the idea useful, then upgrade to something a little more robust and versatile, like PostgreSQL.

    Create a database

     import sqlite3

     conn = sqlite3.connect('pts.db')  # creates pts.db if it doesn't exist
     c = conn.cursor()
    

    Create a table

    c.execute('''CREATE TABLE ptsdata (filename, lineNumber, x, y, z)''')
    

    Then use one of the line-reading approaches above to insert all the lines and points into the database by calling

    c.execute("INSERT INTO ptsdata VALUES (filename, lineNumber, x, y, z)")
    
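    A hypothetical loading loop along those lines, streaming the source file so only the current line is ever in memory (the input filename and field layout are assumptions):

    filename = 'file.txt'                        # hypothetical input file
    with open(filename) as src:
        for lineNumber, line in enumerate(src, start=1):
            x, y, z = line.split()[:3]           # assumed: coordinates in the first three fields
            c.execute("INSERT INTO ptsdata VALUES (?, ?, ?, ?, ?)",
                      (filename, lineNumber, x, y, z))
    conn.commit()                                # one commit at the end keeps the bulk insert fast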

    Now how you use it depends on what you want to do. For example, to work with all the points in a given file, run a query

    c.execute("SELECT lineNumber, x, y, z FROM ptsdata WHERE filename=file.txt ORDER BY lineNumber ASC")
    

    And get n lines at a time from this query with

    c.fetchmany(size=n)
    
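    For example, a simple paging loop (the batch size n is a hypothetical choice; pick whatever your per-batch processing can handle):

    n = 1000                          # hypothetical batch size
    while True:
        rows = c.fetchmany(size=n)
        if not rows:                  # an empty list means the query is exhausted
            break
        for lineNumber, x, y, z in rows:
            ...                       # process one point here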

    I'm sure there is a better wrapper for the SQL statements somewhere, but you get the idea.
