Processing Large Files in Python [1000 GB or More]

佛祖请我去吃肉 2020-12-15 06:27

Let's say I have a text file of 1000 GB. I need to find how many times a phrase occurs in the text.

Is there any faster way to do this than the one I am using below?
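
For context, the asker's actual snippet is not reproduced in this excerpt; a plain single-pass scan of the kind the question describes might look roughly like the sketch below, where the file name, phrase, and chunk size are placeholders:

    def count_phrase(path, phrase, chunk_size=1 << 24):
        """Stream the file and count (non-overlapping) occurrences of `phrase`."""
        count = 0
        tail = ""  # carry-over so matches straddling a chunk boundary are found
        with open(path, "r", encoding="utf-8", errors="ignore") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                buf = tail + chunk
                count += buf.count(phrase)
                # a full match cannot fit inside the last len(phrase) - 1 chars,
                # so carrying them over never double-counts
                tail = buf[-(len(phrase) - 1):] if len(phrase) > 1 else ""
        return count

    print(count_phrase("huge.txt", "some phrase"))

This is I/O-bound: every query rereads the whole 1000 GB, which is why the answers below focus on doing the expensive pass only once.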

8 Answers
  •  遥遥无期
    2020-12-15 07:22

    Have you considered indexing your file? The way search engines work is by creating a mapping from words to their locations in the file. Say you have this file:

    Foo bar baz dar. Dar bar haa.
    

    You create an index that looks like this:

    {
        "foo": {0},
        "bar": {4, 21},
        "baz": {8},
        "dar": {12, 17},
        "haa": {25},
    }
    

    A hash-table index can be looked up in O(1), so it's freaking fast.

    When someone searches for the query "bar baz", you first break the query into its constituent words: ["bar", "baz"]. You then look up {4, 21} and {8} and use those offsets to jump right to the places where the queried text could possibly exist, as sketched below.
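
    A rough sketch of that idea in Python, assuming the file is tokenized on word characters and the index fits in memory (for a 1000 GB file it would really be built once and persisted, e.g. in a database or split into shards); build_index and find_phrase are names made up for this illustration:

    import re
    from collections import defaultdict

    def build_index(path):
        """One pass over the file: map each lowercased word to the set of
        byte offsets at which it starts."""
        index = defaultdict(set)
        offset = 0
        with open(path, "rb") as f:
            for line in f:
                for m in re.finditer(rb"\w+", line):
                    word = m.group().lower().decode("utf-8", "ignore")
                    index[word].add(offset + m.start())
                offset += len(line)
        return index

    def find_phrase(index, path, phrase):
        """Jump only to offsets where the phrase's first word occurs and
        verify the full phrase at each candidate position."""
        words = re.findall(r"\w+", phrase.lower())
        if not words or words[0] not in index:
            return []
        needle = phrase.lower().encode("utf-8")
        hits = []
        with open(path, "rb") as f:
            for pos in sorted(index[words[0]]):
                f.seek(pos)
                if f.read(len(needle)).lower() == needle:
                    hits.append(pos)
        return hits

    On the toy file above, build_index produces essentially the hand-written mapping, and a query like find_phrase(index, path, "bar baz") only has to check the offsets stored under "bar" ({4, 21}) instead of rescanning 1000 GB.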

    There are out-of-the-box solutions for indexed search as well, for example Solr or ElasticSearch.
