Processing Large Files in Python [1000 GB or More]

佛祖请我去吃肉 2020-12-15 06:27

Let's say I have a text file of 1000 GB. I need to find how many times a phrase occurs in the text.

Is there any faster way to do this than the one I am using below?
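
For context, the asker's actual snippet is not reproduced in this excerpt; a plain single-pass scan of the kind the question describes might look roughly like the sketch below, where the file name, phrase, and chunk size are placeholders:

    def count_phrase(path, phrase, chunk_size=1 << 24):
        """Stream the file and count (non-overlapping) occurrences of `phrase`."""
        count = 0
        tail = ""  # carry-over so matches straddling a chunk boundary are found
        with open(path, "r", encoding="utf-8", errors="ignore") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                buf = tail + chunk
                count += buf.count(phrase)
                # a full match cannot fit inside the last len(phrase) - 1 chars,
                # so carrying them over never double-counts
                tail = buf[-(len(phrase) - 1):] if len(phrase) > 1 else ""
        return count

    print(count_phrase("huge.txt", "some phrase"))

This is I/O-bound: every query rereads the whole 1000 GB, which is why the answers below focus on doing the expensive pass only once.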

8 Answers
  •  遥遥无期
    2020-12-15 07:22

    Have you considered indexing your file? The way search engines work is by creating a mapping from words to their locations in the file. Say you have this file:

    Foo bar baz dar. Dar bar haa.
    

    You create an index that looks like this:

    {
        "foo": {0},
        "bar": {4, 21},
        "baz": {8},
        "dar": {12, 17},
        "haa": {25},
    }
    

    A hash-table index can be looked up in O(1), so it's freaking fast.

    When someone searches for the query "bar baz", you first break the query into its constituent words: ["bar", "baz"]. You then look up {4, 21} and {8} and use those offsets to jump right to the places where the queried text could possibly exist, as sketched below.
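
    A rough sketch of that idea in Python, assuming the file is tokenized on word characters and the index fits in memory (for a 1000 GB file it would really be built once and persisted, e.g. in a database or split into shards); build_index and find_phrase are names made up for this illustration:

    import re
    from collections import defaultdict

    def build_index(path):
        """One pass over the file: map each lowercased word to the set of
        byte offsets at which it starts."""
        index = defaultdict(set)
        offset = 0
        with open(path, "rb") as f:
            for line in f:
                for m in re.finditer(rb"\w+", line):
                    word = m.group().lower().decode("utf-8", "ignore")
                    index[word].add(offset + m.start())
                offset += len(line)
        return index

    def find_phrase(index, path, phrase):
        """Jump only to offsets where the phrase's first word occurs and
        verify the full phrase at each candidate position."""
        words = re.findall(r"\w+", phrase.lower())
        if not words or words[0] not in index:
            return []
        needle = phrase.lower().encode("utf-8")
        hits = []
        with open(path, "rb") as f:
            for pos in sorted(index[words[0]]):
                f.seek(pos)
                if f.read(len(needle)).lower() == needle:
                    hits.append(pos)
        return hits

    On the toy file above, build_index produces essentially the hand-written mapping, and a query like find_phrase(index, path, "bar baz") only has to check the offsets stored under "bar" ({4, 21}) instead of rescanning 1000 GB.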

    There are out-of-the-box solutions for indexed search as well, for example Solr or ElasticSearch.
