Memory-constrained external sorting of strings, with duplicates combined and counted, on a critical server (billions of filenames)

梦谈多话 2020-11-28 13:15

Our server produces files like {c521c143-2a23-42ef-89d1-557915e2323a}-sign.xml in its log folder. The first part is a GUID; the second part is the name template.
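To make the filename layout concrete, here is a minimal sketch of pulling the name template out of such a filename. It assumes every name starts with a brace-wrapped GUID followed by a hyphen, as in the example above; the helper name `name_template` and the regular expression are illustrative, not something the server itself provides.

```python
import re

# Assumed layout: "{" + GUID + "}" + "-" + name template.
GUID_PREFIX = re.compile(
    r"^\{[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\}-"
)


def name_template(filename: str) -> str:
    """Return the part after the GUID prefix (hypothetical helper).

    If the filename does not start with a GUID prefix, it is returned unchanged.
    """
    return GUID_PREFIX.sub("", filename, count=1)


print(name_template("{c521c143-2a23-42ef-89d1-557915e2323a}-sign.xml"))
```

Stripping the GUID first means the counting step only has to compare the templates, which is what should be combined and counted.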

4 answers
  •  时光取名叫无心
    2020-11-28 13:39

    How do you "merge the group files" in your approach? In the worst case, every line has a different name template, so each group file has 5,000 lines in it, and each merge doubles the number of lines until you overflow memory.

    Your friend is closer to the answer: those intermediate files need to be sorted, so that you can read them line by line and merge them into new files without having to hold them all in memory. This is a well-known problem called an external sort. Once the data is sorted, identical names sit next to each other, so you can count them in a single pass.
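The external sort described in this answer can be sketched roughly as follows. This is an illustration under assumptions, not a tuned implementation: the function names (`write_sorted_runs`, `count_sorted_runs`, `count_names`) are made up for the example, the 5,000-line chunk size is taken from the question, and names are assumed to contain no newline characters. Each chunk is sorted in memory and written out as a "run"; the runs are then merged lazily with `heapq.merge`, and `itertools.groupby` counts adjacent duplicates.

```python
import heapq
import itertools
import os
import tempfile
from typing import Iterable, Iterator, List, Tuple


def write_sorted_runs(lines: Iterable[str], chunk_size: int, tmp_dir: str) -> List[str]:
    """Sort the input chunk by chunk and write each sorted chunk to its own file."""
    paths = []
    it = iter(lines)
    while True:
        # Only chunk_size lines are held in memory at any time.
        chunk = sorted(itertools.islice(it, chunk_size))
        if not chunk:
            break
        fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".run", text=True)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.writelines(line + "\n" for line in chunk)
        paths.append(path)
    return paths


def count_sorted_runs(paths: List[str]) -> Iterator[Tuple[str, int]]:
    """Merge sorted run files line by line and yield (name, count) pairs."""
    files = [open(p, encoding="utf-8") for p in paths]
    try:
        # heapq.merge reads one line per file at a time, so memory stays small.
        merged = heapq.merge(*files)
        for line, group in itertools.groupby(merged):
            yield line.rstrip("\n"), sum(1 for _ in group)
    finally:
        for f in files:
            f.close()


def count_names(lines: Iterable[str], chunk_size: int = 5000) -> Iterator[Tuple[str, int]]:
    """External sort followed by a single counting pass over the merged output."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        paths = write_sorted_runs(lines, chunk_size, tmp_dir)
        yield from count_sorted_runs(paths)


names = ["b.xml", "a.xml", "b.xml", "c.xml", "a.xml", "b.xml"]
for name, count in count_names(names, chunk_size=2):
    print(name, count)
```

Note that this sketch opens every run file at once. With billions of names and small chunks, that would likely exceed the operating system's limit on open files, so a real run would probably need a larger chunk size or several merge passes, each merging a bounded number of runs into a new, longer run.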
