How to `sort` strings depending on the output value of a program?

我只是一个虾纸丫 posted on 2020-05-17 08:49:39

Question


I have text files containing many strings, one per line, that need to be sorted.

I am trying to use the sort command, but it can only sort in alphabetic or numeric order, and I need something more specific.

Is it possible to use an external program to determine the ordering of items, something like sort --input=text.txt --evaluate=/bin/program?


Answer 1:


It is not supported by sort, or any sorting software that I know of.

It's not practically feasible because starting a process is too resource intensive. Starting the thousands of processes required to compare thousands of pairs of strings would pretty much freeze the system for a short while.

How does sorting software work?

Consider a small text:

dog
cat
duck
mouse
...

Sorting requires comparing pairs of keys, like dog vs cat, then dog vs duck, etc., to determine the relative order of items. It takes between N and N*N comparisons depending on the algorithm and whether the items are already ordered (typically about N*log(N) for general-purpose algorithms).
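To get a feel for how many comparisons a real sort makes, here is a minimal Python sketch that counts them. The helper name count_comparisons and the sample data are made up for illustration; the exact counts depend on the algorithm (Timsort in CPython) and on the input.

```python
import random
from functools import cmp_to_key

def count_comparisons(items):
    """Sort a copy of items and return how many pairwise comparisons were made."""
    calls = 0

    def compare(first, second):
        nonlocal calls
        calls += 1
        # -1 if first < second, 0 if equal, +1 if first > second
        return (first > second) - (first < second)

    sorted(items, key=cmp_to_key(compare))
    return calls

n = 1000
ordered = [f"item{i:04d}" for i in range(n)]
shuffled = ordered[:]
random.Random(0).shuffle(shuffled)  # fixed seed so runs are repeatable

ordered_count = count_comparisons(ordered)
shuffled_count = count_comparisons(shuffled)
print(f"already ordered: {ordered_count} comparisons")
print(f"shuffled:        {shuffled_count} comparisons")
```

The already ordered input should need close to N comparisons, while the shuffled input needs several times more, but still far fewer than N*N.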

In programming languages that provide a built-in sorting function, the developer can usually supply a comparator function like int comp(string first, string second) that returns -1, 0 or +1 if the two strings are respectively in order, equal or in reverse order. (The equal case is very important for duplicates and stable sorting.) See Python sorted(..., key=functools.cmp_to_key(comp)), or C++ std::sort(..., comp), where comp instead returns a bool meaning "first goes before second".
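As a small illustration of such a comparator in Python, the sketch below wraps a three-way comparison with functools.cmp_to_key so that sorted() can use it. The ordering rule (shorter strings first, ties broken alphabetically) and the name compare_by_length are arbitrary choices for the example.

```python
from functools import cmp_to_key

def compare_by_length(first, second):
    """Return -1, 0 or +1: shorter strings first, ties broken alphabetically."""
    if len(first) != len(second):
        return -1 if len(first) < len(second) else 1
    return (first > second) - (first < second)

words = ["dog", "cat", "duck", "mouse", "ox"]
result = sorted(words, key=cmp_to_key(compare_by_length))
print(result)  # ['ox', 'cat', 'dog', 'duck', 'mouse']
```

The comparator runs inside the sorting process, so each call costs well under a microsecond's worth of process overhead, which is what makes this approach practical.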

It's theoretically possible to do the comparison with an external binary, called as /bin/comparator firstitem seconditem, and its exit code (ignoring issues with arguments being limited to short strings in a subset of ASCII characters).
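For the record, this is roughly what that would look like in Python. It is only a sketch: the external comparator is simulated by a one-line script run with the current Python interpreter, and the exit code convention (0 = in order, 1 = equal, 2 = reverse order) is an assumption made up for this example, not something sort or any real tool defines.

```python
import subprocess
import sys
from functools import cmp_to_key

# Stand-in for the external comparator binary (hypothetical).
# Assumed exit codes: 0 = in order, 1 = equal, 2 = reverse order.
COMPARATOR_SCRIPT = (
    "import sys; a, b = sys.argv[1:3]; "
    "sys.exit(1 if a == b else (0 if a < b else 2))"
)
EXIT_CODE_TO_RESULT = {0: -1, 1: 0, 2: 1}

def external_compare(first, second):
    """Start one subprocess per comparison and translate its exit code."""
    completed = subprocess.run(
        [sys.executable, "-c", COMPARATOR_SCRIPT, first, second]
    )
    return EXIT_CODE_TO_RESULT[completed.returncode]

words = ["dog", "cat", "duck", "mouse"]
result = sorted(words, key=cmp_to_key(external_compare))
print(result)  # ['cat', 'dog', 'duck', 'mouse']
```

It works on four words, but every single comparison pays for a full process start, which is exactly the cost the rest of this answer is about.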

It's practically too slow and will freeze the system the moment the sorting begins. It has to start a subprocess for (up to) every pair of strings. Starting a process is a very slow and very intensive task for the OS.

How slow can starting a process be?

A process can take on the order of 10 - 100 milliseconds to initialize, even on a modern, fast CPU (the exact figure depends on the OS and on the program being started). Sorting a few thousand strings could take whole minutes (tens of thousands to millions of comparisons, depending on the algorithm), whereas a normal in-memory sort completes in milliseconds.
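You can measure this on your own machine. The sketch below times how long it takes to start a trivial child process and then extrapolates to a full sort, assuming about N*log2(N) comparisons. The function names are made up for this example, and the child here is a Python interpreter, which is heavier to start than a small native binary, so treat the number as an upper-end illustration.

```python
import math
import subprocess
import sys
import time

def average_spawn_seconds(runs=5):
    """Average wall-clock time to start a trivial child process and wait for it."""
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return (time.perf_counter() - start) / runs

def estimated_sort_seconds(item_count, seconds_per_comparison):
    """Rough estimate assuming about N * log2(N) comparisons."""
    comparisons = item_count * math.log2(item_count)
    return comparisons * seconds_per_comparison

spawn = average_spawn_seconds()
print(f"one process start: {spawn * 1000:.1f} ms")
print(f"5000 items, one process per comparison: "
      f"{estimated_sort_seconds(5000, spawn):.0f} s")
```

As a worked figure: 1024 items at an assumed 10 ms per comparison gives 1024 * 10 * 0.01 = 102.4 seconds, for a sort that normally finishes in about a millisecond.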

It is incredibly slow and inefficient to use an external binary for comparison; it doesn't make any sense to try. But it gets worse, bear with me.

Process creation is a very intensive task involving deep changes in the OS and kernel. The system will grind to a halt, barely responsive if at all, while processes are created endlessly. In this regard it differs from merely burning CPU (with a while(1) loop, for example), which is undeniably "bad" but doesn't dramatically affect other running tasks.

Developers who have had to implement worker pools, creating a hundred processes (or a thousand threads) to do some work, may have noticed that their desktop freezes pretty badly for a few seconds while the pool is created. It's so bad in fact that the common practice is to put in a hard sleep to alleviate the system load: for (n = 0; n < 100; n++) { startworker(); sleep(100ms); }. (Needless to say, sorting software would never complete if it were limited to a few comparisons per second like that.)
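A runnable version of that staggered start could look like the following Python sketch. The function name start_workers_staggered is hypothetical, the "worker" is just a child interpreter that exits immediately, and the count and delay are kept small so the example finishes quickly.

```python
import subprocess
import sys
import time

def start_workers_staggered(count, delay_seconds):
    """Start `count` child processes, pausing between starts to spread the load."""
    workers = []
    for index in range(count):
        # Placeholder worker: a child interpreter that does nothing and exits.
        workers.append(subprocess.Popen([sys.executable, "-c", "pass"]))
        if index < count - 1:
            time.sleep(delay_seconds)
    return workers

workers = start_workers_staggered(count=3, delay_seconds=0.1)
exit_codes = [worker.wait() for worker in workers]
print(exit_codes)  # [0, 0, 0]
```

With a 100 ms pause per start, a hundred workers take about ten seconds to come up, which is acceptable for a pool created once but hopeless for a comparison that has to run thousands of times.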

For historical reference: a common way to run web applications in the early days of the web was CGI, a simple interface that starts a process for each incoming HTTP request and uses standard input/output and environment variables to pass request info. It suffered from all the problems above (slowness, inefficiency and related DDoS issues) and quickly fell out of use. It does not work well beyond a couple of requests per second.



Source: https://stackoverflow.com/questions/61837691/how-to-sort-strings-depending-on-the-output-value-of-a-program
