Database sort vs. programmatic Java sort

自闭症患者 2020-12-04 15:28

I want to retrieve data from a MySQL database through JPA, sorted by the value of some column.

So, what is the best practice, to:

  • Retrieve the data from t
9 answers
  •  被撕碎了的回忆
    2020-12-04 15:41

    I ran into this very same question, and decided that I should run a little benchmark to quantify the speed differences. The results surprised me. I would like to post my experience with this very sort of question.

    As with a number of the other posters here, my thought was that the database layer would do the sort faster because databases are supposedly tuned for this sort of thing. @Alex made a good point that if the database already has an index on the sort column, then the database will be faster. I wanted to answer the question of which raw sort is faster on non-indexed columns. Note that I said faster, not simpler. I think in many cases letting the db do the work is simpler and less error prone.
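
To make the "simpler and less error prone" point concrete, here is a minimal sketch of what an application-side sort of fetched rows could look like. The `Row` and `AppSideSort` names are hypothetical and not from the benchmark. Because the columns are nullable, the comparator has to decide where NULLs go. It puts them first, which is what MySQL does for an ascending ORDER BY, and it breaks ties by `pk` purely to make the output deterministic (an assumption of this sketch, not something the database promises).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class AppSideSort {
    /** Hypothetical row type mirroring two columns of the test table. */
    static final class Row {
        final long pk;
        final Double floatValue; // nullable, like a DOUBLE NULL column

        Row(long pk, Double floatValue) {
            this.pk = pk;
            this.floatValue = floatValue;
        }
    }

    /** Ascending by floatValue with NULLs first; ties broken by pk (sketch assumption). */
    static final Comparator<Row> BY_FLOAT_VALUE =
            Comparator.comparing((Row r) -> r.floatValue,
                            Comparator.nullsFirst(Comparator.<Double>naturalOrder()))
                    .thenComparingLong(r -> r.pk);

    /** Returns a sorted copy and leaves the caller's list untouched. */
    static List<Row> sorted(List<Row> rows) {
        List<Row> copy = new ArrayList<>(rows);
        copy.sort(BY_FLOAT_VALUE);
        return copy;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row(1, 0.75), new Row(2, null), new Row(3, 0.25));
        for (Row r : sorted(rows)) {
            System.out.println(r.pk + " " + r.floatValue);
        }
    }
}
```

Forgetting the null handling here would throw a NullPointerException at sort time, which is the kind of detail the database takes care of for you.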

    My main assumption was that the data to sort would fit in main memory. Not all problems fit this assumption, but a good number do. For sorts that do not fit in memory, databases may well shine, though I did not test that. For in-memory sorts, Java, C and C++ all outperformed MySQL in my informal benchmark, if one could call it that.
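
As a rough check of the "fits in main memory" assumption for the element count used in the runs below: 10,000,000 doubles at 8 bytes each is 80,000,000 bytes, about 76 MiB. The sketch below does that arithmetic and compares it with the JVM's maximum heap. The class and method names are made up for illustration, and the small array header is ignored.

```java
final class SortFootprint {
    /** Bytes of payload in a double[] of the given length (array header not included). */
    static long payloadBytes(long elements) {
        return elements * Double.BYTES;
    }

    public static void main(String[] args) {
        long elements = 10_000_000L; // element count used in the timing runs
        long bytes = payloadBytes(elements);
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.printf("payload: %d bytes (%.1f MiB)%n", bytes, bytes / (1024.0 * 1024.0));
        System.out.println("fits in max heap: " + (bytes < maxHeap));
    }
}
```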

    I wish I had had more time to more thoroughly compare the database layer vs application layer, but alas other duties called. Still, I couldn't help but record this note for others who are traveling down this road.

    As I started down this path I started to see more hurdles. Should I compare data transfer? How? Can I compare the time to read from the db with the time to read a flat file in Java? How do I isolate the sort time from the data transfer time and the time to read the records? With these questions in mind, here are the methodology and timing numbers I came up with.

    All times in ms unless otherwise posted

    All sort routines were the defaults provided by the language (these are good enough for randomly ordered data)

    All compilation was with a typical "release-profile" selected via NetBeans with no customization unless otherwise posted

    All tests for MySQL used the following schema

     mysql> CREATE TABLE test_1000000
     (
     pk bigint(11) NOT NULL,
     float_value DOUBLE NULL,
     bigint_value     bigint(11)  NULL,
     PRIMARY KEY (pk )
     ) Engine MyISAM;
    
    mysql> describe test_1000000;
    +--------------+------------+------+-----+---------+-------+
    | Field        | Type       | Null | Key | Default | Extra |
    +--------------+------------+------+-----+---------+-------+
    | pk           | bigint(11) | NO   | PRI | NULL    |       |
    | float_value  | double     | YES  |     | NULL    |       |
    | bigint_value | bigint(11) | YES  |     | NULL    |       |
    +--------------+------------+------+-----+---------+-------+
    

    First here is a little snippet to populate the DB. There may be easier ways, but this is what I did:

    // Fills the table with `iterations` rows of random values, one call per row.
    // PerformQuery (not shown) is expected to insert a single row.
    public static void BuildTable(Connection conn, String tableName, long iterations) {
        Random ran = new Random();
        try {
            for (long i = 0; i < iterations; i++) {
                if (i % 100000 == 0) {
                    System.out.println(i + " next 100k"); // progress marker
                }
                PerformQuery(conn, tableName, i, ran.nextDouble(), ran.nextLong());
            }
        } catch (Exception e) {
            logger.error("Caught General Exception Error from main " + e);
        }
    }
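
The body of PerformQuery is not shown in this answer. As one of the "easier ways" hinted at above, here is a hedged sketch of how the same table could be populated with a PreparedStatement and JDBC batching. `BatchPopulate`, `insertSql`, `populate` and `batchSize` are hypothetical names chosen for this example. The column names come from the schema above.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Random;

final class BatchPopulate {
    /** Builds the INSERT statement; tableName must come from trusted code, not user input. */
    static String insertSql(String tableName) {
        return "INSERT INTO " + tableName + " (pk, float_value, bigint_value) VALUES (?, ?, ?)";
    }

    /** Inserts `rows` random rows, sending them to the server in batches of `batchSize`. */
    static void populate(Connection conn, String tableName, long rows, int batchSize)
            throws SQLException {
        Random ran = new Random();
        try (PreparedStatement ps = conn.prepareStatement(insertSql(tableName))) {
            for (long i = 0; i < rows; i++) {
                ps.setLong(1, i);
                ps.setDouble(2, ran.nextDouble());
                ps.setLong(3, ran.nextLong());
                ps.addBatch();
                if ((i + 1) % batchSize == 0) {
                    ps.executeBatch();
                }
            }
            if (rows % batchSize != 0) {
                ps.executeBatch(); // flush the final partial batch
            }
        }
    }

    public static void main(String[] args) {
        // No database needed for this part: just show the statement that would be prepared.
        System.out.println(insertSql("test_1000000"));
    }
}
```

How much batching helps depends on the driver and its settings, so treat this as a starting point rather than a measured improvement.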
    

    MySQL direct CLI results:

    select * from test_10000000 order by bigint_value limit 10;
    10 rows in set (2.32 sec)
    

    These timings were somewhat difficult to take, as the only information I had was the time reported after the command executed.

    From the mysql prompt, for 10000000 elements, the query takes roughly 2.1 to 2.4 seconds whether sorting on bigint_value or float_value.
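
Since the timed query asks only for the first 10 rows in sort order, an application that needs the same thing does not have to sort the whole array either. The sketch below keeps the n smallest values in a bounded heap. It is an illustration of the application-side option only, with hypothetical names (`TopN`, `smallest`), and it makes no claim about how MySQL executes the query.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.PriorityQueue;

final class TopN {
    /** Returns the n smallest values in ascending order without sorting the whole input. */
    static long[] smallest(long[] values, int n) {
        if (n <= 0) {
            return new long[0];
        }
        // Max-heap holding the n smallest values seen so far; its head is the largest of them.
        PriorityQueue<Long> heap = new PriorityQueue<>(n, Collections.reverseOrder());
        for (long v : values) {
            if (heap.size() < n) {
                heap.add(v);
            } else if (v < heap.peek()) {
                heap.poll();
                heap.add(v);
            }
        }
        long[] result = new long[heap.size()];
        for (int i = result.length - 1; i >= 0; i--) {
            result[i] = heap.poll(); // largest comes out first, so fill from the back
        }
        return result;
    }

    public static void main(String[] args) {
        long[] data = {42, -7, 13, 8, 99, 0, -7, 5};
        System.out.println(Arrays.toString(smallest(data, 3)));
    }
}
```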

    Java JDBC MySQL call (similar performance to doing the sort from the MySQL CLI)

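    // Runs the sort inside MySQL; only the first 100 rows come back to the client.
    public static void SortDatabaseViaMysql(Connection conn, String tableName) {
        String cmd = "SELECT * FROM " + tableName + " order by float_value limit 100";
        // try-with-resources closes the statement and result set after the query
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(cmd)) {
            // the rows are not read; only the query execution is of interest here
        } catch (Exception e) {
            logger.error("Query failed: " + e);
        }
    }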
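
The code that produced the "da=" lines below is not shown. A minimal sketch of one way to time a single call is given here, assuming a wall-clock measurement around the call is what is wanted. `Timing` and `timeMillis` are hypothetical names, and the sleep merely stands in for the JDBC call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

final class Timing {
    /** Runs the action once and returns the elapsed wall-clock time in milliseconds. */
    static long timeMillis(Callable<?> action) throws Exception {
        long start = System.nanoTime(); // monotonic, unlike currentTimeMillis()
        action.call();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) throws Exception {
        long da = timeMillis(() -> {
            Thread.sleep(20); // stand-in for the JDBC call being measured
            return null;
        });
        System.out.println("da=" + da + " ms");
    }
}
```

With a real connection the lambda would wrap the call to SortDatabaseViaMysql. Timing the query and the row iteration in separate calls is one way to tell execution time apart from transfer time.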
    

    Five runs:

    da=2379 ms
    da=2361 ms
    da=2443 ms
    da=2453 ms
    da=2362 ms
    

    Java sort, generating the random numbers on the fly (this was actually slower than the disk IO read). Assignment time is the time taken to generate the random numbers and populate the array.

    Called like this:

    JavaSort(10,10000000);
    

    Timing results:

    assignment time 331  sort time 1139
    assignment time 324  sort time 1037
    assignment time 317  sort time 1028
    assignment time 319  sort time 1026
    assignment time 317  sort time 1018
    assignment time 325  sort time 1025
    assignment time 317  sort time 1024
    assignment time 318  sort time 1054
    assignment time 317  sort time 1024
    assignment time 317  sort time 1017
    

    These results were for reading a file of doubles in binary mode

    assignment time 4661  sort time 1056
    assignment time 4631  sort time 1024
    assignment time 4733  sort time 1004
    assignment time 4725  sort time 980
    assignment time 4635  sort time 980
    assignment time 4725  sort time 980
    assignment time 4667  sort time 978
    assignment time 4668  sort time 980
    assignment time 4757  sort time 982
    assignment time 4765  sort time 987
    

    Doing a buffer transfer results in much faster runtimes

    assignment time 77  sort time 1192
    assignment time 59  sort time 1125
    assignment time 55  sort time 999
    assignment time 55  sort time 1000
    assignment time 56  sort time 999
    assignment time 54  sort time 1010
    assignment time 55  sort time 999
    assignment time 56  sort time 1000
    assignment time 55  sort time 1002
    assignment time 56  sort time 1002
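
A plausible contributor to the gap between the two sets of assignment times, assuming the files were read as in the Java source further down, is that the stream version puts DataInputStream directly on a FileInputStream, so every readDouble() becomes a small read on the file. The sketch below shows a buffered stream read next to the memory-mapped read and checks that both return the same values. `ReadDoubles` and its method names are hypothetical, and no timings are claimed for it.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

final class ReadDoubles {
    /** Writes the values as big-endian doubles, the format DataOutputStream uses. */
    static void write(Path file, double[] values) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (double v : values) {
                out.writeDouble(v);
            }
        }
    }

    /** Stream read with a 64 KiB buffer between readDouble() and the file. */
    static double[] readBuffered(Path file, int count) throws IOException {
        double[] a = new double[count];
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file), 1 << 16))) {
            for (int i = 0; i < count; i++) {
                a[i] = in.readDouble();
            }
        }
        return a;
    }

    /** Memory-mapped read: one bulk copy from the mapped file into the array. */
    static double[] readMapped(Path file, int count) throws IOException {
        double[] a = new double[count];
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            buf.asDoubleBuffer().get(a); // a ByteBuffer is big-endian by default
        }
        return a;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("doubles", ".bin");
        file.toFile().deleteOnExit();
        double[] data = {3.5, -1.0, 2.25, 0.0};
        write(file, data);
        System.out.println(Arrays.equals(data, readBuffered(file, data.length)));
        System.out.println(Arrays.equals(data, readMapped(file, data.length)));
    }
}
```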
    

    C and C++ Timing results (see below for source)

    Debug profile using qsort

    assignment 0 seconds 110 milliseconds   Time taken 2 seconds 340 milliseconds
    assignment 0 seconds 90 milliseconds    Time taken 2 seconds 340 milliseconds
    assignment 0 seconds 100 milliseconds   Time taken 2 seconds 330 milliseconds
    assignment 0 seconds 100 milliseconds   Time taken 2 seconds 340 milliseconds
    assignment 0 seconds 100 milliseconds   Time taken 2 seconds 330 milliseconds
    assignment 0 seconds 100 milliseconds   Time taken 2 seconds 340 milliseconds
    assignment 0 seconds 90 milliseconds    Time taken 2 seconds 340 milliseconds
    assignment 0 seconds 100 milliseconds   Time taken 2 seconds 330 milliseconds
    assignment 0 seconds 100 milliseconds   Time taken 2 seconds 340 milliseconds
    assignment 0 seconds 100 milliseconds   Time taken 2 seconds 330 milliseconds
    

    Release profile using qsort

    assignment 0 seconds 100 milliseconds   Time taken 1 seconds 600 milliseconds
    assignment 0 seconds 90 milliseconds    Time taken 1 seconds 600 milliseconds
    assignment 0 seconds 90 milliseconds    Time taken 1 seconds 580 milliseconds
    assignment 0 seconds 90 milliseconds    Time taken 1 seconds 590 milliseconds
    assignment 0 seconds 80 milliseconds    Time taken 1 seconds 590 milliseconds
    assignment 0 seconds 90 milliseconds    Time taken 1 seconds 590 milliseconds
    assignment 0 seconds 90 milliseconds    Time taken 1 seconds 600 milliseconds
    assignment 0 seconds 90 milliseconds    Time taken 1 seconds 590 milliseconds
    assignment 0 seconds 90 milliseconds    Time taken 1 seconds 600 milliseconds
    assignment 0 seconds 90 milliseconds    Time taken 1 seconds 580 milliseconds
    

    Release profile Using std::sort( a, a + ARRAY_SIZE );

    assignment 0 seconds 100 milliseconds   Time taken 0 seconds 880 milliseconds
    assignment 0 seconds 90 milliseconds    Time taken 0 seconds 870 milliseconds
    assignment 0 seconds 90 milliseconds    Time taken 0 seconds 890 milliseconds
    assignment 0 seconds 120 milliseconds   Time taken 0 seconds 890 milliseconds
    assignment 0 seconds 90 milliseconds    Time taken 0 seconds 890 milliseconds
    assignment 0 seconds 90 milliseconds    Time taken 0 seconds 880 milliseconds
    assignment 0 seconds 90 milliseconds    Time taken 0 seconds 900 milliseconds
    assignment 0 seconds 90 milliseconds    Time taken 0 seconds 890 milliseconds
    assignment 0 seconds 100 milliseconds   Time taken 0 seconds 890 milliseconds
    assignment 0 seconds 150 milliseconds   Time taken 0 seconds 870 milliseconds
    

    Release profile Reading random data from file and using std::sort( a, a + ARRAY_SIZE )

    assignment 0 seconds 50 milliseconds    Time taken 0 seconds 880 milliseconds
    assignment 0 seconds 40 milliseconds    Time taken 0 seconds 880 milliseconds
    assignment 0 seconds 50 milliseconds    Time taken 0 seconds 880 milliseconds
    assignment 0 seconds 50 milliseconds    Time taken 0 seconds 880 milliseconds
    assignment 0 seconds 40 milliseconds    Time taken 0 seconds 880 milliseconds
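
To compare the series at a glance, the sketch below averages some of the runs listed above. The numbers are copied from this answer and nothing is re-measured. The means come out at roughly 2400 ms for MySQL via JDBC, 1039 ms for Java Arrays.sort on random data, 1592 ms for qsort (release) and 885 ms for std::sort (release). Keep in mind that the MySQL figure is a whole ORDER BY ... LIMIT query including the client round trip, so it is not a pure sort time. `RunAverages` is a hypothetical name.

```java
import java.util.Arrays;

final class RunAverages {
    /** Arithmetic mean of the given timings in milliseconds. */
    static double mean(int... millis) {
        return Arrays.stream(millis).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // Sort or query times in ms, copied from the runs listed in this answer.
        System.out.printf("MySQL via JDBC:            %.1f%n",
                mean(2379, 2361, 2443, 2453, 2362));
        System.out.printf("Java Arrays.sort (random): %.1f%n",
                mean(1139, 1037, 1028, 1026, 1018, 1025, 1024, 1054, 1024, 1017));
        System.out.printf("C qsort, release:          %.1f%n",
                mean(1600, 1600, 1580, 1590, 1590, 1590, 1600, 1590, 1600, 1580));
        System.out.printf("C++ std::sort, release:    %.1f%n",
                mean(880, 870, 890, 890, 890, 880, 900, 890, 890, 870));
    }
}
```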
    

    Below is the source code used. Hopefully minimal bugs :)

    Java source. Note that inside JavaSort, runCode and writeFlag need to be adjusted depending on what you want to time. Also note that the memory allocation happens inside the for loop (thus exercising the GC, but I did not see any appreciable difference when moving the allocation outside the loop).

    public static void JavaSort(int iterations, int numberElements) {
    
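        Random ran = new Random();
        int runCode = 2;           // 0 = generate random values, 1 = stream read, 2 = memory-mapped read
        boolean writeFlag = false; // true = write the array to MyBinaryFile.txt before sorting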
        for (int j = 0; j < iterations; j++) {
            double[] a1 = new double[numberElements];
            long timea = System.currentTimeMillis();
            if (runCode == 0) {
                for (int i = 0; i < numberElements; i++) {
                    a1[i] = ran.nextDouble();
    
                }
            }            
            else if (runCode == 1) {
    
                // Disk IO: read the doubles one at a time from an unbuffered stream
                try (DataInputStream in = new DataInputStream(new FileInputStream("MyBinaryFile.txt"))) {
                    int i = 0;
                    while (i < numberElements) { // read exactly as many values as the array holds
                        a1[i++] = in.readDouble();
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
    
            }
            else if (runCode == 2) {
                // Memory-map the file and bulk-copy its contents into the array
                try (FileInputStream stream = new FileInputStream("MyBinaryFile.txt");
                     FileChannel inChannel = stream.getChannel()) {
                    ByteBuffer buffer = inChannel.map(FileChannel.MapMode.READ_ONLY, 0, inChannel.size());
                    buffer.order(ByteOrder.BIG_ENDIAN); // matches DataOutputStream.writeDouble
                    DoubleBuffer doubleBuffer = buffer.asDoubleBuffer();
                    doubleBuffer.get(a1);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
    
            if (writeFlag) {
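                // try-with-resources flushes and closes the file once all values are written
                try (DataOutputStream out = new DataOutputStream(new FileOutputStream("MyBinaryFile.txt"))) {
                    for (int i = 0; i < numberElements; i++) {
                        out.writeDouble(a1[i]);
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }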
            }
            long timeb = System.currentTimeMillis();
            Arrays.sort(a1);
    
            long timec = System.currentTimeMillis();
            System.out.println("assignment time " + (timeb - timea) + " " + " sort time " + (timec - timeb));
            //delete a1;
        }
    }
    

    C/C++ source

    #include <algorithm>
    #include <cstdio>
    #include <cstdlib>
    #include <ctime>
    #include <fstream>
    
    #define ARRAY_SIZE 10000000
    
    using namespace std;
    
    int compa(const void * elem1, const void * elem2) {
        double f = *((double*) elem1);
        double s = *((double*) elem2);
        if (f > s) return 1;
        if (f < s) return -1;
        return 0;
    }
    
    int compb (const void *a, const void *b) {
       if (*(double **)a < *(double **)b) return -1;
       if (*(double **)a > *(double **)b) return 1;
       return 0;
    }
    
    void timing_testa(int iterations) {
    
        clock_t start = clock(), diffa, diffb;
    
        int msec;
        bool writeFlag = false;
        int runCode = 1;
    
        for (int loopCounter = 0; loopCounter < iterations; loopCounter++) {
            double *a = (double *) malloc(sizeof (double)*ARRAY_SIZE);
            start = clock();
            size_t bytes = sizeof (double)*ARRAY_SIZE;
            if (runCode == 0) {
                for (int i = 0; i < ARRAY_SIZE; i++) {
                    a[i] = rand() / (RAND_MAX + 1.0);
                }
            }
            else if (runCode == 1) {
                ifstream inlezen;
    
                inlezen.open("test", ios::in | ios::binary);
    
    
                inlezen.read(reinterpret_cast<char*>(&a[0]), bytes);
    
            }
            if (writeFlag) {
                ofstream outf;
                const char* pointer = reinterpret_cast<const char*>(&a[0]);
                outf.open("test", ios::out | ios::binary);
                outf.write(pointer, bytes);
                outf.close();
    
            }
    
            diffa = clock() - start;
            msec = diffa * 1000 / CLOCKS_PER_SEC;
            printf("assignment %d seconds %d milliseconds\t", msec / 1000, msec % 1000);
            start = clock();
            //qsort(a, ARRAY_SIZE, sizeof (double), compa);
            std::sort( a, a + ARRAY_SIZE );
            //printf("%f %f %f\n",a[0],a[1000],a[ARRAY_SIZE-1]);
            diffb = clock() - start;
    
            msec = diffb * 1000 / CLOCKS_PER_SEC;
            printf("Time taken %d seconds %d milliseconds\n", msec / 1000, msec % 1000);
            free(a);
        }
    
    
    
    }
    
    /*
     * 
     */
    int main(int argc, char** argv) {
    
        printf("hello world\n");
        double *a = (double *) malloc(sizeof (double)*ARRAY_SIZE);
    
    
        //srand(1);//change seed to fix it
        srand(time(NULL));
    
        timing_testa(5);
    
    
    
        free(a);
        return 0;
    }
    
