Question
I have a requirement where I read a 19 GB text file, containing around 115 million records, placed on a Unix server. My Spring Batch job (launcher) is triggered by Autosys and a shell script once the file lands in the expected location.
Initially, executing this process took around 72 hours to read, process (null checks and date parsing) and write the data into the Oracle database.
After certain configuration changes, such as using a throttle limit and a task executor, I was able to reduce the execution time to 28 hours. I need this process to complete in 4 hours. Using SQL*Loader separately, the same load finishes in 35 minutes, but I have to use Spring Batch only.
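For reference, a simplified sketch of the kind of multi-threaded step I am describing (the reader/processor/writer beans, chunk size and thread count here are illustrative placeholders rather than my exact configuration):

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class FileToOracleStepConfig {

    @Bean
    public Step loadStep(StepBuilderFactory steps,
                         FlatFileItemReader<String> reader,        // reads the 19 GB file line by line
                         ItemProcessor<String, String> processor,  // null checks + date parsing
                         JdbcBatchItemWriter<String> writer) {     // batch inserts into Oracle
        return steps.get("loadStep")
                // commit interval (illustrative value)
                .<String, String>chunk(1000)
                .reader(reader)
                .processor(processor)
                .writer(writer)
                // multi-threaded step: several chunks are processed concurrently
                .taskExecutor(new SimpleAsyncTaskExecutor("load-"))
                // cap on the number of concurrent chunk threads
                .throttleLimit(8)
                .build();
    }
}
```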
Can anyone tell me whether it is possible to get this done in less than 4 hours using Spring Batch, and what the best way to achieve that would be?
Answer 1:
In a project I worked on, we had to transfer 5 billion records from DB2 to Oracle, with quite complex transformation logic. During the transformation, the data was saved about 4 times in different files. We were able to insert data at about 50,000 records per second into an Oracle DB. From that point of view, doing it in under 4 hours seems realistic.
You didn't state where exactly your bottlenecks are, but here are some ideas.
- Parallelisation - can you split the file into chunks that could be processed in parallel, for instance by several instances of your job? (A rough sketch follows after this list.)
- Chunk size - we used a chunk size of 5,000 to 10,000 when writing to Oracle.
- Removing unnecessary data parsing, especially date/timestamp parsing. For instance, we had a lot of timestamps in our data, but they were not relevant for the processing logic. Since we had to read and write them from/to a file a couple of times during processing, we didn't parse them; we just kept the string representation. Moreover, a lot of these timestamps had special values, like 1.1.0001 00:00:00.000000 or 31.12.9999 23.59.59.000000, so we used LD or HD (for low date and high date) to represent them.
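To make the first two points concrete, here is a rough sketch of how a partitioned load could be wired up in Spring Batch (the partitioner, line ranges, file path, bean names and the Oracle writer bean below are illustrative assumptions, not code from either project): a master step splits the file into line ranges, and each worker step reads only its own range and writes to Oracle in chunks.

```java
import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.batch.item.file.mapping.PassThroughLineMapper;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class PartitionedLoadConfig {

    private static final long TOTAL_LINES = 115_000_000L; // approximate record count from the question
    private static final int PARTITIONS = 10;             // illustrative

    // Master-side partitioner: one line range per worker execution.
    @Bean
    public Partitioner lineRangePartitioner() {
        return gridSize -> {
            Map<String, ExecutionContext> result = new HashMap<>();
            long linesPerPartition = TOTAL_LINES / PARTITIONS;
            for (int i = 0; i < PARTITIONS; i++) {
                ExecutionContext ctx = new ExecutionContext();
                ctx.putLong("startLine", i * linesPerPartition);
                ctx.putLong("endLine", i == PARTITIONS - 1 ? TOTAL_LINES : (i + 1) * linesPerPartition);
                result.put("partition" + i, ctx);
            }
            return result;
        };
    }

    // Worker-side reader: positioned on its own line range via the step execution context.
    @Bean
    @StepScope
    public FlatFileItemReader<String> workerReader(
            @Value("#{stepExecutionContext['startLine']}") Long startLine,
            @Value("#{stepExecutionContext['endLine']}") Long endLine) {
        FlatFileItemReader<String> reader = new FlatFileItemReaderBuilder<String>()
                .name("workerReader")
                .resource(new FileSystemResource("/data/input/bigfile.txt")) // placeholder path
                .lineMapper(new PassThroughLineMapper())  // a real job would map each line to a domain object
                .build();
        reader.setCurrentItemCount(startLine.intValue()); // skip lines belonging to earlier partitions
        reader.setMaxItemCount(endLine.intValue());       // stop at the end of this partition's range
        return reader;
    }

    // Worker step: chunk size in the 5,000-10,000 range mentioned above.
    @Bean
    public Step workerStep(StepBuilderFactory steps,
                           FlatFileItemReader<String> workerReader,
                           JdbcBatchItemWriter<String> oracleWriter) { // writer bean assumed to be defined elsewhere
        return steps.get("workerStep")
                .<String, String>chunk(5000)
                .reader(workerReader)
                .writer(oracleWriter)
                .build();
    }

    // Master step: runs the workers in parallel, one thread per partition.
    @Bean
    public Step masterStep(StepBuilderFactory steps, Partitioner lineRangePartitioner, Step workerStep) {
        return steps.get("masterStep")
                .partitioner("workerStep", lineRangePartitioner)
                .step(workerStep)
                .gridSize(PARTITIONS)
                .taskExecutor(new SimpleAsyncTaskExecutor("partition-"))
                .build();
    }
}
```

Splitting by line count keeps the partitioner trivial, but each worker still has to read and discard the lines before its range; physically splitting the file first (for instance with the Unix split command) into one file per partition avoids that cost.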
HTH.
Source: https://stackoverflow.com/questions/32717610/performance-optimization-for-processing-of-115-million-records-for-inserting-int