Java ETL process

十年热恋 submitted on 2019-12-04 21:53:18

100M rows is quite a lot. You can design this in plenty of ways: REST servers, JDBC reads, Spring Batch, Spring Integration, Hibernate, a dedicated ETL tool. But the bottom line is time.

Whatever architecture you choose, you eventually have to perform the INSERTs into MySQL. Your mileage may vary, but to give you an order of magnitude: at 2K inserts per second it will take roughly 14 hours (more than half a day) to populate MySQL with 100M rows (source).

According to the same source, LOAD DATA INFILE can handle around 25K inserts per second (more than 10x faster, so a little over an hour of work).
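The two estimates above are plain division, which is easy to redo for your own row count and measured throughput. A minimal sketch follows; the class and method names are made up for illustration, and the rates are the ones quoted above, not measurements of your system.

```java
import java.time.Duration;

class LoadEstimate {
    /** Time to insert `rows` rows at a sustained rate of `rowsPerSecond` (rounded up to whole seconds). */
    static Duration estimate(long rows, long rowsPerSecond) {
        long seconds = (rows + rowsPerSecond - 1) / rowsPerSecond; // ceiling division
        return Duration.ofSeconds(seconds);
    }

    public static void main(String[] args) {
        long rows = 100_000_000L;
        System.out.println(estimate(rows, 2_000));   // row-by-row INSERTs: PT13H53M20S
        System.out.println(estimate(rows, 25_000));  // LOAD DATA INFILE:   PT1H6M40S
    }
}
```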

That being said, with this much data I would suggest:

  • dump the Oracle table using native Oracle database tools that produce human-readable output (or machine-readable output, as long as you are able to parse it)

  • parse the dump file using the fastest tools you can. Maybe grep/sed/gawk/cut will be enough?

  • generate a target file compatible with MySQL LOAD DATA INFILE (it is very configurable)

  • import the file into MySQL using the aforementioned command
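If the parsing step does end up in Java rather than in sed or gawk, the "generate target file" step could look roughly like the sketch below. It assumes the default LOAD DATA INFILE format (tab-separated fields, backslash as the escape character, \N for NULL); the class name, the file path and the table and column names in the comment are placeholders, not something the dump tools dictate.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class MySqlTsv {
    /** Escapes one field for MySQL's default LOAD DATA format. */
    static String escape(String value) {
        if (value == null) {
            return "\\N"; // MySQL reads \N as NULL
        }
        return value.replace("\\", "\\\\") // backslash first, so later escapes are not doubled
                    .replace("\t", "\\t")
                    .replace("\n", "\\n")
                    .replace("\r", "\\r");
    }

    /** Joins the escaped fields of one row with tabs (no trailing newline). */
    static String toLine(List<String> fields) {
        return fields.stream().map(MySqlTsv::escape).collect(Collectors.joining("\t"));
    }

    public static void main(String[] args) {
        // One row with an embedded tab and a NULL column.
        System.out.println(toLine(Arrays.asList("1", "Ann\tLee", null)));
        // The resulting file could then be loaded with something like (placeholder path and names):
        //   LOAD DATA INFILE '/tmp/dest_table.tsv' INTO TABLE Dest_Table (ID, Name);
    }
}
```

Writing one such line per row through a BufferedWriter keeps memory use flat regardless of the table size.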

Of course you can do this in Java with nice and readable code, unit tested and versioned. But with this amount of data you need to be pragmatic.

That covers the initial load. After that, Spring Batch will probably be a good choice. If you can, connect your application directly to both databases; again, this will be faster. On the other hand, this might not be possible for security reasons.
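Where a direct connection to both databases is allowed, the copy itself can be sketched in plain JDBC with batched inserts. This is only an illustration, not what Spring Batch does internally: the class, method and parameter names are invented, the table and column names are assumed to be trusted identifiers (they are concatenated into the SQL), and the batch size is something you would tune.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Collections;
import java.util.List;

class DirectCopy {
    /** Builds "INSERT INTO t(a, b) VALUES (?, ?)" for the given columns. */
    static String insertSql(String table, List<String> columns) {
        String cols = String.join(", ", columns);
        String marks = String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "INSERT INTO " + table + "(" + cols + ") VALUES (" + marks + ")";
    }

    /** Streams rows from `in` to `out`, committing once per batch; returns the number of rows copied. */
    static long copy(Connection in, Connection out, String srcTable, String destTable,
                     List<String> columns, int batchSize) throws SQLException {
        String select = "SELECT " + String.join(", ", columns) + " FROM " + srcTable;
        long copied = 0;
        out.setAutoCommit(false); // commit per batch instead of per row
        try (Statement st = in.createStatement();
             PreparedStatement ps = out.prepareStatement(insertSql(destTable, columns))) {
            st.setFetchSize(batchSize); // hint to the source driver to fetch rows in chunks
            try (ResultSet rs = st.executeQuery(select)) {
                while (rs.next()) {
                    for (int i = 1; i <= columns.size(); i++) {
                        ps.setObject(i, rs.getObject(i));
                    }
                    ps.addBatch();
                    if (++copied % batchSize == 0) {
                        ps.executeBatch();
                        out.commit();
                    }
                }
                ps.executeBatch(); // flush the last, possibly partial, batch
                out.commit();
            }
        }
        return copied;
    }
}
```

With MySQL Connector/J, adding rewriteBatchedStatements=true to the JDBC URL usually makes executeBatch noticeably faster, though it is worth measuring on your own data.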

If you want to stay flexible and not tie yourself to the databases directly, expose both the input (Oracle) and the output (MySQL) behind web services (REST is fine as well). Spring Integration will help you a lot.

You can use Scriptella to transfer data between databases. Here is an example of an XML transformation file:

<!DOCTYPE etl SYSTEM "http://scriptella.javaforge.com/dtd/etl.dtd">
<etl>
    <connection id="in" url="jdbc:oracle:thin:@localhost:1521:ORCL"
                classpath="ojdbc14.jar" user="scott" password="tiger"/>

    <connection id="out" url="jdbc:mysql://localhost:3306/fromdb"
                classpath="mysql-connector.jar" user="user" password="password"/>

    <!-- Copy all table rows from one database to the other -->
    <query connection-id="in">
        SELECT * FROM Src_Table
        <!-- Executes the insert once for each row returned by the query -->
        <script connection-id="out">
            INSERT INTO Dest_Table(ID, Name) VALUES (?id, ?name)
        </script>
    </query>
</etl>