Batch processing files and diff with database

Submitted on 2019-12-11 01:25:38

Question


Currently I am developing a Spring Boot application that periodically processes a file of user data, where every line contains a userId and a departamentId separated by |, for example 123534|13. The file will contain a few million records.
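For reference, each line can be split on the pipe character; a minimal parsing sketch (the class and method names are my own, not from the question):

```java
public class UserLineParser {

    /** Parses a "userId|departamentId" line into a two-element array. */
    static long[] parse(String line) {
        // split on the literal pipe character (| is a regex metacharacter, so escape it)
        String[] parts = line.split("\\|");
        return new long[] { Long.parseLong(parts[0].trim()),
                            Long.parseLong(parts[1].trim()) };
    }

    public static void main(String[] args) {
        long[] parsed = parse("123534|13");
        System.out.println(parsed[0] + "|" + parsed[1]); // prints "123534|13"
    }
}
```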

My requirement is to load this data into a MySQL database in such a manner that:

  • If a user with the processed ID exists, do nothing
  • If the user does not exist, create a new user
  • If a user is not in the file but is present in the database, remove them
  • If a department with the current ID is not present in the database, create it
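One way to avoid a per-record lookup is to load the existing IDs once and compute the insert/delete sets up front; a rough in-memory sketch (all names are my own, and it assumes a few million Long IDs fit comfortably in a HashSet):

```java
import java.util.HashSet;
import java.util.Set;

public class UserDiff {

    Set<Long> toCreate = new HashSet<>();
    Set<Long> toDelete = new HashSet<>();

    /** Computes which user IDs must be inserted into or removed from the database. */
    static UserDiff diff(Set<Long> fileIds, Set<Long> dbIds) {
        UserDiff d = new UserDiff();
        d.toCreate.addAll(fileIds);
        d.toCreate.removeAll(dbIds);   // in the file but not in the DB -> create
        d.toDelete.addAll(dbIds);
        d.toDelete.removeAll(fileIds); // in the DB but not in the file -> delete
        return d;
    }

    public static void main(String[] args) {
        UserDiff d = diff(Set.of(1L, 2L, 3L), Set.of(2L, 3L, 4L));
        System.out.println(d.toCreate); // [1]
        System.out.println(d.toDelete); // [4]
    }
}
```

Users present in both sets fall out of both collections, which covers the "do nothing" case for free.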

I have made some optimizations, such as:

  • Caching Departaments to fill the entity
  • Collecting Users in batches and saving them through the JpaRepository saveAll method

But I am still making too many calls to the database: for every single record, I check whether the user is present in order to decide whether to create an entity to save...

My Entities are rather simple:

@Entity
@Table(name = "departaments")
public class Departament {

    @Id
    @Column(name = "id")
    private Long id;

    @Column(name = "name")
    private String name;

    // getters and setters omitted
}
and:

@Entity
@Table(name = "users")
public class User {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    @Column(name = "id")
    private Long id;

    @ManyToOne
    @JoinColumn(name = "departament_id")
    private Departament departament;

    // getters and setters omitted
}
Has anyone come across a problem like this?

Could it be optimized further?

Is there a good pattern for this kind of processing?


Answer 1:


A couple of things here:

  1. It seems your primary source of truth for users is the CSV file. Why not simply truncate and recreate the USER table? You may run into some problems (I understand that referential integrity is not one of them in your scenario, or is it?), but you would get user deletion for free (to be honest, I can't quite picture how you handle user deletion in the current setup). It would run even faster with key checks disabled.
  2. Are you actually seeing performance improvements when using saveAll? It does not reduce the number of SELECT statements executed.
  3. Are you sure you are operating at the right level of abstraction? Perhaps you could use plain JDBC instead of JPA. With JPA there is a lot of caching and mapping involved, which causes significant overhead. With JDBC you could leverage MySQL's INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE statements to get what you want.
  4. Whichever of the above you go for, you could also try Spring Batch for more declarative processing.
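To illustrate point 3, a JDBC upsert with statement batching could look roughly like this. This is a sketch, not a drop-in implementation: the table and column names are assumptions based on the entities in the question, and it presumes the file's user IDs are usable as primary keys.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class UserUpsertWriter {

    // MySQL-specific upsert: insert the row, or update the department on a key collision
    static final String UPSERT_SQL =
        "INSERT INTO users (id, departament_id) VALUES (?, ?) "
        + "ON DUPLICATE KEY UPDATE departament_id = VALUES(departament_id)";

    /** Writes one batch of (userId, departamentId) pairs in a single round trip. */
    static void writeBatch(Connection conn, List<long[]> rows) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(UPSERT_SQL)) {
            for (long[] row : rows) {
                ps.setLong(1, row[0]);
                ps.setLong(2, row[1]);
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }
}
```

With MySQL Connector/J, adding rewriteBatchedStatements=true to the JDBC URL lets the driver rewrite such a batch into a multi-row INSERT, which is usually where the big speedup comes from.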



Answer 2:


If it is a full "replacement", do this to avoid any downtime:

CREATE TABLE new LIKE real;
LOAD DATA INFILE ... (and any other massaging)
RENAME TABLE real TO old, new TO real;
DROP TABLE old;

If it is a "delta", then LOAD it into a separate table and execute suitable SQL statements to perform the updates: roughly one SQL statement per bullet item in your question. No loops.
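For illustration, those per-bullet statements might look roughly like this. The staging table name, file path, and column names are assumptions, not part of the original answer, and it presumes the file's user IDs serve as the primary keys:

```sql
-- load the file into a staging table
LOAD DATA INFILE '/path/to/users.txt'
INTO TABLE staging
FIELDS TERMINATED BY '|' (user_id, departament_id);

-- create departments that are in the file but not yet in the database
INSERT IGNORE INTO departaments (id)
SELECT DISTINCT departament_id FROM staging;

-- create users that are in the file but not yet in the database
-- (INSERT IGNORE also covers "if the user exists, do nothing")
INSERT IGNORE INTO users (id, departament_id)
SELECT user_id, departament_id FROM staging;

-- remove users that are in the database but not in the file
DELETE u FROM users u
LEFT JOIN staging s ON u.id = s.user_id
WHERE s.user_id IS NULL;
```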




Answer 3:


Just try doing it the simple way using Spring Batch.

context.xml

<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="
        http://www.springframework.org/schema/beans 
        http://www.springframework.org/schema/beans/spring-beans-3.2.xsd">

    <!-- stored job-meta in database -->
    <bean id="jobRepository"
        class="org.springframework.batch.core.repository.support.JobRepositoryFactoryBean">
        <property name="dataSource" ref="dataSource" />
        <property name="transactionManager" ref="transactionManager" />
        <property name="databaseType" value="mysql" />
    </bean>

    <bean id="jobLauncher"
        class="org.springframework.batch.core.launch.support.SimpleJobLauncher">
        <property name="jobRepository" ref="jobRepository" />
    </bean>

</beans>

database.xml

<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:jdbc="http://www.springframework.org/schema/jdbc" 
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.springframework.org/schema/beans 
        http://www.springframework.org/schema/beans/spring-beans-3.2.xsd
        http://www.springframework.org/schema/jdbc 
        http://www.springframework.org/schema/jdbc/spring-jdbc-3.2.xsd">

    <!-- connect to database -->
    <bean id="dataSource"
        class="org.springframework.jdbc.datasource.DriverManagerDataSource">
        <property name="driverClassName" value="com.mysql.jdbc.Driver" />
        <property name="url" value="jdbc:mysql://localhost:3306/test" />
        <property name="username" value="root" />
        <property name="password" value="User@1234" />
    </bean>

    <bean id="transactionManager"
        class="org.springframework.batch.support.transaction.ResourcelessTransactionManager" />

    <!-- create job-meta tables automatically -->
    <jdbc:initialize-database data-source="dataSource">
        <jdbc:script location="org/springframework/batch/core/schema-drop-mysql.sql" />
        <jdbc:script location="org/springframework/batch/core/schema-mysql.sql" />
    </jdbc:initialize-database>

</beans>

jobReport.xml

<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:batch="http://www.springframework.org/schema/batch" 
    xmlns:task="http://www.springframework.org/schema/task"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.springframework.org/schema/batch
        http://www.springframework.org/schema/batch/spring-batch-2.2.xsd
        http://www.springframework.org/schema/beans 
        http://www.springframework.org/schema/beans/spring-beans-3.2.xsd">

    <bean id="report" class="com.om.model.Report" scope="prototype" />

    <batch:job id="reportJob">
        <batch:step id="step1">
            <batch:tasklet>
                <batch:chunk reader="cvsFileItemReader" writer="mysqlItemWriter"
                    commit-interval="2">
                </batch:chunk>
            </batch:tasklet>
        </batch:step>
    </batch:job>

    <bean id="cvsFileItemReader" class="org.springframework.batch.item.file.FlatFileItemReader">

        <!-- Read a csv file -->
        <property name="resource" value="classpath:cvs/report.csv" />

        <property name="lineMapper">
            <bean class="org.springframework.batch.item.file.mapping.DefaultLineMapper">

                <!-- split it -->
                <property name="lineTokenizer">
                    <bean
                        class="org.springframework.batch.item.file.transform.DelimitedLineTokenizer">
                        <property name="names" value="userId,departmentId" />
                    </bean>
                </property>

                <property name="fieldSetMapper">

                    <!-- return back to reader, rather than a mapped object. -->
                    <!--
                        <bean class="org.springframework.batch.item.file.mapping.PassThroughFieldSetMapper" />
                    -->

                    <!-- map to an object -->
                    <bean
                        class="org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper">
                        <property name="prototypeBeanName" value="report" />
                    </bean>

                </property>

            </bean>
        </property>

    </bean>

    <bean id="mysqlItemWriter"
        class="org.springframework.batch.item.database.JdbcBatchItemWriter">
        <property name="dataSource" ref="dataSource" />
        <property name="sql">
            <value>
            <![CDATA[        
                insert into RAW_REPORT(userId,departmentId) values (:userId, :departmentId)
            ]]>
            </value>
        </property>
        <!-- Takes care of matching object properties to SQL named parameters -->
        <property name="itemSqlParameterSourceProvider">
            <bean
                class="org.springframework.batch.item.database.BeanPropertyItemSqlParameterSourceProvider" />
        </property>
    </bean>

</beans>

App.java is the main class that launches the job:

package com.om;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class App {
    public static void main(String[] args) throws IllegalStateException {

        String[] springConfig  = 
            {   "spring/batch/config/database.xml", 
                "spring/batch/config/context.xml",
                "spring/batch/jobs/job-report.xml" 
            };

        ApplicationContext context = 
                new ClassPathXmlApplicationContext(springConfig);

        JobLauncher jobLauncher = (JobLauncher) context.getBean("jobLauncher");
        Job job = (Job) context.getBean("reportJob");

        try {

            JobExecution execution = jobLauncher.run(job, new JobParameters());
            System.out.println("Exit Status : " + execution.getStatus());

        } catch (Exception e) {
            e.printStackTrace();
        }

        System.out.println("Done");

    }
}

Report.java is your POJO:

package com.om.model;

public class Report {

    private String userId;
    private String departmentId;

    public String getUserId() {
        return userId;
    }

    public void setUserId(String userId) {
        this.userId = userId;
    }

    public String getDepartmentId() {
        return departmentId;
    }

    public void setDepartmentId(String departmentId) {
        this.departmentId = departmentId;
    }
}

Now put your report.csv, containing the millions of userId and departmentId pairs, into the resources folder.

You can then check your database tables to see how the job executed, along with the rows it inserted. Please ask if you need any help.



Source: https://stackoverflow.com/questions/49981821/batch-processing-files-and-diff-with-database
