Question
Currently I am developing a Spring Boot application that periodically tries to process a file containing user data, where every line contains a userId and a departamentId separated by |, for example 123534|13. The file will contain a few million records.
My requirement is to load this data into a MySQL database in such a manner that:
- If the user with the processed ID exists, do not do anything
- If the user does not exist, create a new user
- If a user is not in the file but is present in the database, remove it
- If the department with the current ID is not present in the database, create it
I have made some optimizations, such as:
- Caching Departaments to fill the entity association
- Collecting Users into batches and saving them through the JpaRepository saveAll method

But I am still making too many calls to the database: for every record, I check whether the user is already present in order to build the entity to save (sketched below)...
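Simplified, the per-record flow looks roughly like this (a sketch; the repository and helper names are illustrative, not the actual code):

// One SELECT (existsById) per input line is what drives the call count up.
List<User> toSave = new ArrayList<>();
for (String line : lines) {
    String[] parts = line.split("\\|");                // userId|departamentId
    Long userId = Long.valueOf(parts[0]);
    Long departamentId = Long.valueOf(parts[1]);
    if (!userRepository.existsById(userId)) {          // SELECT per record
        User user = new User();
        user.setDepartament(departamentCache.get(departamentId)); // cached lookup
        toSave.add(user);
    }
}
userRepository.saveAll(toSave);                        // batched saves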
My entities are rather simple:
import javax.persistence.*;

@Entity
@Table(name = "departaments")
public class Departament {

    @Id
    @Column(name = "id")
    private Long id;

    @Column(name = "name")
    private String name;

    // getters and setters omitted
}
and:
@Entity
@Table(name = "users")
public class User {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    @Column(name = "id")
    private Long id;

    @ManyToOne
    @JoinColumn(name = "departament_id")
    private Departament departament;

    // getters and setters omitted
}
Has anyone come across a problem like this?
Could it be optimized more?
Is there any good pattern for this kind of processing?
Answer 1:
A couple of things here:
- It would seem your primary source of truth for users is the CSV file. Why not simply truncate and recreate the USER table? You may run into some problems (I understand that referential integrity is not one of them in your scenario - or is it?), but you'll get user deletion for free (TBH I can't quite picture how you handle user deletion in the current setup). It would run even faster with key checks disabled.
- Are you actually seeing performance improvements when using saveAll? It does not limit the number of SELECT statements to execute.
- Are you sure you're operating on the right level of abstraction? Perhaps you could use plain JDBC instead of JPA. With JPA there will be a lot of caching/mapping involved, causing significant overhead. With JDBC you could leverage MySQL's INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE statements to get what you want (see the sketch after this list).
- If you go for any of the above, you could try using Spring Batch for more declarative processing.
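Here is a minimal sketch of the plain-JDBC idea from the third bullet, assuming the file's userId maps directly onto users.id and that a DataSource is already configured (both are assumptions, not something the question states):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;
import javax.sql.DataSource;

public class UserUpsertLoader {

    // Batched INSERT IGNORE over plain JDBC: MySQL silently skips rows whose
    // primary key already exists, so no SELECT-per-record is needed.
    public void load(DataSource dataSource, List<String> lines) throws Exception {
        String sql = "INSERT IGNORE INTO users (id, departament_id) VALUES (?, ?)";
        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement(sql)) {
            con.setAutoCommit(false);
            int count = 0;
            for (String line : lines) {
                String[] parts = line.split("\\|");   // file format: userId|departamentId
                ps.setLong(1, Long.parseLong(parts[0]));
                ps.setLong(2, Long.parseLong(parts[1]));
                ps.addBatch();
                if (++count % 1000 == 0) {            // flush in chunks of 1000
                    ps.executeBatch();
                    con.commit();
                }
            }
            ps.executeBatch();                        // flush the remaining tail
            con.commit();
        }
    }
}

For the batching to actually happen on the wire, the MySQL JDBC URL should include rewriteBatchedStatements=true; otherwise Connector/J sends the inserts one by one.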
Answer 2:
If it is a "replacement", do this to avoid any downtime:
CREATE TABLE new LIKE real;                -- clone the structure of the live table
LOAD DATA INFILE ... (and any other massaging)
RENAME TABLE real TO old, new TO real;     -- atomic swap
DROP TABLE old;
If it is a "delta", then LOAD it into a separate table and execute suitable SQL statements to perform the updates. It would be approximately one SQL statement per bullet item in your question. No loops.
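A rough sketch of that delta approach, assuming a staging table named users_stage, that the file's userId maps directly onto users.id, and that departaments.name can be filled in later (all names and mappings here are assumptions):

-- load the file into a staging table
CREATE TABLE users_stage (user_id BIGINT PRIMARY KEY, departament_id BIGINT);
LOAD DATA INFILE '/path/to/report.csv'
  INTO TABLE users_stage
  FIELDS TERMINATED BY '|'
  (user_id, departament_id);

-- create departaments that are referenced but missing (name left NULL here)
INSERT IGNORE INTO departaments (id)
SELECT DISTINCT departament_id FROM users_stage;

-- create users that are in the file but not yet in the database;
-- existing users are skipped, which also covers the "do nothing" case
INSERT IGNORE INTO users (id, departament_id)
SELECT user_id, departament_id FROM users_stage;

-- remove users that are in the database but not in the file
DELETE u FROM users u
LEFT JOIN users_stage s ON s.user_id = u.id
WHERE s.user_id IS NULL;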
Answer 3:
Just try doing it simply using Spring Batch.
context.xml
<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="
        http://www.springframework.org/schema/beans
        http://www.springframework.org/schema/beans/spring-beans-3.2.xsd">

    <!-- stored job-meta in database -->
    <bean id="jobRepository"
        class="org.springframework.batch.core.repository.support.JobRepositoryFactoryBean">
        <property name="dataSource" ref="dataSource" />
        <property name="transactionManager" ref="transactionManager" />
        <property name="databaseType" value="mysql" />
    </bean>

    <bean id="jobLauncher"
        class="org.springframework.batch.core.launch.support.SimpleJobLauncher">
        <property name="jobRepository" ref="jobRepository" />
    </bean>

</beans>
database.xml
<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:jdbc="http://www.springframework.org/schema/jdbc"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.springframework.org/schema/beans
        http://www.springframework.org/schema/beans/spring-beans-3.2.xsd
        http://www.springframework.org/schema/jdbc
        http://www.springframework.org/schema/jdbc/spring-jdbc-3.2.xsd">

    <!-- connect to database -->
    <bean id="dataSource"
        class="org.springframework.jdbc.datasource.DriverManagerDataSource">
        <property name="driverClassName" value="com.mysql.jdbc.Driver" />
        <property name="url" value="jdbc:mysql://localhost:3306/test" />
        <property name="username" value="root" />
        <property name="password" value="User@1234" />
    </bean>

    <bean id="transactionManager"
        class="org.springframework.batch.support.transaction.ResourcelessTransactionManager" />

    <!-- create job-meta tables automatically -->
    <jdbc:initialize-database data-source="dataSource">
        <jdbc:script location="org/springframework/batch/core/schema-drop-mysql.sql" />
        <jdbc:script location="org/springframework/batch/core/schema-mysql.sql" />
    </jdbc:initialize-database>

</beans>
job-report.xml
<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:batch="http://www.springframework.org/schema/batch"
    xmlns:task="http://www.springframework.org/schema/task"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.springframework.org/schema/batch
        http://www.springframework.org/schema/batch/spring-batch-2.2.xsd
        http://www.springframework.org/schema/beans
        http://www.springframework.org/schema/beans/spring-beans-3.2.xsd">

    <bean id="report" class="com.om.model.Report" scope="prototype" />

    <batch:job id="reportJob">
        <batch:step id="step1">
            <batch:tasklet>
                <batch:chunk reader="cvsFileItemReader" writer="mysqlItemWriter"
                    commit-interval="2">
                </batch:chunk>
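                <!-- NOTE: commit-interval="2" is illustrative; for a file with
                     millions of rows you would use a much larger chunk size -->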
            </batch:tasklet>
        </batch:step>
    </batch:job>

    <bean id="cvsFileItemReader" class="org.springframework.batch.item.file.FlatFileItemReader">
        <!-- Read a csv file -->
        <property name="resource" value="classpath:cvs/report.csv" />
        <property name="lineMapper">
            <bean class="org.springframework.batch.item.file.mapping.DefaultLineMapper">
                <!-- split it -->
                <property name="lineTokenizer">
                    <bean
                        class="org.springframework.batch.item.file.transform.DelimitedLineTokenizer">
                        <!-- the question's file is pipe-delimited, so override
                             the tokenizer's default comma delimiter -->
                        <property name="delimiter" value="|" />
                        <property name="names" value="userId,departmentId" />
                    </bean>
                </property>
                <property name="fieldSetMapper">
                    <!-- return back to reader, rather than a mapped object. -->
                    <!--
                    <bean class="org.springframework.batch.item.file.mapping.PassThroughFieldSetMapper" />
                    -->
                    <!-- map to an object -->
                    <bean
                        class="org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper">
                        <property name="prototypeBeanName" value="report" />
                    </bean>
                </property>
            </bean>
        </property>
    </bean>
<bean id="mysqlItemWriter"
class="org.springframework.batch.item.database.JdbcBatchItemWriter">
<property name="dataSource" ref="dataSource" />
<property name="sql">
<value>
<![CDATA[
insert into RAW_REPORT(userId,departmentId) values (:userId, :departmentId)
]]>
</value>
</property>
<!-- It will take care matching between object property and sql name parameter -->
<property name="itemSqlParameterSourceProvider">
<bean
class="org.springframework.batch.item.database.BeanPropertyItemSqlParameterSourceProvider" />
</property>
</bean>
</beans>
App.java is the main class for your job execution:
package com.om;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class App {

    public static void main(String[] args) throws IllegalStateException {

        String[] springConfig = {
            "spring/batch/config/database.xml",
            "spring/batch/config/context.xml",
            "spring/batch/jobs/job-report.xml"
        };

        ApplicationContext context =
            new ClassPathXmlApplicationContext(springConfig);

        JobLauncher jobLauncher = (JobLauncher) context.getBean("jobLauncher");
        Job job = (Job) context.getBean("reportJob");

        try {
            JobExecution execution = jobLauncher.run(job, new JobParameters());
            System.out.println("Exit Status : " + execution.getStatus());
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.out.println("Done");
    }
}
Report.java is your POJO:
package com.om.model;

public class Report {

    private String userId;
    private String departmentId;

    public String getUserId() {
        return userId;
    }

    public void setUserId(String userId) {
        this.userId = userId;
    }

    public String getDepartmentId() {
        return departmentId;
    }

    public void setDepartmentId(String departmentId) {
        this.departmentId = departmentId;
    }
}
Now you need to put your report.csv, containing the millions of userId and departmentId records, into the resource folder (the reader above expects it at classpath:cvs/report.csv).
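For example, the first lines of report.csv would look like this (the first row is the example value from the question; the rest are made-up placeholders):

123534|13
123535|13
982231|7

The reader above is configured with the pipe delimiter to match this format.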
You can then simply check your database tables to see how the job executed, with the entries inserted automatically. Please ask for any help required.
Source: https://stackoverflow.com/questions/49981821/batch-processing-files-and-diff-with-database