Question
I am using Spring Boot 1.5.7 with Spring Data JPA and Spring Batch. I use a JpaPagingItemReader<T> to read entities and a JpaItemWriter<T> to write them. My goal is to read data from a certain database table, convert it to a different format, and write it back to different tables (I read raw JSON strings, deserialize them, and insert them into their specific tables).
I don't plan to delete the data I read after processing it; instead I just want to mark it as processed. The question is: will JpaPagingItemReader handle reads correctly if I make the query something like this?
@Bean
public ItemReader<RdJsonStore> reader() {
    JpaPagingItemReader<RdJsonStore> reader = new JpaPagingItemReader<>();
    reader.setEntityManagerFactory(entityManagerFactory);
    reader.setQueryString("select e from RdJsonStore e " +
            "where e.jsonStoreProcessedPointer is null");
    reader.setPageSize(rawDataProperties.getBatchProcessingSize());
    return reader;
}
So it would read a row only if there is no pointer to it. I would insert a pointer after processing an entry (in batches: e.g. I process 1000 entries and post all their ids to the pointer table).
Can an ItemReader (and the JPA one in particular) handle the data correctly if I change the returned data on the fly like this (the set of entries the query matches shrinks with every batch)?
If the pointer solution is not applicable, how should I design the DB-to-DB batch job?
My source table looks like this:
Answer 1:
If you look at the code of JpaPagingItemReader, in the method doReadPage(), you will notice this line:
Query query = createQuery().setFirstResult(getPage() * getPageSize()).setMaxResults(getPageSize());
where createQuery() is:
private Query createQuery() {
    if (queryProvider == null) {
        return entityManager.createQuery(queryString);
    }
    else {
        return queryProvider.createQuery();
    }
}
So you can see that the query is created and executed afresh for each page, but the page number is not recalculated against the new, smaller data set (and such a recalculation wouldn't make sense either).
getPageSize() always returns the value set in the configuration, and getPage() returns the last calculated page number (previously processed page + 1). So if the data set is shrinking, your program would only work correctly if the page number were also reset afresh, i.e. if every read started from page = 0. That doesn't happen with JpaPagingItemReader, so you will lose data, as M Deinum pointed out in the comments.
Also, as I understand it, the addition of new data will work OK (provided new records are added at the end according to the sort keys, even though the data is usually assumed to be locked for the duration of the job run).
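To make the failure mode concrete, here is a small in-memory simulation (not Spring Batch code; the class and method names are made up for illustration). Each "page" read removes its rows from the candidate set, mimicking a WHERE clause whose result set shrinks as rows get marked, while the offset page * pageSize keeps advancing:

```java
import java.util.ArrayList;
import java.util.List;

public class ShrinkingPageDemo {

    // Simulates JpaPagingItemReader's paging over a query whose result set
    // shrinks as the writer marks rows processed.
    static List<Integer> readAllPages(int totalRows, int pageSize) {
        List<Integer> unprocessed = new ArrayList<>();
        for (int i = 0; i < totalRows; i++) {
            unprocessed.add(i);
        }

        List<Integer> seen = new ArrayList<>();
        int page = 0;
        while (true) {
            // Equivalent of setFirstResult(page * pageSize).setMaxResults(pageSize)
            int first = page * pageSize;
            if (first >= unprocessed.size()) {
                break; // reader sees an empty page and stops
            }
            int last = Math.min(first + pageSize, unprocessed.size());
            List<Integer> pageRows = new ArrayList<>(unprocessed.subList(first, last));
            seen.addAll(pageRows);
            // Writer "marks" these rows: they vanish from the next query result.
            unprocessed.removeAll(pageRows);
            page++;
        }
        return seen;
    }

    public static void main(String[] args) {
        // 10 rows, page size 3: rows 3, 4, 5 and 9 are skipped entirely.
        System.out.println(readAllPages(10, 3)); // prints [0, 1, 2, 6, 7, 8]
    }
}
```

With 10 rows and a page size of 3, the first page reads rows 0-2 and removes them; the second page then starts at offset 3 of the remaining rows [3..9], so it reads 6-8 and rows 3, 4, 5 are silently skipped; the third page's offset falls past the end and the job finishes, never reading row 9.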
I think marking a row as PROCESSED during the current job run serves no purpose, since the framework already takes care of that (a record doesn't get processed twice within one run).
What you might need instead is to mark a record as PROCESSED for the next job run. That can be handled by updating a separate flag that is not part of the WHERE clause during the job run, and then, at the end of the job, updating the flag that is part of the WHERE clause (the one you use to identify processed records).
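The two-flag design above can be sketched with a small in-memory model (hypothetical field names, plain Java rather than actual Spring Batch wiring): the reader's WHERE predicate filters only on processedPointer, which stays stable for the whole run, the writer sets a separate processedThisRun flag, and an end-of-job step (in a real job, e.g. a final UPDATE statement) promotes processedThisRun into processedPointer:

```java
import java.util.ArrayList;
import java.util.List;

public class TwoFlagDemo {

    static class Row {
        final int id;
        boolean processedPointer;  // part of the reader's WHERE clause
        boolean processedThisRun;  // NOT part of the WHERE clause
        Row(int id) { this.id = id; }
    }

    // Runs one "job": pages over rows where processedPointer is false,
    // marks them via processedThisRun, then promotes the flag at the end.
    static List<Integer> runJob(List<Row> table, int pageSize) {
        // The WHERE predicate doesn't change mid-run, so re-executing the
        // query per page always yields the same candidate list and the
        // page * pageSize offsets stay consistent.
        List<Row> candidates = new ArrayList<>();
        for (Row r : table) {
            if (!r.processedPointer) {
                candidates.add(r);
            }
        }

        List<Integer> seen = new ArrayList<>();
        for (int page = 0; page * pageSize < candidates.size(); page++) {
            int first = page * pageSize;
            int last = Math.min(first + pageSize, candidates.size());
            for (Row r : candidates.subList(first, last)) {
                r.processedThisRun = true; // writer marks; WHERE clause unaffected
                seen.add(r.id);
            }
        }

        // Equivalent of the end-of-job UPDATE: promote the working flag.
        for (Row r : table) {
            if (r.processedThisRun) {
                r.processedPointer = true;
                r.processedThisRun = false;
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        List<Row> table = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            table.add(new Row(i));
        }
        System.out.println(runJob(table, 3)); // prints [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
        System.out.println(runJob(table, 3)); // prints [] - nothing left for the next run
    }
}
```

Every unprocessed row is read exactly once, and a second run finds nothing left, which is exactly the behaviour the shrinking WHERE clause failed to deliver.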
Source: https://stackoverflow.com/questions/46314159/spring-batch-querying-with-state-changes