PostgreSQL - IN clause optimization for more than 3000 values

Submitted by 江枫思渺然 on 2021-02-17 06:04:28

Question


I have an application where the user uploads an Excel file (.xlsx or .csv) with more than 10,000 rows; a single column "partId" contains the values to look up in the database.

I read the Excel values, store them in a list, and pass that list to a Spring Boot JPA repository finder method, which builds the IN clause internally:

    import java.io.ByteArrayInputStream;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.poi.ss.usermodel.*;

    // Read the first column of the active sheet with Apache POI
    List<String> vinList = new ArrayList<>();
    try (InputStream stream = new ByteArrayInputStream(file.getBytes());
         Workbook wb = WorkbookFactory.create(stream)) {
        Sheet sheet = wb.getSheetAt(wb.getActiveSheetIndex());
        for (Row row : sheet) {
            Cell cell = row.getCell(0);
            if (cell != null) {
                vinList.add(cell.getStringCellValue());
            }
        }
    }
                
    // JPA repository method that I used (entity type assumed to be Part);
    // the derived query needs one parameter per criterion in its name
    List<Part> findByPartIdInAndSecondaryId(List<String> partIds, String secondaryId);

Many articles say, and my experience in the case above confirms, that an IN query is inefficient for a huge list of values.

How can I optimize the above scenario or write a new optimized query?

Also, please let me know if there is a more efficient way of reading an Excel file than the code snippet above.

Any help would be much appreciated! Thanks in advance!


Answer 1:


If the list is truly huge, you will never be lightning fast.

I see several options:

  1. Send a query with a large IN list, as you mention in your question.

  2. Construct a statement that is a join with a large VALUES clause:

    SELECT ... FROM mytable
       JOIN (VALUES (42), (101), (43), ...) AS tmp(col)
          ON mytable.id = tmp.col;
    
  3. Create a temporary table with the values and join with it (a Java sketch of this approach follows the list):

    BEGIN;
    CREATE TEMP TABLE tmp(col bigint) ON COMMIT DROP;
    

    Then either

    COPY tmp FROM STDIN; -- if Spring supports COPY
    

    or

    INSERT INTO tmp VALUES (42), (101), (43), ...; -- if not
    

    Then

    ANALYZE tmp;  -- for good statistics
    SELECT ... FROM mytable
       JOIN tmp ON mytable.id = tmp.col;
    COMMIT;  -- drops the temporary table
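
    For illustration, here is a minimal Java sketch of option 3 using Spring's JdbcTemplate. The table and column names (mytable, part_id, tmp_part_ids), the DAO class, and the method name are assumptions for this sketch, not taken from the question:

    import java.util.List;
    import java.util.stream.Collectors;
    import org.springframework.jdbc.core.JdbcTemplate;
    import org.springframework.stereotype.Repository;
    import org.springframework.transaction.annotation.Transactional;

    @Repository
    public class PartLookupDao {

        private final JdbcTemplate jdbcTemplate;

        public PartLookupDao(JdbcTemplate jdbcTemplate) {
            this.jdbcTemplate = jdbcTemplate;
        }

        // Everything must run in one transaction: ON COMMIT DROP ties the
        // temp table's lifetime to it, and all statements must share the
        // same connection, which @Transactional guarantees here.
        @Transactional
        public List<String> findExistingPartIds(List<String> partIds) {
            jdbcTemplate.execute(
                "CREATE TEMP TABLE tmp_part_ids(part_id text) ON COMMIT DROP");

            // Batched single-row INSERTs keep each statement small
            List<Object[]> rows = partIds.stream()
                    .map(id -> new Object[] { id })
                    .collect(Collectors.toList());
            jdbcTemplate.batchUpdate(
                "INSERT INTO tmp_part_ids(part_id) VALUES (?)", rows);

            jdbcTemplate.execute("ANALYZE tmp_part_ids");

            return jdbcTemplate.queryForList(
                "SELECT m.part_id FROM mytable m"
                    + " JOIN tmp_part_ids t ON m.part_id = t.part_id",
                String.class);
        }
    }

    If the PostgreSQL JDBC driver's CopyManager is available, COPY would likely cut the insert time further than the batched INSERTs used here.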
    

Which of these is fastest is best determined by trial and error in your case; I don't think any one of these methods will always beat the others.

Some considerations:

  • Solutions 1. and 2. may result in very large statements, while solution 3. can be split into smaller chunks (see the chunking sketch after this list).

  • Solution 3. will very likely be slower unless the list is truly large.
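
If you stay with plain IN queries (solution 1), a common workaround for the statement-size problem is to split the uploaded list into fixed-size chunks and run one query per chunk. A minimal sketch, reusing vinList from the question's snippet; the chunk size of 1000 is only a starting point, and partRepository with a hypothetical findByPartIdIn method stands in for the question's repository:

    // Issue one IN query per chunk so no single statement gets huge
    List<Part> results = new ArrayList<>();
    int chunkSize = 1000;  // tune by measurement
    for (int i = 0; i < vinList.size(); i += chunkSize) {
        List<String> chunk =
                vinList.subList(i, Math.min(i + chunkSize, vinList.size()));
        results.addAll(partRepository.findByPartIdIn(chunk));
    }

This trades one huge statement for several small ones; whether it beats the temp-table approach again depends on the list size and is worth measuring.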



Source: https://stackoverflow.com/questions/64785993/postgresql-in-clause-optimization-for-more-than-3000-values
